From thomas.bub at thomson.net Fri Dec 1 01:12:14 2006
From: thomas.bub at thomson.net (Bub Thomas)
Date: Fri, 1 Dec 2006 10:12:14 +0100
Subject: [openib-general] Is an umad_close_port a good idea after I disconnect from the SA with osm_vendor_delete ?
Message-ID:

Sasha,
I'm having trouble getting the patch applied.
I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
path, but after running the ofed-install script the sources in
/usr/local/ofed didn't contain that patch anymore.
Can you help me out of the dark and tell me how to build
libvendor.so out of/on the ofed-1.1/SOURCES tree?

Thanks
Thomas

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Monday, November 27, 2006 5:43 PM
> To: Bub Thomas
> Cc: Tziporet Koren; openib-general at openib.org; Erez Cohen
> Subject: Re: [openib-general] Is an umad_close_port a good idea after I
> disconnect from the SA with osm_vendor_delete ?
>
> On 14:13 Mon 27 Nov , Bub Thomas wrote:
> >
> > Sasha,
> > whom to ask to add this to the osm_vendor functions?
>
> Please test this patch:
>
> diff --git a/osm/libvendor/osm_vendor_ibumad.c
> b/osm/libvendor/osm_vendor_ibumad.c
> index e82695f..4205b23 100644
> --- a/osm/libvendor/osm_vendor_ibumad.c
> +++ b/osm/libvendor/osm_vendor_ibumad.c
> @@ -545,10 +545,15 @@ osm_vendor_delete(
>      umad_receiver_t *p_ur;
>      int agent_id;
>
> -    /* unregister UMAD agents */
> -    for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
> -        if ( (*pp_vend)->agents[agent_id] )
> -            umad_unregister( (*pp_vend)->umad_port_id, agent_id );
> +    if ((*pp_vend)->umad_port_id >= 0) {
> +        /* unregister UMAD agents */
> +        for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
> +            if ( (*pp_vend)->agents[agent_id] )
> +                umad_unregister((*pp_vend)->umad_port_id,
> +                                agent_id );
> +        umad_close_port((*pp_vend)->umad_port_id);
> +        (*pp_vend)->umad_port_id = -1;
> +    }
>
>      clear_madw( *pp_vend );
>      /* make sure all ports are closed */
>
>
> > Or should I file a bug for this
>
> Good idea too.
>
> Sasha

From dotanb at dev.mellanox.co.il Fri Dec 1 01:20:52 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Fri, 1 Dec 2006 11:20:52 +0200 (IST)
Subject: [openib-general] QP creation failure
In-Reply-To: <456F7239.4060104@systemfabricworks.com>
References: <456F7239.4060104@systemfabricworks.com>
Message-ID: <2340.85.65.224.142.1164964852.squirrel@dev.mellanox.co.il>

Hi.

> Hi,
> I'm hoping someone here can help me diagnose a this problem.
> I have a really simple test app that uses verbs and is failing to create
> a QP on one machine in particular. On other machines the app works and
> behaves as expected without any problems.
>
> The machine in question is 32bit dual CPU Intel system running FC4 and
> the released OFED 1.1 with a Mellanox PCI-X HCA (MT23108)
> [root at localhost test]# uname -a
> Linux localhost.localdomain 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2
> 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
> [root at localhost test]# cat /usr/local/ofed/BUILD_ID
> OFED-1.1
>
> openib-1.1 (REV=9905)
> # User space
> https://openib.org/svn/gen2/branches/1.1/src/userspace
> Git:
> ref: refs/heads/ofed_1_1
> commit a083ec1174cb4b5a5052ef5de9a8175df82e864a
>
> The code in question is pretty simple and as I've said works everywhere
> else I've tried it.
>
> Errno is set to 22, and I've traced the problem to this point in the
> OFED stack, so I can see where it fails but still have no idea why:
> It fails at line 578 in "src/userspace/libibverbs/src/cmd.c" the
> instruction is
> 'write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size'
> cmd_fd looked valid (was 6), cmd looked to point to a valid structure,
> and cmd_size was 96.
>
> This was called from line 533 of src/userspace/libmthca/src/verbs.c
> 'ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp,
> sizeof resp);'
>
> Which was invoked by my code calling ibv_create_qp as seen below:
>
>
>     /* create the qpairs */
>     init_attr.send_cq = info->cq_hndl;
>     init_attr.recv_cq = info->cq_hndl;
>     init_attr.cap.max_send_wr = info->oust_wr_sq;   //8
>     init_attr.cap.max_recv_wr = info->oust_wr_rq;   //8
>     init_attr.cap.max_send_sge = info->sg_size_sq;  //1
>     init_attr.cap.max_recv_sge = info->sg_size_rq;  //1
>     init_attr.cap.max_inline_data = 1024;
>     init_attr.qp_type = IBV_QPT_RC;
>
>     if ((info->qp_hndl[CLIENT] =
>          ibv_create_qp(info->pd_hndl, &init_attr)) == NULL) {
>
>         info->failed = 1;
>         rc = ERR_INIT_HCA_FAILED;
>
>     }
>
>
>
> Any ideas or pointers in the right direction would be greatly appreciated.

I think that the problem is the amount of inline data that you try to use.
I suggest that you set it to 0, create the QP, check the value that is
returned from the QP creation, and use that.
I believe that the maximum size that can be used in this attribute is ~420.

Dotan

From dotanb at dev.mellanox.co.il Fri Dec 1 01:27:41 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Fri, 1 Dec 2006 11:27:41 +0200 (IST)
Subject: [openib-general] Segmentation fault on ib_read_bw
In-Reply-To:
References:
Message-ID: <2462.85.65.224.142.1164965261.squirrel@dev.mellanox.co.il>

Hi.

> Hi,
>
> Im using the openib gen2 trunk and was running the performance tests
> from that tree.
> I get a "Segmentation Fault" on running ib_read_bw and the remaining
> tests.
> The output is as follows:
> ------------------------------------------------------------------
> RDMA_Read BW Test
> Connection type : RC
> Segmentation fault
>
> Any particular reason why this is happening?

Can you give some more info, such as:

which driver git/svn version are you using?
which parameters did you use on each side?
which distro are you using?
which computer arch are you using?

thanks
Dotan

From or.gerlitz at gmail.com Fri Dec 1 05:36:41 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 1 Dec 2006 15:36:41 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose
In-Reply-To: <1164918691.14800.101.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com>
Message-ID: <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>

On 11/30/06, Ralph Campbell wrote:
> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:
> > So what did you change since v1? How do you deal with fitting 64-bit
> > addresses into an sg list entry that has a 32-bit dma_addr_t?
> The ipath_map_sg() handler for ib_dma_map_sg() doesn't store
> anything in the struct scatterlist.
> The translation is
> done when ipath_sg_dma_address() is called which now
> returns u64 instead of dma_addr_t thus avoiding the truncation
> problem.

And there is this open/TODO of calling kmap(page) at dma mapping time (or
when ipath_sg_dma_address is called) and kunmap(page) at dma unmapping
time, where you must store the kvaddr between the two calls and the sg
does not have room for it when dma_addr_t is u32 and kvaddr is u64 ....

> All of the callers to ib_dma_map_single(), ib_dma_map_page(),
> and ib_sg_dma_address() have been modifed to save the address
> in a u64 instead of a dma_addr_t. This actually wasn't much
> of a change since the address was being cast to u64 anway
> when assigned to struct sge.addr.

It fixes a bug, so it actually is much of a change. Without it, on an
arch as mentioned above, ipath_dma_map_single would return only a u32
portion of the kvaddr, and later the ULP code would place this chopped
address in sge.addr and the ipath driver would use the wrong address.

Or.

From sashak at voltaire.com Fri Dec 1 06:19:01 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 Dec 2006 16:19:01 +0200
Subject: [openib-general] Is an umad_close_port a good idea after I disconnect from the SA with osm_vendor_delete ?
In-Reply-To:
References:
Message-ID: <20061201141901.GC23574@sashak.voltaire.com>

Hi Thomas,

On 10:12 Fri 01 Dec , Bub Thomas wrote:
> Sasha,
> I'm having trouble to get the patch applied.
> I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE
> path but after running the ofed-install script the sources in the
> /usr/local/ofed din't contain that patch anymore.
> Can you help me out of the dark and tell me how to build the
> libvendor.so out of/on the ofed-1.1/SOURCES tree.

Never did it personally, but you may want to look at
https://openib.org/tiki/tiki-index.php?page=OFED+Support
for how ofed_patch.sh does this.

And you can use svn or git versions of management/osm as well.
Sasha > Thanks > Thomas > > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Monday, November 27, 2006 5:43 PM > > To: Bub Thomas > > Cc: Tziporet Koren; openib-general at openib.org; Erez Cohen > > Subject: Re: [openib-general] Is an umad_close_port a good idea after > I > > disconnect from the SA with osm_vendor_delete ? > > > > On 14:13 Mon 27 Nov , Bub Thomas wrote: > > > > > > Sasha, > > > whom to ask to add this to the osm_vendor functions? > > > > Please test this patch: > > > > diff --git a/osm/libvendor/osm_vendor_ibumad.c > > b/osm/libvendor/osm_vendor_ibumad.c > > index e82695f..4205b23 100644 > > --- a/osm/libvendor/osm_vendor_ibumad.c > > +++ b/osm/libvendor/osm_vendor_ibumad.c > > @@ -545,10 +545,15 @@ osm_vendor_delete( > > umad_receiver_t *p_ur; > > int agent_id; > > > > - /* unregister UMAD agents */ > > - for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++) > > - if ( (*pp_vend)->agents[agent_id] ) > > - umad_unregister( (*pp_vend)->umad_port_id, > agent_id ); > > + if ((*pp_vend)->umad_port_id >= 0) { > > + /* unregister UMAD agents */ > > + for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; > agent_id++) > > + if ( (*pp_vend)->agents[agent_id] ) > > + > umad_unregister((*pp_vend)->umad_port_id, > > + agent_id ); > > + umad_close_port((*pp_vend)->umad_port_id); > > + (*pp_vend)->umad_port_id = -1; > > + } > > > > clear_madw( *pp_vend ); > > /* make sure all ports are closed */ > > > > > > > Or should I file a bug for this > > > > Good idea too. > > > > Sasha > > From halr at voltaire.com Fri Dec 1 06:27:16 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 09:27:16 -0500 Subject: [openib-general] Is an umad_close_port a good idea after I disconnect from the SA with osm_vendor_delete ? 
In-Reply-To: <20061201141901.GC23574@sashak.voltaire.com> References: <20061201141901.GC23574@sashak.voltaire.com> Message-ID: <1164983140.11808.177662.camel@hal.voltaire.com> On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote: > Hi Thomas, > > On 10:12 Fri 01 Dec , Bub Thomas wrote: > > Sasha, > > I'm having trouble to get the patch applied. > > I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE > > path but after running the ofed-install script the sources in the > > /usr/local/ofed din't contain that patch anymore. > > Can you help me out of the dark and tell me how to build the > > libvendor.so out of/on the ofed-1.1/SOURCES tree. > > Never did it personally, but you may want to look at > https://openib.org/tiki/tiki-index.php?page=OFED+Support > for how ofed_patch.sh does this. > > And you can use svn or git versions of management/osm as well. There's currently no git version of OFED 1.1 OpenSM AFAIK. -- Hal > Sasha > > > Thanks > > Thomas > > > > > > > -----Original Message----- > > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > > Sent: Monday, November 27, 2006 5:43 PM > > > To: Bub Thomas > > > Cc: Tziporet Koren; openib-general at openib.org; Erez Cohen > > > Subject: Re: [openib-general] Is an umad_close_port a good idea after > > I > > > disconnect from the SA with osm_vendor_delete ? > > > > > > On 14:13 Mon 27 Nov , Bub Thomas wrote: > > > > > > > > Sasha, > > > > whom to ask to add this to the osm_vendor functions? 
> > > > > > Please test this patch: > > > > > > diff --git a/osm/libvendor/osm_vendor_ibumad.c > > > b/osm/libvendor/osm_vendor_ibumad.c > > > index e82695f..4205b23 100644 > > > --- a/osm/libvendor/osm_vendor_ibumad.c > > > +++ b/osm/libvendor/osm_vendor_ibumad.c > > > @@ -545,10 +545,15 @@ osm_vendor_delete( > > > umad_receiver_t *p_ur; > > > int agent_id; > > > > > > - /* unregister UMAD agents */ > > > - for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++) > > > - if ( (*pp_vend)->agents[agent_id] ) > > > - umad_unregister( (*pp_vend)->umad_port_id, > > agent_id ); > > > + if ((*pp_vend)->umad_port_id >= 0) { > > > + /* unregister UMAD agents */ > > > + for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; > > agent_id++) > > > + if ( (*pp_vend)->agents[agent_id] ) > > > + > > umad_unregister((*pp_vend)->umad_port_id, > > > + agent_id ); > > > + umad_close_port((*pp_vend)->umad_port_id); > > > + (*pp_vend)->umad_port_id = -1; > > > + } > > > > > > clear_madw( *pp_vend ); > > > /* make sure all ports are closed */ > > > > > > > > > > Or should I file a bug for this > > > > > > Good idea too. 
> > > > > > Sasha > > > > From swise at opengridcomputing.com Fri Dec 1 06:35:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 01 Dec 2006 08:35:28 -0600 Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <754FC8FE0A97A94B906344259F447D4A0413F81D@ES23SNLNT.srn.sandia.gov> References: <754FC8FE0A97A94B906344259F447D4A0413F811@ES23SNLNT.srn.sandia.gov> <456E991C.4040907@dev.mellanox.co.il> <1164904991.7247.44.camel@trinity.ogc.int> <3EF52E87-47E9-4F0C-AA0D-C2CAA63DFC7C@cisco.com> <936E0840-D941-4BB4-A3C4-CE410D90E0E5@cisco.com> <1164911424.11779.46.camel@stevo-desktop> <1164916697.11779.84.camel@stevo-desktop> <5251E729-5FC0-48B5-9399-0C9466F8A2A2@cisco.com> <1164917426.11779.87.camel@stevo-desktop> <754FC8FE0A97A94B906344259F447D4A0413F81D@ES23SNLNT.srn.sandia.gov> Message-ID: <1164983728.6872.5.camel@stevo-desktop> On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote: > Steve, > > As you know, I have my rnfs kernel running the stable iwarp-stack on > my cluster now. But how do I compile the userspace packages from that > stack? > You build and install the userspace libraries from the iwarp stable branch. This will install all the needed header files to build other packages that depend on them. Like mvapich2-0.9.8, for instance. If rping is working for you, then you've already done this. The user libs and header files are all installed in /usr/local by default. If you have /usr/local/include/rdma/rdma_cma.h, for instance, you've probably already installed the userspace stuff from the iwarp stable branch. To build and install the user libs from the iwarp branch, please see the wiki howto. There is a section describing installing the userspace libraries. https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Hope this helps... Steve. 
From swise at opengridcomputing.com Fri Dec 1 06:40:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 01 Dec 2006 08:40:00 -0600 Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <1164949057.19459.11.camel@localhost> References: <754FC8FE0A97A94B906344259F447D4A0413F811@ES23SNLNT.srn.sandia.gov> <456E991C.4040907@dev.mellanox.co.il> <1164904991.7247.44.camel@trinity.ogc.int> <3EF52E87-47E9-4F0C-AA0D-C2CAA63DFC7C@cisco.com> <936E0840-D941-4BB4-A3C4-CE410D90E0E5@cisco.com> <1164911424.11779.46.camel@stevo-desktop> <1164949057.19459.11.camel@localhost> Message-ID: <1164984000.6872.10.camel@stevo-desktop> On Thu, 2006-11-30 at 20:57 -0800, Matt Leininger wrote: > On Thu, 2006-11-30 at 12:30 -0600, Steve Wise wrote: > > On Thu, 2006-11-30 at 12:12 -0500, Jeff Squyres wrote: > > > It just clicked in my brain as to why you were asking this question. > > > > > > Remember that OMPI currently does not use any CM for OF connections > > > at all. So it's not like it's using the old CM that doesn't support > > > iWARP. OMPI uses its own out-of-band mechanism, which, as I > > > understand it, should work with iWARP just as well as it works for IB. > > > > > > Am I incorrect in thinking that? (I have no iWARP hardware to test > > > with) > > > > iWARP _requires_ the RDMA-CM for connection setup... > > > > So OMPI as it stands today won't work over iwarp devices. > > > > Right now, the only non-uDAPL MPI solution that will work with the iwarp > > stable svn branch + 2.6.17 RNFS is MVAPICH2. > > > > If you utilize uDAPL, then Intel and HP have MPI libs that might work... > > OMPI also has a uDAPL network device (along with a device that uses > verbs directly). So if we just use OMPI uDAPL it should work over > iWarp? > It should. You might have to tweak OMPI slightly to work with uDAPL from the iWARP branch. Or take the latest uDAPL and back-port it to the iwarp branch. Steve. 
From halr at voltaire.com Fri Dec 1 07:18:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 10:18:40 -0500
Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_sm.c: In osm_sm_mcgrp_join, use CL_PLOCK_RELEASE macro
Message-ID: <1164986311.11808.179322.camel@hal.voltaire.com>

OpenSM/osm_sm.c: In osm_sm_mcgrp_join, use CL_PLOCK_RELEASE macro
rather than calling cl_plock_release directly

Signed-off-by: Hal Rosenstock

diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c
index 9aa4a36..100f2a0 100644
--- a/osm/opensm/osm_sm.c
+++ b/osm/opensm/osm_sm.c
@@ -740,7 +740,7 @@ osm_sm_mcgrp_join(
   status = osm_port_add_mgrp( p_port, mlid );
   if( status != IB_SUCCESS )
   {
-    cl_plock_release( p_sm->p_lock );
+    CL_PLOCK_RELEASE( p_sm->p_lock );
     osm_log( p_sm->p_log, OSM_LOG_ERROR,
              "osm_sm_mcgrp_join: ERR 2E03: "
              "Unable to associate port 0x%" PRIx64 " to mlid 0x%X\n",

From sashak at voltaire.com Fri Dec 1 07:30:55 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 Dec 2006 17:30:55 +0200
Subject: [openib-general] Is an umad_close_port a good idea after I disconnect from the SA with osm_vendor_delete ?
In-Reply-To: <1164983140.11808.177662.camel@hal.voltaire.com>
References: <20061201141901.GC23574@sashak.voltaire.com> <1164983140.11808.177662.camel@hal.voltaire.com>
Message-ID: <20061201153055.GE23574@sashak.voltaire.com>

On 09:27 Fri 01 Dec , Hal Rosenstock wrote:
> On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> > Hi Thomas,
> >
> > On 10:12 Fri 01 Dec , Bub Thomas wrote:
> > > Sasha,
> > > I'm having trouble to get the patch applied.
> > > I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE
> > > path but after running the ofed-install script the sources in the
> > > /usr/local/ofed din't contain that patch anymore.
> > > Can you help me out of the dark and tell me how to build the
> > > libvendor.so out of/on the ofed-1.1/SOURCES tree.
> > > > Never did it personally, but you may want to look at > > https://openib.org/tiki/tiki-index.php?page=OFED+Support > > for how ofed_patch.sh does this. > > > > And you can use svn or git versions of management/osm as well. > > There's currently no git version of OFED 1.1 OpenSM AFAIK. What about 1.1 git branch? This is same as SVN's 1.1. :) Sasha From Arkady.Kanevsky at netapp.com Fri Dec 1 07:29:41 2006 From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady) Date: Fri, 1 Dec 2006 10:29:41 -0500 Subject: [openib-general] [openfabrics-iwg] OFED 1.2 contents and schedule as proposed by the EWG Message-ID: What about iWARP support? Arkady Kanevsky email: arkady at netapp.com Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 > -----Original Message----- > From: Bill Boas [mailto:bboas at systemfabricworks.com] > Sent: Thursday, November 30, 2006 2:06 PM > To: 'OPENIB'; openib-promoters at openib.org; > openfabrics-iwg at openfabrics.org > Cc: 'Tziporet Koren'; 'Jeff Squyres'; 'EWG' > Subject: [openfabrics-iwg] OFED 1.2 contents and schedule as > proposed by the EWG > > Following the Developer Summit discussions in Tampa the EWG > is proposing the contents and schedule for OFED 1.2 as > described on their wiki > > https://openib.org/tiki/tiki-index.php?page=OFED+release+procedure > > Many members of the OpenFabrics Board could not be present at > the summit and many members of the OpenFabrics community were > also not present. > > Also the IWG is planning for its next Interoperability Test > Event after which it is probable that the OpenFabrics Logo > program should be in effect. 
> > Please review this proposal from the EWG carefully to ensure that if:- > > 1) you represent your company in the OpenFabrics community > that your company's product needs in the spring and early > summer of 2007 will be met by OFED 1.2 as proposed; > > 2) you are a customer or end user that may wish to deploy > OFED 1.2 after its release and distribution that it looks > like it will contain what you need for your installations by then; > > 3) you are working for a Linux distribution then the > schedule, process and testing planned by the EWG and the IWG > meet your requirements and schedule; > > 4) your interests do not align with the 3 identified above > but you are also planning to use OFED 1.2 please speak up and > give the community feedback. > > Any other feedback or comments are welcome. > > In my role in the Alliance I'd like to thank Tziporet, Jeff, > Nimrod, Aviram, Bob, Hal, Sean, Tom, Or, Betsy, Roland, > (please forgive me if I left out your name)and everyone who > has been working in the EWG for their tremendous individual > contributions to the Alliance and kernel software. > > Bill Boas > VP, Business Development | System Fabric Works > bboas at systemfabricworks.com | 510-375-8840 > > > -----Original Message----- > From: openfabrics-ewg-bounces at openib.org > [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of > Tziporet Koren > Sent: Thursday, November 30, 2006 6:06 AM > To: EWG > Cc: OPENIB > Subject: [openfabrics-ewg] reminder: OFED 1.2 meeting next Monday > > Hi All, > I wish to remind all that we have the EWG meeting on Monday > 4-Dec at 9am-10am. > Jeff already sent all details. > > Agenda: close OFED 1.2 features after each owner approve that > the schedule can be met (meaning code complete on end of January) > > See also > https://openib.org/tiki/tiki-index.php?page=OFED+release+proce dure for details on the features. 
> > Tziporet > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > _______________________________________________ > openfabrics-iwg mailing list > openfabrics-iwg at openfabrics.org > https://openfabrics.org/mailman/listinfo/openfabrics-iwg > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From adit.262 at gmail.com Fri Dec 1 07:35:38 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Fri, 1 Dec 2006 10:35:38 -0500 Subject: [openib-general] Segmentation fault on ib_read_bw In-Reply-To: <2462.85.65.224.142.1164965261.squirrel@dev.mellanox.co.il> References: <2462.85.65.224.142.1164965261.squirrel@dev.mellanox.co.il> Message-ID: I managed to get the test working .. I just restarted the server and it was working.. Im actually doing some work with the Xen VMM and Infiniband.. I have setup 2 servers (Pentium D - x86_64 arch) with red hat enterprise linux 4 and Xen 3 VMM running .. The IB driver seems to be working on dom0 in any case and I can do all of the perf tests. I wanted to know if there was any correlation between the QoS setup done using openSM and the perf tests i.e. if I configure QoS in opensm.opts should I be seeing marked differences in the BW from the perf tests? Is there any kind of documentation that gives an idea how the BW can change for diff QoS params? Regards, Adit On 12/1/06, dotanb at dev.mellanox.co.il wrote: > Hi. > > > Hi, > > > > Im using the openib gen2 trunk and was running the performance tests > > from that tree. > > I get a "Segmentation Fault" on running ib_read_bw and the remaining > > tests. > > The output is as follows: > > ------------------------------------------------------------------ > > RDMA_Read BW Test > > Connection type : RC > > Segmentation fault > > > > Any particular reason why this is happening? 
> > Can you give some more info, such as: > > which driver git/svn version are you using? > which parameters did you use in each side? > which distro are you using? > which computer arch are you using? > > thanks > Dotan > > -- Adit Ranadive Freshman, Georgia Institute of Technology, Atlanta, GA From halr at voltaire.com Fri Dec 1 07:41:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 10:41:53 -0500 Subject: [openib-general] Is an umad_close_port a good idea after I disconnect from the SA with osm_vendor_delete ? In-Reply-To: <20061201153055.GE23574@sashak.voltaire.com> References: <20061201141901.GC23574@sashak.voltaire.com> <1164983140.11808.177662.camel@hal.voltaire.com> <20061201153055.GE23574@sashak.voltaire.com> Message-ID: <1164987703.11808.180039.camel@hal.voltaire.com> On Fri, 2006-12-01 at 10:30, Sasha Khapyorsky wrote: > On 09:27 Fri 01 Dec , Hal Rosenstock wrote: > > On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote: > > > Hi Thomas, > > > > > > On 10:12 Fri 01 Dec , Bub Thomas wrote: > > > > Sasha, > > > > I'm having trouble to get the patch applied. > > > > I patched the source file in the ofed-1.1 distribution tgz'ed the SOURCE > > > > path but after running the ofed-install script the sources in the > > > > /usr/local/ofed din't contain that patch anymore. > > > > Can you help me out of the dark and tell me how to build the > > > > libvendor.so out of/on the ofed-1.1/SOURCES tree. > > > > > > Never did it personally, but you may want to look at > > > https://openib.org/tiki/tiki-index.php?page=OFED+Support > > > for how ofed_patch.sh does this. > > > > > > And you can use svn or git versions of management/osm as well. > > > > There's currently no git version of OFED 1.1 OpenSM AFAIK. > > What about 1.1 git branch? This is same as SVN's 1.1. :) I sit corrected... 
-- Hal

> Sasha

From halr at voltaire.com Fri Dec 1 08:18:05 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 11:18:05 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611913E25B@EPEXCH2.qlogic.org>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E25B@EPEXCH2.qlogic.org>
Message-ID: <1164989866.11808.181175.camel@hal.voltaire.com>

On Thu, 2006-11-30 at 17:41, Todd Rimmer wrote:
> > From: Roland Dreier [mailto:rdreier at cisco.com]
> > Sent: Thursday, November 30, 2006 5:32 PM
> > To: Todd Rimmer
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
> >
> > > Proposed solution:
> > > - add an IPoIB configuration parameter. This parameter could redirect
> > > the Solicited Node Multicast traffic to the IPv6 All Nodes multicast
> > > address (IB GID 0xff01601B.....0000001)
> >
> > This is silly however. For one thing you are now not following the
> > RFC, and compliant IPv6 over IPoIB stacks will send neighbour
> > discovery messages to the solicited node address, so they won't be
> > received since the node didn't join.
> >
> > There's no requirement that a SM assign a unique MLID to each
> > multicast group. The obvious solution to the problem is simply that
> > the SM reuse MLIDs for solicited node multicast groups, perhaps even
> > collapsing all of them down to 1 MLID.
> >
>
> I think its worth discussing a number of alternatives. I'm not sure
> there is an ideal solution here.
>
> Doesn't an SM based solution produce other complications?
> - Such as the SM/SA must maintain an extremely large list of Multicast
> Member records (potentially N^2).

Certainly O(N) groups where N is the number of IPv6 hosts (and each
group is 1 or more MCMs).

> - Host nodes will be joining N multicast groups and maintaining
> membership in them (potentially further stressing the SA, etc)

Do all IPv6 nodes join all the solicited node groups ?
I don't see this occurring (so far) on the subnets I have seen.

> Not to mention that the SM would then need to know about IPoIB GID
> addressing conventions (which seems like a violation of network layers,
> etc).

There's already the IPv6 signature as part of the MGID to help with this
layering violation. Some SMs already do things with this.

-- Hal

> Todd Rimmer
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From halr at voltaire.com Fri Dec 1 08:20:15 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 11:20:15 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061130230136.GB32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org> <1164925747.11808.144971.camel@hal.voltaire.com> <20061130230136.GB32366@obsidianresearch.com>
Message-ID: <1164989940.11808.181179.camel@hal.voltaire.com>

On Thu, 2006-11-30 at 18:01, Jason Gunthorpe wrote:
> On Thu, Nov 30, 2006 at 05:29:16PM -0500, Hal Rosenstock wrote:
>
> > > IPV6 defines that each node will have a Solicited Node Multicast
> > > address. This address is unique per node and is constructed from the
> > > IPV6 unicast address of the node. (see RFC 2373 for more details).
> > >
> > > IP over IB defines that IPV6 multicast addresses map to IB multicast
> > > GIDs in a one to one manner.
> > >
> > > IB defines a multicast address space limit of 4095 LIDs.
> >
> > actually it is 16K-1
>
> For IPv6 only the lower 24 bits of each assigned IPv6 address are
> used to construct a solicited node multicast in the range
> FF02::1:FF00:0/104. The Solicited Node Multicast address it not
> expected to be uniquely subscribed.

Any idea on how many would subscribe ? What does this depend on ?

> > MGIDs are different from MLIDs.
> > Multiple MGIDs can be mapped onto a
> > single MLID if the characteristics are the same. Is that the case for
> > the IPv6 groups ?
>
> The solicited node multicast feature is intended for scalability by
> having the switching core prune ND queries. It is OK if the multicast
> goes to more nodes than subscribe to it (this happens on cheap
> ethernet switch gear without multicast support anyhow).

And a similar thing is accommodated within IB. With limited MFT space,
the collapse of multiple (similar) MGRPs (MGIDs) onto a single MLID
seems important (and reduces some of the scalability issues Todd
mentioned in terms of IPv6).

> I think the thing to do here is for the SM to have an option to
> compress a particular MGID range (using a hash of some kind). Ie
> configure so that all of IPv6 FF02::1:FF00:0/104 will use at most 16
> MLIDs.

Yes, that is one strategy which seems reasonable to me.

> That way the site can select that some MGID's get mapped directly to
> MLIDs and others get shared to save LID space.
>
> Then if you still run out it can randomly combine MGIDs into MLIDs.

Yes, that's another wrinkle.

-- Hal

> Jason

From robert.j.woodruff at intel.com Fri Dec 1 09:04:42 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 1 Dec 2006 09:04:42 -0800
Subject: [openib-general] openMPI for 2.6.17.10 kernel
Message-ID:

Matt wrote,
> OMPI also has a uDAPL network device (along with a device that uses
>verbs directly). So if we just use OMPI uDAPL it should work over
>iWarp?
> - Matt

This should just work. (famous last words).

For OFED 1.2, since the iWarp support will be in the base kernel (2.6.19),
it should be easier to test to make sure that uDAPL works both over IB
and iWarp as expected. Once this is tested and any issues are fixed,
Intel MPI, HPMPI, and OMPI (if it has a uDAPL driver) should all
work over iWarp in addition to IB.
woody From bos at pathscale.com Fri Dec 1 09:13:11 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 01 Dec 2006 09:13:11 -0800 Subject: [openib-general] [PATCH 0 of 2] Add memcpy_cachebypass, a memcpy that doesn't cache reads In-Reply-To: <20061130213820.5ed22d81.akpm@osdl.org> References: <20061130213820.5ed22d81.akpm@osdl.org> Message-ID: <457062A7.20504@pathscale.com> Andrew Morton wrote: > The name memcpy_cachebypass() doesn't tell us whether it bypasses caching > on the source, the dest or both. It'd be nice if it did. > Yep, I'll fix that and resubmit. <456E991C.4040907@dev.mellanox.co.il> <1164904991.7247.44.camel@trinity.ogc.int> <3EF52E87-47E9-4F0C-AA0D-C2CAA63DFC7C@cisco.com> <936E0840-D941-4BB4-A3C4-CE410D90E0E5@cisco.com> <1164911424.11779.46.camel@stevo-desktop> <1164916697.11779.84.camel@stevo-desktop> <5251E729-5FC0-48B5-9399-0C9466F8A2A2@cisco.com> <1164917426.11779.87.camel@stevo-desktop> <754FC8FE0A97A94B906344259F447D4A0413F81D@ES23SNLNT.srn.sandia.gov> <1164983728.6872.5.camel@stevo-desktop> Message-ID: <754FC8FE0A97A94B906344259F447D4A0413F825@ES23SNLNT.srn.sandia.gov> Thanks, Helen ________________________________ From: Steve Wise [mailto:swise at opengridcomputing.com] Sent: Fri 12/1/2006 7:35 AM To: Chen, Helen Y Cc: Jeff Squyres; openib-general at openib.org; Leininger, Matthew L Subject: RE: [openib-general] openMPI for 2.6.17.10 kernel On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote: > Steve, > > As you know, I have my rnfs kernel running the stable iwarp-stack on > my cluster now. But how do I compile the userspace packages from that > stack? > You build and install the userspace libraries from the iwarp stable branch. This will install all the needed header files to build other packages that depend on them. Like mvapich2-0.9.8, for instance. If rping is working for you, then you've already done this. The user libs and header files are all installed in /usr/local by default. 
If you have /usr/local/include/rdma/rdma_cma.h, for instance, you've probably already installed the userspace stuff from the iwarp stable branch. To build and install the user libs from the iwarp branch, please see the wiki howto. There is a section describing installing the userspace libraries. https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Hope this helps... Steve. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jgunthorpe at obsidianresearch.com Fri Dec 1 10:37:17 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 1 Dec 2006 11:37:17 -0700 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <1164989940.11808.181179.camel@hal.voltaire.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org> <1164925747.11808.144971.camel@hal.voltaire.com> <20061130230136.GB32366@obsidianresearch.com> <1164989940.11808.181179.camel@hal.voltaire.com> Message-ID: <20061201183717.GC32366@obsidianresearch.com> On Fri, Dec 01, 2006 at 11:20:15AM -0500, Hal Rosenstock wrote: > > For IPv6 only the lower 24 bits of each assigned IPv6 address are > > used to construct a solicited node multicast in the range > > FF02::1:FF00:0/104. The Solicited Node Multicast address it not > > expected to be uniquely subscribed. > > Any idea on how many would subscribe ? What does this depend on ? Each node subscribes to a SNM on an interface for each IPv6 address on that interface. In most cases that should mean 1 subscription per interface, but more is possible.. Generally IPv6 addresses should be constructed based on the EUI64 of the IB interface. In this case the lower 24 bits of the SNM will be the lower 24 bits of the EUI64. Thus in many cases the SNMs will be cluster-unique.. Here is another thought.. Is there anything in the spec that says a MGID must map to a MLID? 
If there is a single subscription why not just do away with the MLID and return a unicast LID of the only subscriber? That would probably solve 90% of the IPv6 issue Todd pointed out. MGID compression would take care of the rest.. Jason From halr at voltaire.com Fri Dec 1 10:53:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 13:53:45 -0500 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <20061201183717.GC32366@obsidianresearch.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org> <1164925747.11808.144971.camel@hal.voltaire.com> <20061130230136.GB32366@obsidianresearch.com> <1164989940.11808.181179.camel@hal.voltaire.com> <20061201183717.GC32366@obsidianresearch.com> Message-ID: <1164999211.11808.186439.camel@hal.voltaire.com> On Fri, 2006-12-01 at 13:37, Jason Gunthorpe wrote: > On Fri, Dec 01, 2006 at 11:20:15AM -0500, Hal Rosenstock wrote: > > > For IPv6 only the lower 24 bits of each assigned IPv6 address are > > > used to construct a solicited node multicast in the range > > > FF02::1:FF00:0/104. The Solicited Node Multicast address it not > > > expected to be uniquely subscribed. > > > > Any idea on how many would subscribe ? What does this depend on ? > > Each node subscribes to a SNM on an interface for each IPv6 address on > that interface. In most cases that should mean 1 subscription per > interface, but more is possible.. > Generally IPv6 addresses should be constructed based on the EUI64 of > the IB interface. In this case the lower 24 bits of the SNM will be > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be > cluster-unique.. It seems to depend on the low 24 bits of the IPv6 addresses in the subnet being the same (as to whether there is more than 1 member of these groups). > Here is another thought.. Is there anything in the spec that says a > MGID must map to a MLID? Yes. 
Here's the first one:
p.149 line 3-8
The multicast LID range is a flat identifier space defined as 0xC000 to
0xFFFE. The DLID for any packet which contains a multicast GID shall be
within the above specified multicast LID range.

I'm sure there are others in the spec if I looked further...

> If there is a single subscription why not
> just do away with the MLID and return a unicast LID of the only
> subscriber?

The current spec requirements :-( But this is an interesting idea and
may warrant further consideration.

-- Hal

> That would probably solve 90% of the IPv6 issue Todd
> pointed out. MGID compression would take care of the rest..
>
> Jason
>

From jgunthorpe at obsidianresearch.com Fri Dec 1 11:24:12 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 12:24:12 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1164999211.11808.186439.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org> <1164925747.11808.144971.camel@hal.voltaire.com> <20061130230136.GB32366@obsidianresearch.com> <1164989940.11808.181179.camel@hal.voltaire.com> <20061201183717.GC32366@obsidianresearch.com> <1164999211.11808.186439.camel@hal.voltaire.com>
Message-ID: <20061201192412.GD32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 01:53:45PM -0500, Hal Rosenstock wrote:

> > Generally IPv6 addresses should be constructed based on the EUI64 of
> > the IB interface. In this case the lower 24 bits of the SNM will be
> > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> > cluster-unique..
>
> It seems to depend on the low 24 bits of the IPv6 addresses in the
> subnet being the same (as to whether there is more than 1 member of
> these groups).

Correct. It is common practice for all IPv6 addresses to have the
lower 64 bits be the EUI64 of the interface. The administrator can
assign a different address, but that could be discouraged for
scalability reasons.
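
The address arithmetic in this exchange can be sketched roughly as
below. This is only an illustration, not code from OpenSM or IPoIB:
the helper names solicited_node and compress_to_mlid are hypothetical,
the sample addresses are made up, and the plain modulo stands in for
the "hash of some kind" suggested earlier in the thread. The base MLID
0xC000 and the 16-bucket limit are likewise assumptions chosen to match
the numbers mentioned in the discussion.

```python
import ipaddress

# Solicited-node multicast prefix discussed in the thread: FF02::1:FF00:0/104.
SNM_PREFIX = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
LOW24_MASK = 0xFFFFFF


def solicited_node(addr):
    """Return the solicited-node multicast address for a unicast IPv6 address.

    Only the low 24 bits of the unicast address are carried into the group.
    """
    low24 = int(ipaddress.IPv6Address(addr)) & LOW24_MASK
    return ipaddress.IPv6Address(SNM_PREFIX | low24)


def compress_to_mlid(group, base_mlid=0xC000, buckets=16):
    """Hypothetical compression: fold a solicited-node group onto one of
    `buckets` MLIDs starting at `base_mlid` (assumed values, see above)."""
    low24 = int(group) & LOW24_MASK
    return base_mlid + (low24 % buckets)


a = solicited_node("fe80::202:c903:12:3456")
b = solicited_node("2001:db8::aa:bb12:3456")  # different host, same low 24 bits
c = solicited_node("fe80::202:c903:12:3457")

print(a)       # ff02::1:ff12:3456
print(a == b)  # True: both addresses land in the same group
print(hex(compress_to_mlid(a)), hex(compress_to_mlid(c)))  # 0xc006 0xc007
```

The second address shows the case Hal raises: two interfaces whose
addresses share the low 24 bits end up as members of one group, so a
group is only single-member when those bits are unique in the subnet.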
> > Here is another thought.. Is there anything in the spec that says a > > MGID must map to a MLID? > > Yes. Here's the first one: > p.149 line 3-8 Hmm. Thats a shame. It is a conformance statment too :< At least the accepetance statements in C9 page 279+ don't specify to check that a MGID is matched with a MLID so at least it should work with current hardware. Jason From halr at voltaire.com Fri Dec 1 11:28:55 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 14:28:55 -0500 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <20061201192412.GD32366@obsidianresearch.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org> <1164925747.11808.144971.camel@hal.voltaire.com> <20061130230136.GB32366@obsidianresearch.com> <1164989940.11808.181179.camel@hal.voltaire.com> <20061201183717.GC32366@obsidianresearch.com> <1164999211.11808.186439.camel@hal.voltaire.com> <20061201192412.GD32366@obsidianresearch.com> Message-ID: <1165001303.11808.187609.camel@hal.voltaire.com> On Fri, 2006-12-01 at 14:24, Jason Gunthorpe wrote: > On Fri, Dec 01, 2006 at 01:53:45PM -0500, Hal Rosenstock wrote: > > > Generally IPv6 addresses should be constructed based on the EUI64 of > > > the IB interface. In this case the lower 24 bits of the SNM will be > > > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be > > > cluster-unique.. > > > > It seems to depend on the low 24 bits of the IPv6 addresses in the > > subnet being the same (as to whether there is more than 1 member of > > these groups). > > Correct. It is common practice for all IPv6 addresses to have the > lower 64 bits be the EUI64 of the interface. The administrator can > assign a different address, but that could be discouraged for > scalability reasoons. > > > > Here is another thought.. Is there anything in the spec that says a > > > MGID must map to a MLID? > > > > Yes. Here's the first one: > > p.149 line 3-8 > > Hmm. Thats a shame. 
I think there are other issues with this and haven't thought about it enough. What happens if a second node joins that group (as the low 24 bits match) ? How would the LID be revoked and changed to an MLID ? There's more spec checking to do here... > It is a conformance statment too :< At least the > accepetance statements in C9 page 279+ don't specify to check that a > MGID is matched with a MLID I would say that's a hole in the spec right now... > so at least it should work with current > hardware. I would use the word might rather than should in that last sentence. -- Hal > Jason From todd.rimmer at qlogic.com Fri Dec 1 11:42:09 2006 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Fri, 1 Dec 2006 13:42:09 -0600 Subject: [openib-general] IPv6 and IPoIB scalability issue Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> > From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com] > Sent: Friday, December 01, 2006 1:37 PM > To: Hal Rosenstock > Cc: Todd Rimmer; openib-general at openib.org > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue > > > Here is another thought.. Is there anything in the spec that says a > MGID must map to a MLID? If there is a single subscription why not > just do away with the MLID and return a unicast LID of the only > subscriber? That would probably solve 90% of the IPv6 issue Todd > pointed out. MGID compression would take care of the rest.. > Summary of alternatives and trade-offs. Lets assume a 2000 node cluster for analysis. 
Option 1 use ALL Nodes Multicast
    Non standard for IPoIB
    small change to IPoIB code only
    Works with all existing SMs
    total of 5 MGIDs in cluster
    5 Multicast subscriptions per node
    total of 10,000 multicast member records in SA for fabric

Option 2 compress MGID to MLID mapping
    Standard for IPoIB
    modification of SMs required, significant change
    configuration of MGID space in SM to consider for compression may be
    required
    total of 2005 MGIDs in cluster
    up to 2005 multicast subscriptions per node (sender only for Solicited
    Node initiators)
    total of 2000*2005 (4,010,000) multicast member records in SA for fabric

Option 3 compress MGID to MLID mapping, use Unicast for Solicited Node
MGIDs
    Standard for IPoIB
    not clear if standard for IB
    modification of SMs required, significant change
    configuration of MGID space in SM to consider for compression may be
    required
    configuration of MGID space in SM to use for unicast may be required
    total of 2005 MGIDs in cluster
    up to 2005 multicast subscriptions per node (sender only for Solicited
    Node initiators)
    total of 2000*2005 (4,010,000) multicast member records in SA for fabric

Hence thus far, option 2 is most standard, option 3 may be standard,
option 1 has best scalability for SM.

It seems worth while to implement option 1 (which should be approx 10-20
lines of code in IPoIB) and continue to pursue option 2 and 3 as SM
features. Then customers can choose which option works best for them.
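
As a rough check on the record counts above, the arithmetic can be
written out as follows. This is a sketch only: the variable names are
made up for illustration, and it assumes the 2005 MGIDs break down as
one solicited-node group per node plus the 5 groups every node joins,
which the summary implies but does not state outright.

```python
NODES = 2000          # cluster size assumed in the summary
COMMON_GROUPS = 5     # groups every node joins (the "5 MGIDs" of option 1)


def member_records(nodes, subscriptions_per_node):
    """Total multicast member records the SA holds for the fabric."""
    return nodes * subscriptions_per_node


# Option 1: every node joins only the common groups.
option1_records = member_records(NODES, COMMON_GROUPS)

# Options 2 and 3, worst case: one solicited-node group per node plus the
# common groups, with every node subscribed (send-only) to all of them.
total_mgids = NODES + COMMON_GROUPS
option2_records = member_records(NODES, total_mgids)

print(option1_records)  # 10000
print(total_mgids)      # 2005
print(option2_records)  # 4010000
```

The gap between 10,000 and 4,010,000 is the worst case only; it is
reached when every node has done neighbour discovery for every other
node and kept the send-only joins.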
Todd Rimmer From jgunthorpe at obsidianresearch.com Fri Dec 1 11:46:21 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 1 Dec 2006 12:46:21 -0700 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <1165001303.11808.187609.camel@hal.voltaire.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org> <1164925747.11808.144971.camel@hal.voltaire.com> <20061130230136.GB32366@obsidianresearch.com> <1164989940.11808.181179.camel@hal.voltaire.com> <20061201183717.GC32366@obsidianresearch.com> <1164999211.11808.186439.camel@hal.voltaire.com> <20061201192412.GD32366@obsidianresearch.com> <1165001303.11808.187609.camel@hal.voltaire.com> Message-ID: <20061201194621.GE32366@obsidianresearch.com> On Fri, Dec 01, 2006 at 02:28:55PM -0500, Hal Rosenstock wrote: > I think there are other issues with this and haven't thought about it > enough. What happens if a second node joins that group (as the low 24 > bits match) ? How would the LID be revoked and changed to an MLID ? > There's more spec checking to do here... Oh, right, yeah revoking is pretty serious! Oh well. Jason From halr at voltaire.com Fri Dec 1 12:07:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 15:07:23 -0500 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> Message-ID: <1165003608.11808.188882.camel@hal.voltaire.com> On Fri, 2006-12-01 at 14:42, Todd Rimmer wrote: > > From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com] > > Sent: Friday, December 01, 2006 1:37 PM > > To: Hal Rosenstock > > Cc: Todd Rimmer; openib-general at openib.org > > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue > > > > > > Here is another thought.. Is there anything in the spec that says a > > MGID must map to a MLID? 
If there is a single subscription why not > > just do away with the MLID and return a unicast LID of the only > > subscriber? That would probably solve 90% of the IPv6 issue Todd > > pointed out. MGID compression would take care of the rest.. > > > > Summary of alternatives and trade-offs. Lets assume a 2000 node cluster > for analysis. > > Option 1 use ALL Nodes Multicast > Non standard for IPoIB > small change to IPoIB code only > Works with all existing SMs > total of 5 MGIDs in cluster > 5 Multicast subscriptions per node > total of 10,000 multicast member records in SA for fabric IMO if you want to go down this direction, the place to discuss it is on the ipoib IETF mailing list. It is still active although dormant or very sleepy. > Option 2 compress MGID to MLID mapping > Standard for IPoIB > modification of SMs required, significant change Significant in what respect ? The code changes are reasonably simple I think. Is it from the perspective of upgrading SMs in the field for this ? I think it is a feature for better IPv6 support. > configuration of MGID space in SM to consider for compression may be > required > total of 2005 MGIDs in cluster > up to 2005 multicast subscriptions per node (sender only for Solicited > Node initiators) Does the node subscribe to every IPv6 SN group ? > total of 2000*2005 (4,010,000) multicast member records in SA for fabric This is based on the above (which I'm not sure about) and is the worst theoretical case, not the practical case. > Option 3 compress MGID to MLID mapping, use Unicast for Solicited Node > MGIDs > Standard for IPoIB > not clear if standard for IB More issues than this > modification of SMs required, significant change At first glance, there are more issues here than option 2 in terms of SM (and client operation). 
> configuration of MGID space in SM to consider for compression may be > required > configuration of MGID space in SM to use for unicast may be required > total of 2005 MGIDs in cluster > up to 2005 multicast subscriptions per node (sender only for Solicited > Node initiators) > total of 2000*2005 (4,010,000) multicast member records in SA for fabric > > Hence thus far, option 2 is most standard, option 3 may be standard, > option 1 has best scalability for SM. > > It seems worth while to implement option 1 (which should be approx 10-20 > lines of code in IPoIB) and continue to pursue option 2 and 3 as SM > features. Then customers can choose which option works best for them. I think before pursuing option 1 there needs to be a discussion with the IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu). -- Hal > Todd Rimmer From halr at voltaire.com Fri Dec 1 12:32:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 15:32:09 -0500 Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition Message-ID: <1165005117.11808.189660.camel@hal.voltaire.com> OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition Signed-off-by: Sasha Khapyorsky Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index f7f879b..d6c6968 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -1459,6 +1459,8 @@ __osm_mcmr_rcv_leave_mgrp( new_join_state | (p_mcm_port->scope_state & 0xf0); mcmember_rec.scope_state = p_mcm_port->scope_state; + + CL_PLOCK_RELEASE( p_rcv->p_lock ); } else { @@ -1475,10 +1477,6 @@ __osm_mcmr_rcv_leave_mgrp( "__osm_mcmr_rcv_leave_mgrp: ERR 1B09: " "osm_sm_mcgrp_leave failed\n" ); } - - CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock); - /* Note: The deletion of the mgrp itself will be done in the callback - for the multicast tree updating 
(osm_mcast_mgr_process_mgrp_cb) */ } } else @@ -1511,8 +1509,6 @@ __osm_mcmr_rcv_leave_mgrp( goto Exit; } - CL_PLOCK_RELEASE( p_rcv->p_lock ); - /* Send an SA response */ __osm_mcmr_rcv_respond( p_rcv, p_madw, &mcmember_rec ); From ralph.campbel at qlogic.com Fri Dec 1 12:39:16 2006 From: ralph.campbel at qlogic.com (Ralph Campbell) Date: Fri, 1 Dec 2006 12:39:16 -0800 (PST) Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> Message-ID: <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> > On 11/30/06, Ralph Campbell wrote: >> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote: >> > So what did you change since v1? How do you deal with fitting 64-bit >> > addresses into an sg list entry that has a 32-bit dma_addr_t? > >> The ipath_map_sg() handler for ib_dma_map_sg() doesn't store >> anything in the struct scatterlist. The translation is >> done when ipath_sg_dma_address() is called which now >> returns u64 instead of dma_addr_t thus avoiding the truncation >> problem. > > And there is this open/TODO of calling kmap(page) on dma mapping time > (or when ipath_sg_dma_address is called) and kunmap(page) on dma > unmapping time, where you must store the kvaddr between the two calls > and the sg does not have a room for it where dma_addr_t is u32 and > kvaddr is u64 .... Although the driver compiles on 32-bit kernels, it is unsupported and never been tested. All known 64-bit systems don't define CONFIG_HIGHMEM. In spite of previous emails suggesting that page_address() can return NULL without CONFIG_HIGHMEM defined, the code in include/linux/mm.h doesn't allow it (assuming the page pointer is valid and not some random address). 
I verified this with Andrew Morton. I don't see value in adding
code which will be unsupported and untested.

>> All of the callers to ib_dma_map_single(), ib_dma_map_page(),
>> and ib_sg_dma_address() have been modified to save the address
>> in a u64 instead of a dma_addr_t. This actually wasn't much
>> of a change since the address was being cast to u64 anyway
>> when assigned to struct sge.addr.
>
> It fixes a bug, so it is actually quite a change. Without it
> on arch as mentioned above, ipath_dma_map_single would return only a
> u32 portion of the kvaddr and later the ulp code would place this
> chopped address in sge.addr and the ipath driver would use the wrong
> address.
>
> Or.

I only meant that the change was minor compared to the previous
patches sent. Of course, fixing a bug is important and not minor.

From elsen_david at yahoo.com Fri Dec 1 12:50:15 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 12:50:15 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <754FC8FE0A97A94B906344259F447D4A0413F825@ES23SNLNT.srn.sandia.gov>
Message-ID: <248325.81711.qm@web58001.mail.re3.yahoo.com>

Steve,

Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP
stable branch?

I do not get some of library (librdmacm) gets created to be used by
mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel.

David

"Chen, Helen Y" wrote:

Thanks,
Helen

---------------------------------
From: Steve Wise [mailto:swise at opengridcomputing.com]
Sent: Fri 12/1/2006 7:35 AM
To: Chen, Helen Y
Cc: Jeff Squyres; openib-general at openib.org; Leininger, Matthew L
Subject: RE: [openib-general] openMPI for 2.6.17.10 kernel

On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote:
> Steve,
>
> As you know, I have my rnfs kernel running the stable iwarp-stack on
> my cluster now. But how do I compile the userspace packages from that
> stack?
>

You build and install the userspace libraries from the iwarp stable branch.
This will install all the needed header files to build other packages that depend on them. Like mvapich2-0.9.8, for instance. If rping is working for you, then you've already done this. The user libs and header files are all installed in /usr/local by default. If you have /usr/local/include/rdma/rdma_cma.h, for instance, you've probably already installed the userspace stuff from the iwarp stable branch. To build and install the user libs from the iwarp branch, please see the wiki howto. There is a section describing installing the userspace libraries. https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Hope this helps... Steve. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general --------------------------------- Everyone is raving about the all-new Yahoo! Mail beta. -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Fri Dec 1 12:54:20 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 01 Dec 2006 14:54:20 -0600 Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <248325.81711.qm@web58001.mail.re3.yahoo.com> References: <248325.81711.qm@web58001.mail.re3.yahoo.com> Message-ID: <1165006460.6872.59.camel@stevo-desktop> On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote: > Steve, > > Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP > stable branch? > > I do not get some of library (librdmacm) gets created to be used by > mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel. 
> > David > The stable release of the iWARP branch is here: https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable Instructions on setting this up with Chelsio's T3 device are here: https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Steve. From elsen_david at yahoo.com Fri Dec 1 13:14:52 2006 From: elsen_david at yahoo.com (david elsen) Date: Fri, 1 Dec 2006 13:14:52 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <1165006460.6872.59.camel@stevo-desktop> Message-ID: <20061201211452.6831.qmail@web58004.mail.re3.yahoo.com> thanks Steve Wise wrote: On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote: > Steve, > > Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP > stable branch? > > I do not get some of library (librdmacm) gets created to be used by > mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel. > > David > The stable release of the iWARP branch is here: https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable Instructions on setting this up with Chelsio's T3 device are here: https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Steve. --------------------------------- Everyone is raving about the all-new Yahoo! Mail beta. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jgunthorpe at obsidianresearch.com Fri Dec 1 13:47:15 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 1 Dec 2006 14:47:15 -0700 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <1165003608.11808.188882.camel@hal.voltaire.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> Message-ID: <20061201214715.GF32366@obsidianresearch.com> On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote: > > configuration of MGID space in SM to consider for compression may > > be required total of 2005 MGIDs in cluster up to 2005 multicast > > subscriptions per node (sender only for Solicited Node initiators) > > Does the node subscribe to every IPv6 SN group ? A node will only use another nodes SN group in a send-only fashion and only when it is doing neighbour discovery for that node. So at the worst case you potentially have N^2 send-only subscriptions, N normal subscriptions and N groups. If IPv6 SN multicast MLIDs are always routed in the fabric so that all IPv6 nodes can be send-only then the send-only subscriptions don't need to be considered. Presumably because of this send-only join and unjoin can result in no data structure in the SM.. > I think before pursuing option 1 there needs to be a discussion with the > IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu). Option 1 sounds difficult to me. It would be hard to have interop between nodes using this optimization and nodes that don't.. Another approach would be to manipulate the IPv6 address of the node so that the lower 24 bits are the same. 
That gets the same effect, but I'm not sure how you'd go about doing it :> Jason From David.Costa at Sun.COM Fri Dec 1 14:20:31 2006 From: David.Costa at Sun.COM (David Costa) Date: Fri, 01 Dec 2006 17:20:31 -0500 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Message-ID: <4570AAAF.8070701@Sun.Com> Hello all, I am running the HPCC benchmark on a Sun Blade 8000 blade server. I have two blades running RHEL4U3 and SLESSP3 respectively with 32 GBytes of memory each. The HPCC benchmark is running on a sun developed IB module that uses the Mellanox 25204 chips. When it gets to the MPIRandomAccess test, it immediately fails and I see the following messages listed below. Does anyone know what the messages mean, and a possible underlying cause? Please reply to me directly as I am not subscribed to this list. Thank you, Dave Costa david.costa at sun.com [root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile /usr/local/bin/hpcc 24 - MPI_CANCEL : Internal MPI error! [24] [] Aborting Program! mpirun_rsh: Abort signaled from [24] 26 - MPI_CANCEL : Internal MPI error! [26] [] Aborting Program! 15 - MPI_CANCEL : Internal MPI error! [15] [] Aborting Program! 18 - MPI_CANCEL : Internal MPI error! [18] [] Aborting Program! 22 - MPI_CANCEL : Internal MPI error! [22] [] Aborting Program! 4 - MPI_CANCEL : Internal MPI error! [4] [] Aborting Program! 13 - MPI_CANCEL : Internal MPI error! [13] [] Aborting Program! 11 - MPI_CANCEL : Internal MPI error! 16 - MPI_CANCEL : Internal MPI error! [16] [] Aborting Program! [11] [] Aborting Program! 28 - MPI_CANCEL : Internal MPI error! [28] [] Aborting Program! [19] Abort: [an1-bl1:19] Got completion with error, code=12 at line 2365 in file viacheck.c [23] Abort: [an1-bl1:23] Got completion with error, code=12 at line 2365 in file viacheck.c [17] Abort: [an1-bl1:17] Got completion with error, code=12 at line 2365 in file viacheck.c done. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Fri Dec 1 14:26:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 Dec 2006 14:26:12 -0800 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <20061201214715.GF32366@obsidianresearch.com> (Jason Gunthorpe's message of "Fri, 1 Dec 2006 14:47:15 -0700") References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201214715.GF32366@obsidianresearch.com> Message-ID: > Option 1 sounds difficult to me. It would be hard to have interop > between nodes using this optimization and nodes that don't.. Yes, that is a major problem. One intermediate thing we could do is to have nodes join their own solicited-node group as a full member, but have other nodes send ND messages to the all-nodes group. Then the SM would only have O(N) MCG memberships to maintain. But it still requires the SM to be smart about mapping multiple MCGs to a single MLID. And even if that works, I'm not sure it's compliant with all the relevant RFCs, and it might break in some strange situations... (To be honest though, I think that the SM for a subnet with N nodes should really be beefy enough to handle N^2 multicast memberships. Even 10K nodes leads to only 100M group memberships, which shouldn't be _that_ expensive with the right data structures) - R. From boris at mellanox.com Fri Dec 1 14:29:42 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 1 Dec 2006 14:29:42 -0800 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Message-ID: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> Hi David, If you are using OFED-1.1 stack and OSU MVAPICH provided with the OFED-1.1 package as your MPI layer, the attached patch should solve your problem. Please, let me know if that helped. Regards, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 
2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of David Costa Sent: Friday, December 01, 2006 2:21 PM To: openib-general at openib.org; David.Costa at Sun.COM; Robert Houk; Anthony Vinciguerra; Thomas Babbit Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Hello all, I am running the HPCC benchmark on a Sun Blade 8000 blade server. I have two blades running RHEL4U3 and SLESSP3 respectively with 32 GBytes of memory each. The HPCC benchmark is running on a sun developed IB module that uses the Mellanox 25204 chips. When it gets to the MPIRandomAccess test, it immediately fails and I see the following messages listed below. Does anyone know what the messages mean, and a possible underlying cause? Please reply to me directly as I am not subscribed to this list. Thank you, Dave Costa david.costa at sun.com [root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile /usr/local/bin/hpcc 24 - MPI_CANCEL : Internal MPI error! [24] [] Aborting Program! mpirun_rsh: Abort signaled from [24] 26 - MPI_CANCEL : Internal MPI error! [26] [] Aborting Program! 15 - MPI_CANCEL : Internal MPI error! [15] [] Aborting Program! 18 - MPI_CANCEL : Internal MPI error! [18] [] Aborting Program! 22 - MPI_CANCEL : Internal MPI error! [22] [] Aborting Program! 4 - MPI_CANCEL : Internal MPI error! [4] [] Aborting Program! 13 - MPI_CANCEL : Internal MPI error! [13] [] Aborting Program! 11 - MPI_CANCEL : Internal MPI error! 16 - MPI_CANCEL : Internal MPI error! [16] [] Aborting Program! [11] [] Aborting Program! 28 - MPI_CANCEL : Internal MPI error! [28] [] Aborting Program! 
[19] Abort: [an1-bl1:19] Got completion with error, code=12 at line 2365 in file viacheck.c [23] Abort: [an1-bl1:23] Got completion with error, code=12 at line 2365 in file viacheck.c [17] Abort: [an1-bl1:17] Got completion with error, code=12 at line 2365 in file viacheck.c done. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smpi_cancel.patch Type: application/octet-stream Size: 1116 bytes Desc: smpi_cancel.patch URL: From rdreier at cisco.com Fri Dec 1 14:28:01 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 Dec 2006 14:28:01 -0800 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test In-Reply-To: <4570AAAF.8070701@Sun.Com> (David Costa's message of "Fri, 01 Dec 2006 17:20:31 -0500") References: <4570AAAF.8070701@Sun.Com> Message-ID: > 24 - MPI_CANCEL : Internal MPI error! It might be useful to know what MPI implementation you're using... (Also, knowing where you got your IB drivers and what version they are wouldn't hurt either) - R. From elsen_david at yahoo.com Fri Dec 1 14:30:28 2006 From: elsen_david at yahoo.com (david elsen) Date: Fri, 1 Dec 2006 14:30:28 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <1165006460.6872.59.camel@stevo-desktop> Message-ID: <190551.24739.qm@web58010.mail.re3.yahoo.com> Hi Steve, I am trying to use the https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable for the Ammasso card. While compiling the libamso library, I got the following error: make all-am make[1]: Entering directory `/usr/src/gen2/branches/iwarp/userspace/libamso' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. 
-g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF ".deps/src_amso_la-cq.Tpo" -c -o src_amso_la-cq.lo `test -f 'src/cq.c' || echo './'`src/cq.c; \ then mv -f ".deps/src_amso_la-cq.Tpo" ".deps/src_amso_la-cq.Plo"; else rm -f ".deps/src_amso_la-cq.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF .deps/src_amso_la-cq.Tpo -c src/cq.c -fPIC -DPIC -o .libs/src_amso_la-cq.o In file included from src/cq.c:42: src/amso.h: In function 'to_amso_dev': src/amso.h:83: warning: implicit declaration of function 'offsetof' src/amso.h:83: error: expected expression before 'struct' src/amso.h: In function 'to_amso_ctx': src/amso.h:88: error: expected expression before 'struct' src/amso.h: In function 'to_amso_pd': src/amso.h:93: error: expected expression before 'struct' src/amso.h: In function 'to_amso_cq': src/amso.h:98: error: expected expression before 'struct' src/amso.h: In function 'to_amso_qp': src/amso.h:103: error: expected expression before 'struct' make[1]: *** [src_amso_la-cq.lo] Error 1 make[1]: Leaving directory `/usr/src/gen2/branches/iwarp/userspace/libamso' make: *** [all] Error 2 which seems to be complaining something in amso.h file in the following lins: #define to_amso_xxx(xxx, type) \ ((struct amso_##type *) \ ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx))) Can you let me know if I am missing something? Thanks, David Steve Wise wrote: On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote: > Steve, > > Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP > stable branch? > > I do not get some of library (librdmacm) gets created to be used by > mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel. 
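A minimal sketch of the build failure quoted above, assuming the header missing from amso.h is <stddef.h>, which is where standard C declares offsetof. Without that declaration gcc treats offsetof as an implicitly declared function, so the type name passed as its first argument produces "expected expression before 'struct'". The structures below are hypothetical stand-ins, not the real libibverbs or libamso definitions, and the pointer arithmetic uses char * instead of the GNU void * extension used in the original macro.

```c
#include <stddef.h>   /* declares offsetof(); assumed to be the missing include */

/* Hypothetical stand-ins for the generic verbs object and its driver wrapper. */
struct ibv_cq { int handle; };
struct amso_cq {
    int           depth;
    struct ibv_cq ibv_cq;   /* generic object embedded in the driver object */
};

/* Same shape as to_amso_xxx() in amso.h: step back from the embedded
 * member to the start of the structure that contains it. */
#define to_amso_xxx(xxx, type) \
    ((struct amso_##type *) \
     ((char *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx)))

/* Expands to: (struct amso_cq *)((char *)ibcq - offsetof(struct amso_cq, ibv_cq)) */
static struct amso_cq *to_amso_cq(struct ibv_cq *ibcq)
{
    return to_amso_xxx(cq, cq);
}
```

Given a struct amso_cq named cq, to_amso_cq(&cq.ibv_cq) returns &cq. Removing the <stddef.h> include from this sketch should reproduce the same class of error as in the log.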
> > David > The stable release of the iWARP branch is here: https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable Instructions on setting this up with Chelsio's T3 device are here: https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Steve. --------------------------------- Everyone is raving about the all-new Yahoo! Mail beta. -------------- next part -------------- An HTML attachment was scrubbed... URL: From elsen_david at yahoo.com Fri Dec 1 14:40:04 2006 From: elsen_david at yahoo.com (david elsen) Date: Fri, 1 Dec 2006 14:40:04 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <190551.24739.qm@web58010.mail.re3.yahoo.com> Message-ID: <142979.41727.qm@web58003.mail.re3.yahoo.com> Steve, I added #include in amso.h file, then I can compile it. David david elsen wrote: Hi Steve, I am trying to use the https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable for the Ammasso card. While compiling the libamso library, I got the following error: make all-am make[1]: Entering directory `/usr/src/gen2/branches/iwarp/userspace/libamso' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF ".deps/src_amso_la-cq.Tpo" -c -o src_amso_la-cq.lo `test -f 'src/cq.c' || echo './'`src/cq.c; \ then mv -f ".deps/src_amso_la-cq.Tpo" ".deps/src_amso_la-cq.Plo"; else rm -f ".deps/src_amso_la-cq.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. 
-g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF .deps/src_amso_la-cq.Tpo -c src/cq.c -fPIC -DPIC -o .libs/src_amso_la-cq.o In file included from src/cq.c:42: src/amso.h: In function 'to_amso_dev': src/amso.h:83: warning: implicit declaration of function 'offsetof' src/amso.h:83: error: expected expression before 'struct' src/amso.h: In function 'to_amso_ctx': src/amso.h:88: error: expected expression before 'struct' src/amso.h: In function 'to_amso_pd': src/amso.h:93: error: expected expression before 'struct' src/amso.h: In function 'to_amso_cq': src/amso.h:98: error: expected expression before 'struct' src/amso.h: In function 'to_amso_qp': src/amso.h:103: error: expected expression before 'struct' make[1]: *** [src_amso_la-cq.lo] Error 1 make[1]: Leaving directory `/usr/src/gen2/branches/iwarp/userspace/libamso' make: *** [all] Error 2 which seems to be complaining something in amso.h file in the following lins: #define to_amso_xxx(xxx, type) \ ((struct amso_##type *) \ ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx))) Can you let me know if I am missing something? Thanks, David Steve Wise wrote: On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote: > Steve, > > Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP > stable branch? > > I do not get some of library (librdmacm) gets created to be used by > mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel. > > David > The stable release of the iWARP branch is here: https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable Instructions on setting this up with Chelsio's T3 device are here: https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Steve. --------------------------------- Everyone is raving about the all-new Yahoo! 
Mail beta._______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general --------------------------------- Access over 1 million songs - Yahoo! Music Unlimited. -------------- next part -------------- An HTML attachment was scrubbed... URL: From maya986 at 012.net.il Fri Dec 1 13:54:52 2006 From: maya986 at 012.net.il (=?windows-1255?Q?=F7=EC=E9=F4=E9=ED_=EC=E0=E9=F8=E5=F2=E9=ED?=) Date: Fri, 1 Dec 2006 23:54:52 +0200 Subject: [openib-general] =?windows-1255?b?6fkg7Oog8Onx6eXvIOHo7O749+jp?= =?windows-1255?b?8OI/ICAg7vLsIDExLDAwMCD5Iucg4efl4/kg7O764Onu6e0u?= Message-ID: <4ac3c53551a1a31224e6d220001ab284@012.net.il> Hello, and sorry for the interruption! "שיר בעצמך" is looking for telephone sales representatives for marketing, administration and sales work. *Work schedule: Sunday to Thursday, 9:00-18:00. *A young, dynamic, high-quality and supportive work environment. *Salary: base plus commissions, over 11,000 NIS for suitable candidates. Requirements: *At least two years at a previous workplace. *Experience in a telemarketing call center. *Willingness to work under pressure and at crazy hours. *Strong persuasion skills. *Willingness to help people. *A desire to succeed in a big way. Job location: Tel Aviv. If you meet the requirements, send a detailed CV (listed by year) by return email. Thank you, Sigal A., HR Manager, "שיר בעצמך" shir4u.co.il From elsen_david at yahoo.com Fri Dec 1 14:58:06 2006 From: elsen_david at yahoo.com (david elsen) Date: Fri, 1 Dec 2006 14:58:06 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <142979.41727.qm@web58003.mail.re3.yahoo.com> Message-ID: <602111.87729.qm@web58007.mail.re3.yahoo.com> Steve, I can run rping, rdma_lat etc on the Ammasso card but when I try to run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. 
./mpdboot -n 1 debug: starting /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory running mpdallexit on ammasso1 LAUNCHED mpd on ammasso1 via debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d debug: mpd on ammasso1 on port 35352 RUNNING: mpd on ammasso1 debug: info for running mpd: {'ncpus': 1, 'list_port': 35352, 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''} Thanks, David david elsen wrote: Steve, I added #include in amso.h file, then I can compile it. David david elsen wrote: Hi Steve, I am trying to use the https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable for the Ammasso card. While compiling the libamso library, I got the following error: make all-am make[1]: Entering directory `/usr/src/gen2/branches/iwarp/userspace/libamso' if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF ".deps/src_amso_la-cq.Tpo" -c -o src_amso_la-cq.lo `test -f 'src/cq.c' || echo './'`src/cq.c; \ then mv -f ".deps/src_amso_la-cq.Tpo" ".deps/src_amso_la-cq.Plo"; else rm -f ".deps/src_amso_la-cq.Tpo"; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I. 
-g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF .deps/src_amso_la-cq.Tpo -c src/cq.c -fPIC -DPIC -o .libs/src_amso_la-cq.o In file included from src/cq.c:42: src/amso.h: In function 'to_amso_dev': src/amso.h:83: warning: implicit declaration of function 'offsetof' src/amso.h:83: error: expected expression before 'struct' src/amso.h: In function 'to_amso_ctx': src/amso.h:88: error: expected expression before 'struct' src/amso.h: In function 'to_amso_pd': src/amso.h:93: error: expected expression before 'struct' src/amso.h: In function 'to_amso_cq': src/amso.h:98: error: expected expression before 'struct' src/amso.h: In function 'to_amso_qp': src/amso.h:103: error: expected expression before 'struct' make[1]: *** [src_amso_la-cq.lo] Error 1 make[1]: Leaving directory `/usr/src/gen2/branches/iwarp/userspace/libamso' make: *** [all] Error 2 which seems to be complaining something in amso.h file in the following lins: #define to_amso_xxx(xxx, type) \ ((struct amso_##type *) \ ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx))) Can you let me know if I am missing something? Thanks, David Steve Wise wrote: On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote: > Steve, > > Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP > stable branch? > > I do not get some of library (librdmacm) gets created to be used by > mvapich2-0.9.8 on the Fedora 6 distribution with 2.6.17.13 kernel. > > David > The stable release of the iWARP branch is here: https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable Instructions on setting this up with Chelsio's T3 device are here: https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 Steve. --------------------------------- Everyone is raving about the all-new Yahoo! 
Mail beta._______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general --------------------------------- Access over 1 million songs - Yahoo! Music Unlimited. --------------------------------- Access over 1 million songs - Yahoo! Music Unlimited. -------------- next part -------------- An HTML attachment was scrubbed... URL: 
From surs at cse.ohio-state.edu Fri Dec 1 14:57:14 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Fri, 1 Dec 2006 17:57:14 -0500 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test In-Reply-To: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> Message-ID: <20061201225713.GA7343@cse.ohio-state.edu> Hi Boris, Thanks for forwarding the patch to the list. This patch was also added to the MVAPICH svn repository (both trunk and 0.9.8 bugfix branches) a few days back. David: If you are using MVAPICH, you can check out from the SVN 0.9.8 bugfix branch too. Thanks, Sayantan. * On Dec,3 Boris Shpolyansky wrote : > Hi David, > > If you are using OFED-1.1 stack and OSU MVAPICH provided with the OFED-1.1 > package as your MPI layer, > the attached patch should solve your problem. > > Please, let me know if that helped. > > Regards, > > Boris Shpolyansky > Application Engineer > Mellanox Technologies Inc. 
> 2900 Stender Way > Santa Clara, CA 95054 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com -- http://www.cse.ohio-state.edu/~surs From Thomas.Talpey at netapp.com Fri Dec 1 14:57:21 2006 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Fri, 01 Dec 2006 17:57:21 -0500 Subject: [openib-general] NFS/RDMA for Linux: client and server update release 7 Message-ID: Network Appliance is pleased to announce release 7 of the NFS/RDMA client and server for Linux 2.6.18. This update to the August release fixes known issues, improves usability and server stability, and supports NFSv4. The code supports both Infiniband and iWARP transports over the standard openfabrics Linux facility. This code is functionally similar to the previous RC6 release, with many bugfixes and performance improvements applied. The client and server now use port 2050 (instead of overloading the standard NFS/TCP 2049), pending further discussion and official assignment as proposed in the most recent IETF working group meeting. An alignment issue leading to performance impact on IA64 architectures has been corrected in the server. Extensive further testing on Infiniband and iWARP was performed and (for example) this NFS/RDMA code was demonstrated running Oracle 10g Reliable Application Clusters at SuperComputing 2006 last month. A full list of bugs resolved is available at the project's tracking page: We welcome protocol comments, implementation comments and user experience, directly or on any of the above mailing lists. Tom Talpey, for the NFS/RDMA project. From rdreier at cisco.com Fri Dec 1 15:12:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 Dec 2006 15:12:41 -0800 Subject: [openib-general] NFS/RDMA for Linux: client and server update release 7 In-Reply-To: ( Thomas Talpey's message of "Fri, 01 Dec 2006 17:57:21 -0500") References: Message-ID: What is the status of moving this code towards merging to the upstream kernel? 
Thanks, Roland From swise at opengridcomputing.com Fri Dec 1 15:14:31 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 01 Dec 2006 17:14:31 -0600 Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <602111.87729.qm@web58007.mail.re3.yahoo.com> References: <602111.87729.qm@web58007.mail.re3.yahoo.com> Message-ID: <1165014871.6872.85.camel@stevo-desktop> I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their dev team so maybe they can help. Steve. On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote: > Steve, > > I can run rping, rdma_lat etc on the Ammasso card but when I try to > run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. > > ./mpdboot -n 1 > debug: starting > /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries: > librdmacm.so: cannot open shared object file: No such file or > directory > running mpdallexit on ammasso1 > LAUNCHED mpd on ammasso1 via > debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d > debug: mpd on ammasso1 on port 35352 > RUNNING: mpd on ammasso1 > debug: info for running mpd: {'ncpus': 1, 'list_port': 35352, > 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''} > > Thanks, > David > > david elsen wrote: > Steve, > > I added > > #include > > in amso.h file, then I can compile it. > > David > > > david elsen wrote: > Hi Steve, > I am trying to use the > https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable > for the Ammasso card. > > While compiling the libamso library, I got the > following error: > make all-am > make[1]: Entering directory > `/usr/src/gen2/branches/iwarp/userspace/libamso' > if /bin/sh ./libtool --tag=CC --mode=compile gcc > -DHAVE_CONFIG_H -I. -I. -I. 
-g -Wall -D_GNU_SOURCE > -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF > ".deps/src_amso_la-cq.Tpo" -c -o src_amso_la-cq.lo > `test -f 'src/cq.c' || echo './'`src/cq.c; \ > then mv -f ".deps/src_amso_la-cq.Tpo" > ".deps/src_amso_la-cq.Plo"; else rm -f > ".deps/src_amso_la-cq.Tpo"; exit 1; fi > mkdir .libs > gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall > -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP > -MF .deps/src_amso_la-cq.Tpo -c src/cq.c -fPIC -DPIC > -o .libs/src_amso_la-cq.o > In file included from src/cq.c:42: > src/amso.h: In function 'to_amso_dev': > src/amso.h:83: warning: implicit declaration of > function 'offsetof' > src/amso.h:83: error: expected expression before > 'struct' > src/amso.h: In function 'to_amso_ctx': > src/amso.h:88: error: expected expression before > 'struct' > src/amso.h: In function 'to_amso_pd': > src/amso.h:93: error: expected expression before > 'struct' > src/amso.h: In function 'to_amso_cq': > src/amso.h:98: error: expected expression before > 'struct' > src/amso.h: In function 'to_amso_qp': > src/amso.h:103: error: expected expression before > 'struct' > make[1]: *** [src_amso_la-cq.lo] Error 1 > make[1]: Leaving directory > `/usr/src/gen2/branches/iwarp/userspace/libamso' > make: *** [all] Error 2 > > which seems to be complaining something in amso.h file > in the following lins: > > #define to_amso_xxx(xxx, type) > \ > ((struct amso_##type *) > \ > ((void *) ib##xxx - offsetof(struct > amso_##type, ibv_##xxx))) > > Can you let me know if I am missing something? > Thanks, > David > > Steve Wise wrote: > > > On Fri, 2006-12-01 at 12:50 -0800, david elsen > wrote: > > Steve, > > > > Is this > https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP > > stable branch? > > > > I do not get some of library (librdmacm) > gets created to be used by > > mvapich2-0.9.8 on the Fedora 6 distribution > with 2.6.17.13 kernel. 
> > > > David > > > > The stable release of the iWARP branch is > here: > > https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable > > > Instructions on setting this up with Chelsio's > T3 device are here: > > https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 > > > Steve. > > > > > > ______________________________________________________ > Everyone is raving about the all-new Yahoo! Mail > beta._______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > ______________________________________________________________ > Access over 1 million songs - Yahoo! Music Unlimited. > > > > > ______________________________________________________________________ > Access over 1 million songs - Yahoo! Music Unlimited. From rdreier at cisco.com Fri Dec 1 15:14:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 Dec 2006 15:14:34 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> (Ralph Campbell's message of "Fri, 1 Dec 2006 12:39:16 -0800 (PST)") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> Message-ID: > Although the driver compiles on 32-bit kernels, it is unsupported > and never been tested. All known 64-bit systems don't define > CONFIG_HIGHMEM. In spite of previous emails suggesting that > page_address() can return NULL without CONFIG_HIGHMEM defined, > the code in include/linux/mm.h doesn't allow it (assuming the > page pointer is valid and not some random address). > I verified this with Andrew Morton. 
Hmm, is there no way to make this work on 32-bit kernels? I don't want to do something that we'll have to change again if we want to make things work on 32-bits. (And I know that qlogic has no intention of supporting the driver on 32-bit kernels, but we shouldn't make it impossible for someone else to fix it) From rdreier at cisco.com Fri Dec 1 15:15:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 Dec 2006 15:15:44 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1164918691.14800.101.camel@brick.pathscale.com> (Ralph Campbell's message of "Thu, 30 Nov 2006 12:31:31 -0800") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> Message-ID: Oh yeah, one other thing... could you respin this so that all the new dma_xxx wrappers go into a new file like (and include that from )? ib_verbs.h is already too big I think. From jgunthorpe at obsidianresearch.com Fri Dec 1 15:17:53 2006 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 1 Dec 2006 16:17:53 -0700 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <1165003608.11808.188882.camel@hal.voltaire.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> Message-ID: <20061201231753.GG32366@obsidianresearch.com> On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote: > > total of 2000*2005 (4,010,000) multicast member records in SA for fabric > > This is based on the above (which I'm not sure about) and is the worst > theoretical case, not the practical case. It isn't in the IB spec, but what would really help here is to be able to join a multicast prefix (more than 1 group with a single entry). Todd's option 1 optimization is then easially realized by having all IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries) as full members. 
This provides interoperability between with stacks with this feature and without. Option 2 works better as well because all the nodes join FF02::1:FF00:0/104 as a send-only member on startup and then you only get N*2 multicast records to maintain. This also would improve the performance of IPv6 ND by not having to join/leave the SN groups for each ND query. IBA would have to be changed to support a prefix bits field in the MCMemberRecord structure though.. Jason From David.Costa at Sun.COM Fri Dec 1 15:22:51 2006 From: David.Costa at Sun.COM (David Costa) Date: Fri, 01 Dec 2006 18:22:51 -0500 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test In-Reply-To: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> Message-ID: <4570B94B.6030202@Sun.Com> My apologies to everyone who replied, I am indeed using OFED 1.1 and the included OSU MVAPICH. I will try your patch on Monday Boris and reply to the list about how I made out. Best Regards, Dave Costa Boris Shpolyansky wrote: > Hi David, > > If you are using OFED-1.1 stack and OSU MVAPICH provided with the > OFED-1.1 package as your MPI layer, > the attached patch should solve your problem. > > Please, let me know if that helped. > > Regards, > > Boris Shpolyansky > Application Engineer > Mellanox Technologies Inc. 
> 2900 Stender Way > Santa Clara, CA 95054 > Tel.: (408) 916 0014 > Fax: (408) 970 3403 > Cell: (408) 834 9365 > www.mellanox.com > > ------------------------------------------------------------------------ > *From:* openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] *On Behalf Of *David Costa > *Sent:* Friday, December 01, 2006 2:21 PM > *To:* openib-general at openib.org; David.Costa at Sun.COM; Robert Houk; > Anthony Vinciguerra; Thomas Babbit > *Subject:* [openib-general] HPCC benchmark aborts at MPIRandomAccess test > > Hello all, > > I am running the HPCC benchmark on a Sun Blade 8000 blade server. I > have two blades running RHEL4U3 and SLESSP3 respectively with 32 > GBytes of memory each. The HPCC benchmark is running on a sun > developed IB module that uses the Mellanox 25204 chips. When it gets > to the MPIRandomAccess test, it immediately fails and I see the > following messages listed below. > > Does anyone know what the messages mean, and a possible underlying > cause? Please reply to me directly as I am not subscribed to this list. > > Thank you, > > Dave Costa > david.costa at sun.com > > > [root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile > /usr/local/bin/hpcc > 24 - MPI_CANCEL : Internal MPI error! > [24] [] Aborting Program! > mpirun_rsh: Abort signaled from [24] > 26 - MPI_CANCEL : Internal MPI error! > [26] [] Aborting Program! > 15 - MPI_CANCEL : Internal MPI error! > [15] [] Aborting Program! > 18 - MPI_CANCEL : Internal MPI error! > [18] [] Aborting Program! > 22 - MPI_CANCEL : Internal MPI error! > [22] [] Aborting Program! > 4 - MPI_CANCEL : Internal MPI error! > [4] [] Aborting Program! > 13 - MPI_CANCEL : Internal MPI error! > [13] [] Aborting Program! > 11 - MPI_CANCEL : Internal MPI error! > 16 - MPI_CANCEL : Internal MPI error! > [16] [] Aborting Program! > [11] [] Aborting Program! > 28 - MPI_CANCEL : Internal MPI error! > [28] [] Aborting Program! 
> [19] Abort: [an1-bl1:19] Got completion with error, code=12 > at line 2365 in file viacheck.c > [23] Abort: [an1-bl1:23] Got completion with error, code=12 > at line 2365 in file viacheck.c > [17] Abort: [an1-bl1:17] Got completion with error, code=12 > at line 2365 in file viacheck.c > done. -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Fri Dec 1 15:25:35 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 18:25:35 -0500 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <20061201214715.GF32366@obsidianresearch.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201214715.GF32366@obsidianresearch.com> Message-ID: <1165015489.11808.195631.camel@hal.voltaire.com> On Fri, 2006-12-01 at 16:47, Jason Gunthorpe wrote: > On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote: > > > > configuration of MGID space in SM to consider for compression may > > > be required total of 2005 MGIDs in cluster up to 2005 multicast > > > subscriptions per node (sender only for Solicited Node initiators) > > > > Does the node subscribe to every IPv6 SN group ? > > A node will only use another nodes SN group in a send-only fashion and > only when it is doing neighbour discovery for that node. > > So at the worst case you potentially have N^2 send-only subscriptions, > N normal subscriptions and N groups. Send only subscriptions are largely the same (in terms of SM/SA) as full subscriptions except in a couple of details. > If IPv6 SN multicast MLIDs are always routed in the fabric so that all > IPv6 nodes can be send-only then the send-only subscriptions don't > need to be considered. Presumably because of this send-only join and > unjoin can result in no data structure in the SM.. There is a data structure associated with these memberships. 
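The counts traded in this thread can be checked with a little arithmetic. The sketch below assumes the worst case described above (each node is a full member of its own solicited-node group and a send-only member of every other node's group) and, for comparison, the roughly N*2 records suggested for the prefix-join idea. The function names are hypothetical and are not part of any SM or SA implementation.

```c
#include <stdint.h>

/* Worst case: N full memberships plus N*(N-1) send-only memberships,
 * which comes to N*N member records in total. */
static uint64_t worst_case_memberships(uint64_t nodes)
{
    uint64_t full = nodes;                      /* one own group per node */
    uint64_t send_only = nodes * (nodes - 1);   /* every other node's group */
    return full + send_only;
}

/* Prefix-join proposal: about two records per node (a full join of the
 * node's own group plus one send-only join of the whole prefix). */
static uint64_t prefix_join_records(uint64_t nodes)
{
    return 2 * nodes;
}
```

For 10,000 nodes the worst case is 100,000,000 records, which matches the "100M group memberships" estimate; for 2,000 nodes it is 4,000,000, close to the 2000*2005 figure quoted in the thread, while the prefix-join estimate for 2,000 nodes is 4,000.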
-- Hal > > I think before pursuing option 1 there needs to be a discussion with the > > IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu). > > Option 1 sounds difficult to me. It would be hard to have interop > between nodes using this optimization and nodes that don't.. > > Another approach would be to manipulate the IPv6 address of the node > so that the lower 24 bits are the same. That gets the same effect, but > I'm not sure how you'd go about doing it :> > > Jason From halr at voltaire.com Fri Dec 1 15:32:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 18:32:34 -0500 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201214715.GF32366@obsidianresearch.com> Message-ID: <1165015925.11808.195836.camel@hal.voltaire.com> On Fri, 2006-12-01 at 17:26, Roland Dreier wrote: > > Option 1 sounds difficult to me. It would be hard to have interop > > between nodes using this optimization and nodes that don't.. > > Yes, that is a major problem. > > One intermediate thing we could do is to have nodes join their own > solicited-node group as a full member, but have other nodes send ND > messages to the all-nodes group. Then the SM would only have O(N) > MCG memberships to maintain. But it still requires the SM to be smart > about mapping multiple MCGs to a single MLID. > > And even if that works, I'm not sure it's compliant with all the > relevant RFCs, and it might break in some strange situations... > > (To be honest though, I think that the SM for a subnet with N nodes > should really be beefy enough to handle N^2 multicast memberships. > Even 10K nodes leads to only 100M group memberships, which shouldn't > be _that_ expensive with the right data structures) The data structures are one concern. 
The others would be routing N large (multicast) trees and also the SA transaction rate this causes (similar to the large path record request case). -- Hal > - R. From halr at voltaire.com Fri Dec 1 15:47:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Dec 2006 18:47:38 -0500 Subject: [openib-general] IPv6 and IPoIB scalability issue In-Reply-To: <20061201231753.GG32366@obsidianresearch.com> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201231753.GG32366@obsidianresearch.com> Message-ID: <1165016801.11808.196277.camel@hal.voltaire.com> On Fri, 2006-12-01 at 18:17, Jason Gunthorpe wrote: > On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote: > > > > total of 2000*2005 (4,010,000) multicast member records in SA for fabric > > > > This is based on the above (which I'm not sure about) and is the worst > > theoretical case, not the practical case. > > It isn't in the IB spec, but what would really help here is to be able > to join a multicast prefix (more than 1 group with a single entry). > > Todd's option 1 optimization is then easially realized by having all > IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries) These are IPmc groups not IB mc groups though. I suppose you are asking for the equivalent function in IB. When that subscribe is done, would it automatically collapse to 1 MLID ? If that's what you mean, a spec extension for this could be proposed and carried forward at the (IBTA) MgtWG. Is there a special value of those 24 bits which is not used (and could be used to indicate subscribe all) ? Or do you see another way to indicate this ? There are some reserved bits at the end of MCMemberRecord which could also be used to indicate this. That's probably better. > as > full members. This provides interoperability between with stacks with > this feature and without. 
> > Option 2 works better as well because all the nodes join > FF02::1:FF00:0/104 as a send-only member on startup and then you only > get N*2 multicast records to maintain. > > This also would improve the performance of IPv6 ND by not having to > join/leave the SN groups for each ND query. > > IBA would have to be changed to support a prefix bits field in the > MCMemberRecord structure though.. Is a full prefix needed or only 1 bit indicating join all ? If a prefix is needed, it sounds like it is 24 bits in width. (That appears more than what is available but I'll look more). -- Hal > Jason From ralph.campbel at qlogic.com Fri Dec 1 16:27:59 2006 From: ralph.campbel at qlogic.com (Ralph Campbell) Date: Fri, 1 Dec 2006 16:27:59 -0800 (PST) Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> Message-ID: <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> > > Although the driver compiles on 32-bit kernels, it is unsupported > > and never been tested. All known 64-bit systems don't define > > CONFIG_HIGHMEM. In spite of previous emails suggesting that > > page_address() can return NULL without CONFIG_HIGHMEM defined, > > the code in include/linux/mm.h doesn't allow it (assuming the > > page pointer is valid and not some random address). > > I verified this with Andrew Morton. > > Hmm, is there no way to make this work on 32-bit kernels? I don't > want to do something that we'll have to change again if we want to > make things work on 32-bits. > > (And I know that qlogic has no intention of supporting the driver on > 32-bit kernels, but we shouldn't make it impossible for someone else > to fix it) I don't think this is impossible to implement. 
I just wanted to avoid the work unless you and others thought it was really worth it given the reality that we already have a large test matrix of platforms, distros, and kernel versions and it probably won't get much testing. It is possible that at some point 32-bit kernels will become a priority but I don't know when that might happen. From ralph.campbel at qlogic.com Fri Dec 1 16:28:37 2006 From: ralph.campbel at qlogic.com (Ralph Campbell) Date: Fri, 1 Dec 2006 16:28:37 -0800 (PST) Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> Message-ID: <40471.71.131.5.186.1165019317.squirrel@rocky.pathscale.com> > Oh yeah, one other thing... > > could you respin this so that all the new dma_xxx wrappers go into a > new file like (and include that from > )? ib_verbs.h is already too big I think. Sure, no problem. From rowland at cse.ohio-state.edu Fri Dec 1 16:36:56 2006 From: rowland at cse.ohio-state.edu (Shaun Rowland) Date: Fri, 01 Dec 2006 19:36:56 -0500 Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <1165014871.6872.85.camel@stevo-desktop> References: <602111.87729.qm@web58007.mail.re3.yahoo.com> <1165014871.6872.85.camel@stevo-desktop> Message-ID: <4570CAA8.5080806@cse.ohio-state.edu> Steve Wise wrote: > I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their > dev team so maybe they can help. > > Steve. > > > > On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote: >> Steve, >> >> I can run rping, rdma_lat etc on the Ammasso card but when I try to >> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. 
>>
>> ./mpdboot -n 1
>> debug: starting
>> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
>> librdmacm.so: cannot open shared object file: No such file or
>> directory
>> running mpdallexit on ammasso1
>> LAUNCHED mpd on ammasso1 via
>> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
>> debug: mpd on ammasso1 on port 35352
>> RUNNING: mpd on ammasso1
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
>> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Hello David and Steve. We discussed this problem in detail on the
mvapich-discuss list recently. David, you indicated the following in
your last email about this to mvapich-discuss on 11/26/2006:

"For some reason, it is working in SuSE, and not working in Fedora."

Is this still the case? Were the libraries built specifically on the
Fedora Core 6 system, or are you using libraries that were built on
SuSE? I assume they were built on Fedora Core 6. Were you trying to run
this as root or as a regular user? I am not sure exactly how this might
affect shared library loading, but it is possible there is a difference.

In our previous discussion, your library path did indeed have a
librdmacm.so file, though it could not be loaded for an unknown reason.
It is unclear to me if this email thread indicates that you have tried
to rebuild that and are experiencing the same issue. Were you able to
try running that test shared library example I gave, and did it work?
Did it work as the same user you are trying to run MVAPICH as?

It seems clear this is a runtime loader problem on Fedora Core 6, or on
your particular configuration there. That is what cannot find the library.
It is similar to the libtest code I provided as an example: [rowland at e14-oib libtest]$ ls Makefile test.c test.h test-program.c [rowland at e14-oib libtest]$ make normal gcc -c -fPIC test.c gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o ln -s libtest.so.1.0 libtest.so.1 ln -s libtest.so.1 libtest.so gcc -c -o test-program.o test-program.c gcc -o test-program test-program.o -L/home/7/rowland/libtest -ltest [rowland at e14-oib libtest]$ ldd test-program libtest.so.1 => not found libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000) /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000) [rowland at e14-oib libtest]$ ./test-program ./test-program: error while loading shared libraries: libtest.so.1: cannot open shared object file: No such file or directory [rowland at e14-oib libtest]$ export LD_LIBRARY_PATH=$PWD [rowland at e14-oib libtest]$ ldd test-program libtest.so.1 => /home/7/rowland/libtest/libtest.so.1 (0x00002abbf9aee000) libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000) /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000) [rowland at e14-oib libtest]$ ./test-program In shared library function... In previous email your ldd output showed the library was being found: Please see the output of ldd /usr/local/mvapich2/bin/mpdroot : [root at ammasso1 ~]# ldd /usr/local/mvapich2/bin/mpdroot linux-gate.so.1 => (0xffffe000) librdmacm.so => /usr/local/lib/librdmacm.so (0xb7fec000) libibverbs.so.2 => /usr/local/lib/libibverbs.so.2 (0xb7fe5000) libibumad.so.1 => /usr/local/lib/libibumad.so.1 (0xb7fdc000) libpthread.so.0 => /lib/libpthread.so.0 (0x0012a000) libc.so.6 => /lib/libc.so.6 (0x00ca7000) libsysfs.so.2 => /usr/lib/libsysfs.so.2 (0x00369000) libdl.so.2 => /lib/libdl.so.2 (0x00de6000) libibcommon.so.1 => /usr/local/lib/libibcommon.so.1 (0xb7fcb000) /lib/ld-linux.so.2 (0x002d8000) But that path is different than the one you are quoting above. 
Does an ldd on /root/0.9.8-RELEASE/bin/mpdroot find librdmacm.so too, as the same user you are trying to run it as? I have one more idea for you to try here. You can do the following: export LD_DEBUG=all /root/0.9.8-RELEASE/bin/mpdroot >&output unset LD_DEBUG Then take a look at the output file to see if there are any relevant error messages. Don't forget to unset LD_DEBUG before doing anything else. Also, just to be sure, if you run "file " what does it say? It should indicate that it is a shared library as similarly to: [rowland at e14-oib libtest]$ file /usr/local/ofed/lib64/librdmacm.so* /usr/local/ofed/lib64/librdmacm.so: symbolic link to `librdmacm.so.0.9.0' /usr/local/ofed/lib64/librdmacm.so.0.9.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped Unfortunately, we do not have any Fedora Core 6 systems to investigate this problem on at this time, and I don't know anything about what might be there that would cause a problem. As far as I know, there shouldn't be. However, it seems there is some runtime issue on your Fedora Core 6 machine or with how this is being run there. If it is in fact working on another distribution as you indicated in your previous response to us, then that also strongly points in this direction. 
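
The two checks suggested in this message (can the runtime loader open the
library, and is the file actually present in a directory the loader will
search) can also be scripted. What follows is a minimal sketch in Python,
assuming a Linux system. The helper names try_load and find_in_dirs are
made up for this illustration; they are not part of MVAPICH2 or OFED.
Only the standard ctypes and os modules are used.

```python
import ctypes
import os


def try_load(libname):
    """Ask the dynamic loader to open libname.

    Returns (True, None) on success, or (False, message) with the
    loader's own error text on failure.
    """
    try:
        ctypes.CDLL(libname)
        return True, None
    except OSError as exc:
        return False, str(exc)


def find_in_dirs(libname, dirs):
    """Return the directories from dirs that contain a file named libname."""
    return [d for d in dirs if d and os.path.isfile(os.path.join(d, libname))]


if __name__ == "__main__":
    name = "librdmacm.so"
    ok, err = try_load(name)
    print("loader result:", "ok" if ok else err)

    # Directories taken from LD_LIBRARY_PATH as seen by this process.
    search = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    print("found in LD_LIBRARY_PATH dirs:", find_in_dirs(name, search))
```

Running it as the same user that starts mpdboot is the point of the
exercise: if the file is found in a directory but the loader still fails,
the problem is more likely permissions, architecture mismatch, or loader
configuration than a missing file.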
-- Shaun Rowland rowland at cse.ohio-state.edu http://www.cse.ohio-state.edu/~rowland/ From rdreier at cisco.com Fri Dec 1 17:09:40 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 01 Dec 2006 17:09:40 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> (Ralph Campbell's message of "Fri, 1 Dec 2006 16:27:59 -0800 (PST)") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> Message-ID: > I don't think this is impossible to implement. I just wanted > to avoid the work unless you and others thought it was really > worth it given the reality that we already have a large > test matrix of platforms, distros, and kernel versions and > it probably won't get much testing. It is possible that > at some point 32-bit kernels will become a priority > but I don't know when that might happen. So you think you could do the ib_dma_xxx stuff for ipath without affecting anything outside of ipath? (Assuming this is merged of course) What would be the rough outline of how that would work? 
From jgunthorpe at obsidianresearch.com Fri Dec 1 17:20:54 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 18:20:54 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165016801.11808.196277.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201231753.GG32366@obsidianresearch.com> <1165016801.11808.196277.camel@hal.voltaire.com>
Message-ID: <20061202012054.GH32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 06:47:38PM -0500, Hal Rosenstock wrote:
> > It isn't in the IB spec, but what would really help here is to be able
> > to join a multicast prefix (more than 1 group with a single entry).
> >
> > Todd's option 1 optimization is then easily realized by having all
> > IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries)
>
> These are IPmc groups not IB mc groups though. I suppose you are asking
> for the equivalent function in IB. When that subscribe is done,
> would it

Right, it looks like the MGID would be close to
FF1E:601B:xxxx::1FF00:0/104 for IPv6 SN multicast (RFC4391).

My thinking was to add a new 8 bit field to MCMemberRecord called
prefixLen. Broadly (without considering how to manage compatibility)
joins like we have today would set prefixLen to 128. To do this
suggestion we'd set prefixLen to 104.

prefixLen of 104 means 2**24 MGID addresses are matched by the join
and the node is subscribed to them all. The existing MGID field is
used to encode the prefix bits, only the first 104 bits are used.

Off hand I don't see a way to indicate the length using the existing
record fields.

If we call a MCMemberRecord with a prefixLen != 128 a prefix join..

MLID mapping is a little tricky in this scheme since once a prefix
join is registered you have to start unioning membership lists with
other joins to get the right spans. (i.e. joins to FF1E::1000/120 and
FF1E::1001/128 may have different MLIDs but they would both reach a
unioned membership)

That would mean that a send-only prefix join MLID would effectively be
a broadcast MLID so if you use it with option 2 you reduce the SM
query rate and subscription load, but you are just broadcasting ND
packets. I guess the sensible use would be with option 1 where both
the send-only and full-membership join are a /104 prefix join.
[Basically, it ends up using broadcasting like Todd suggested, but in
a way where the SM can properly integrate IPoIB stacks that
don't use prefix joins.]

I don't know if this is worth pursuing.. Certainly if the main issue
is just MLID usage then option 2 is much simpler. Something like this
might be part of improving IPv6 ND scalability but that is a different
problem entirely (and does anyone care?)..

Jason

From Thomas.Talpey at netapp.com Fri Dec 1 18:00:30 2006
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 01 Dec 2006 21:00:30 -0500
Subject: [openib-general] NFS/RDMA for Linux: client and server update release 7
In-Reply-To:
References:
Message-ID:

At 06:12 PM 12/1/2006, Roland Dreier wrote:
>What is the status of moving this code towards merging to the upstream kernel?

For the client there are two main prerequisites, both in the RPC layer
and both in progress. One is the completion of the RPC transport switch
merge, mainly the ability to load as modules. The second is a new mount
syscall api, to allow transport-specific arguments to be passed in. We
have a temporary solution for that at the moment. When these two are in
place, the client is ready to consider merging.

The server actually doesn't have these dependencies, but it does need to
be updated to match the new code in 2.6.19 which raises the maximum rpc
payload size, and some additional hardening/improvements which we found
in code review. We're waiting to complete this work, which hopefully
will be this month.
Bottom line, we can put it on the table soon. Tom. From ralph.campbel at qlogic.com Fri Dec 1 18:08:42 2006 From: ralph.campbel at qlogic.com (Ralph Campbell) Date: Fri, 1 Dec 2006 18:08:42 -0800 (PST) Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> Message-ID: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> > > I don't think this is impossible to implement. I just wanted > > to avoid the work unless you and others thought it was really > > worth it given the reality that we already have a large > > test matrix of platforms, distros, and kernel versions and > > it probably won't get much testing. It is possible that > > at some point 32-bit kernels will become a priority > > but I don't know when that might happen. > > So you think you could do the ib_dma_xxx stuff for ipath without > affecting anything outside of ipath? (Assuming this is merged of course) > What would be the rough outline of how that would work? Basically, use a hash table to store the kmap result. See attached for 90% of the code. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ipath_dma.c URL: From elsen_david at yahoo.com Fri Dec 1 19:07:24 2006 From: elsen_david at yahoo.com (david elsen) Date: Fri, 1 Dec 2006 19:07:24 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <4570CAA8.5080806@cse.ohio-state.edu> Message-ID: <837388.34727.qm@web58012.mail.re3.yahoo.com> Shaun, It was working on one of my Fedora system. I tried to do the same installation on my other system which has SuSe 9.3 and it is not working there. 
So I am not sure what is going on with this.

Thanks,
David

From halr at voltaire.com Sat Dec 2 04:27:37 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Dec 2006 07:27:37 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061202012054.GH32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201231753.GG32366@obsidianresearch.com> <1165016801.11808.196277.camel@hal.voltaire.com> <20061202012054.GH32366@obsidianresearch.com>
Message-ID: <1165062394.11808.222794.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 20:20, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 06:47:38PM -0500, Hal Rosenstock wrote:
>
> > > It isn't in the IB spec, but what would really help here is to be able
> > > to join a multicast prefix (more than 1 group with a single entry).
> > >
> > > Todd's option 1 optimization is then easily realized by having all
> > > IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries)
> >
> > These are IPmc groups not IB mc groups though. I suppose you are asking
> > for the equivalent function in IB. When that subscribe is done,
> > would it
>
> Right, it looks like the MGID would be close to
> FF1E:601B:xxxx::1FF00:0/104 for IPv6 SN multicast (RFC4391).
>
> My thinking was to add a new 8 bit field to MCMemberRecord called
> prefixLen. Broadly (without considering how to manage compatibility)
> joins like we have today would set prefixLen to 128. To do this
> suggestion we'd set prefixLen to 104.
>
> prefixLen of 104 means 2**24 MGID addresses are matched by the join
> and the node is subscribed to them all. The existing MGID field is
> used to encode the prefix bits, only the first 104 bits are used.
>
> Off hand I don't see a way to indicate the length using the existing
> record fields.

Another 8 bit field could do this if this were needed.

> If we call a MCMemberRecord with a prefixLen != 128 a prefix join..

And hence the backward compatibility issue. One way to handle this would
be an exception (if the component mask does not specify PrefixLength,
rather than it being wildcarded, a PrefixLength of 128 is assumed).
There may be others.

> MLID mapping is a little tricky in this scheme since once a prefix
> join is registered you have to start unioning membership lists with
> other joins to get the right spans. (i.e. joins to FF1E::1000/120 and
> FF1E::1001/128 may have different MLIDs but they would both reach a
> unioned membership)
>
> That would mean that a send-only prefix join MLID would effectively be
> a broadcast MLID so if you use it with option 2 you reduce the SM
> query rate and subscription load, but you are just broadcasting ND
> packets. I guess the sensible use would be with option 1 where both
> the send-only and full-membership join are a /104 prefix join.
> [Basically, it ends up using broadcasting like Todd suggested, but in
> a way where the SM can properly integrate IPoIB stacks that
> don't use prefix joins.]
>
> I don't know if this is worth pursuing.. Certainly if the main issue
> is just MLID usage then option 2 is much simpler. Something like this
> might be part of improving IPv6 ND scalability but that is a different
> problem entirely (and does anyone care?)..

Is that only an IPoIB issue though, or is it more generic and does it
apply to other networks?
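
To make the proposed semantics concrete, here is a small sketch of how
prefix joins could be matched and their memberships unioned. It is purely
illustrative: prefixLen is the field proposed in this thread and is not in
the IB spec, the class and function names are invented, and Python's
ipaddress module is used only because an MGID is 128 bits wide like an
IPv6 address. A join without an explicit length defaults to 128, mirroring
the backward-compatibility rule suggested above.

```python
import ipaddress


class Join:
    """One (hypothetical) MCMemberRecord join: a port, an MGID, a prefixLen."""

    def __init__(self, port, mgid, prefix_len=128):
        self.port = port
        # strict=False masks off any bits beyond prefix_len.
        self.net = ipaddress.IPv6Network("%s/%d" % (mgid, prefix_len),
                                         strict=False)

    def is_prefix_join(self):
        return self.net.prefixlen != 128


def members_for(mgid, joins):
    """Return the set of ports that must receive traffic sent to mgid.

    This is the unioned membership: every join whose prefix covers the
    MGID contributes its port, whether it is a prefix join or an
    ordinary /128 join.
    """
    target = ipaddress.IPv6Address(mgid)
    return {j.port for j in joins if target in j.net}


if __name__ == "__main__":
    joins = [
        Join("port-A", "ff1e::1000", 120),   # prefix join
        Join("port-B", "ff1e::1001"),        # ordinary join, prefixLen 128
    ]
    print(sorted(members_for("ff1e::1001", joins)))
    print(sorted(members_for("ff1e::1002", joins)))
```

The sketch says nothing about how MLIDs would be assigned; it only shows
why a /120 join and a /128 join inside it end up sharing members.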
-- Hal > Jason From eitan at mellanox.co.il Sat Dec 2 07:51:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 02 Dec 2006 17:51:53 +0200 Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition In-Reply-To: <1165005117.11808.189660.camel@hal.voltaire.com> References: <1165005117.11808.189660.camel@hal.voltaire.com> Message-ID: <4571A119.9040408@mellanox.co.il> Hi Hal, I see you are doing some work on optimizing the locking scheme in the multicast registration flow (join and leave). What kind of testing do you do? In the simulated environment we do not have currently a test that will fire pairs of join/leave or join/leave/join and verify correctness.Maybe we should have one written. Eitan Hal Rosenstock wrote: > OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate > unneeded lock acquisition > > Signed-off-by: Sasha Khapyorsky > Signed-off-by: Hal Rosenstock > > diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c > index f7f879b..d6c6968 100644 > --- a/osm/opensm/osm_sa_mcmember_record.c > +++ b/osm/opensm/osm_sa_mcmember_record.c > @@ -1459,6 +1459,8 @@ __osm_mcmr_rcv_leave_mgrp( > new_join_state | (p_mcm_port->scope_state & 0xf0); > > mcmember_rec.scope_state = p_mcm_port->scope_state; > + > + CL_PLOCK_RELEASE( p_rcv->p_lock ); > } > else > { > @@ -1475,10 +1477,6 @@ __osm_mcmr_rcv_leave_mgrp( > "__osm_mcmr_rcv_leave_mgrp: ERR 1B09: " > "osm_sm_mcgrp_leave failed\n" ); > } > - > - CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock); > - /* Note: The deletion of the mgrp itself will be done in the callback > - for the multicast tree updating (osm_mcast_mgr_process_mgrp_cb) */ > } > } > else > @@ -1511,8 +1509,6 @@ __osm_mcmr_rcv_leave_mgrp( > goto Exit; > } > > - CL_PLOCK_RELEASE( p_rcv->p_lock ); > - > /* Send an SA response */ > __osm_mcmr_rcv_respond( p_rcv, p_madw, &mcmember_rec ); > > > > > > _______________________________________________ 
> openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Sat Dec 2 08:01:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Dec 2006 11:01:23 -0500 Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition In-Reply-To: <4571A119.9040408@mellanox.co.il> References: <1165005117.11808.189660.camel@hal.voltaire.com> <4571A119.9040408@mellanox.co.il> Message-ID: <1165075257.11808.230252.camel@hal.voltaire.com> Hi Eitan, On Sat, 2006-12-02 at 10:51, Eitan Zahavi wrote: > Hi Hal, > > I see you are doing some work on optimizing the locking scheme in the > multicast registration flow > (join and leave). Yes and it goes further than this. Additional patch(es) will be coming. So does this look OK to you ? > What kind of testing do you do? Two fold: 1. Tested in another simulated environment 2. Tested in a large cluster where there is a larger join/leave race issue which started us down the road looking at these code paths more > In the simulated environment we do not have currently a test that will > fire pairs of join/leave or > join/leave/join and verify correctness.Maybe we should have one written. Sure; you are welcome to add one. I don't have time to do this now. 
-- Hal

From eitan at mellanox.co.il Sat Dec 2 08:13:27 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 02 Dec 2006 18:13:27 +0200
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
In-Reply-To: <1165075257.11808.230252.camel@hal.voltaire.com>
References: <1165005117.11808.189660.camel@hal.voltaire.com> <4571A119.9040408@mellanox.co.il>
 <1165075257.11808.230252.camel@hal.voltaire.com>
Message-ID: <4571A627.8090501@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-02 at 10:51, Eitan Zahavi wrote:
>
>> Hi Hal,
>>
>> I see you are doing some work on optimizing the locking scheme in the
>> multicast registration flow
>> (join and leave).
>>
>
> Yes and it goes further than this. Additional patch(es) will be coming.
>
> So does this look OK to you ?
>
>
I hope Yevgeny will be able to review the entire flow next week.
>> What kind of testing do you do?
>>
>
> Two fold:
> 1. Tested in another simulated environment
> 2. Tested in a large cluster where there is a larger join/leave race
> issue which started us down the road looking at these code paths more
>
>
>> In the simulated environment we do not have currently a test that will
>> fire pairs of join/leave or
>> join/leave/join and verify correctness. Maybe we should have one written.
>>
>
> Sure; you are welcome to add one. I don't have time to do this now.
>
I will try and get to that next week. I will let you know when it is
available.

From halr at voltaire.com Sat Dec 2 11:29:57 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Dec 2006 14:29:57 -0500
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In
 __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
In-Reply-To: <4571A627.8090501@mellanox.co.il>
References: <1165005117.11808.189660.camel@hal.voltaire.com> <4571A119.9040408@mellanox.co.il> <1165075257.11808.230252.camel@hal.voltaire.com> <4571A627.8090501@mellanox.co.il>
Message-ID: <1165087712.11808.237780.camel@hal.voltaire.com>

On Sat, 2006-12-02 at 11:13, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Sat, 2006-12-02 at 10:51, Eitan Zahavi wrote:
> >
> >> Hi Hal,
> >>
> >> I see you are doing some work on optimizing the locking scheme in the
> >> multicast registration flow
> >> (join and leave).
> >>
> >
> > Yes and it goes further than this. Additional patch(es) will be coming.
> >
> > So does this look OK to you ?
> >
> >
> I hope Yevgeny will be able to review the entire flow next week.
> >> What kind of testing do you do?
> >>
> >
> > Two fold:
> > 1. Tested in another simulated environment
> > 2. Tested in a large cluster where there is a larger join/leave race
> > issue which started us down the road looking at these code paths more

Also, the multicast flows in osmtest. (I forgot to mention those).

> >
> >> In the simulated environment we do not have currently a test that will
> >> fire pairs of join/leave or
> >> join/leave/join and verify correctness. Maybe we should have one written.
> >>
> >
> > Sure; you are welcome to add one. I don't have time to do this now.
> >
> I will try and get to that next week. I will let you know when it is
> available.

Great; Thanks.

-- Hal

From jgunthorpe at obsidianresearch.com Sat Dec 2
11:57:45 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Sat, 2 Dec 2006 12:57:45 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165062394.11808.222794.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org> <1165003608.11808.188882.camel@hal.voltaire.com> <20061201231753.GG32366@obsidianresearch.com> <1165016801.11808.196277.camel@hal.voltaire.com> <20061202012054.GH32366@obsidianresearch.com> <1165062394.11808.222794.camel@hal.voltaire.com>
Message-ID: <20061202195745.GA19174@obsidianresearch.com>

On Sat, Dec 02, 2006 at 07:27:37AM -0500, Hal Rosenstock wrote:

> > I don't know if this is worth pursuing.. Certainly if the main issue
> > is just MLID usage then option 2 is much simpler. Something like this
> > might be part of improving IPv6 ND scalability but that is a different
> > problem entirely (and does anyone care?)..
>
> Is that only an IPoIB issue though or is it more generic and apply to
> other networks ?

If the IB router spec goes down the path of pushing a lot of the
responsibility onto the routers then routers will have similar problems
with joining/tracking a large number of groups. A prefix join concept
might be part of improving that.

I'm not sure what other protocols use extensive multicast like IPv6, but
multicast use on local segments is definitely becoming more common.

Is anyone worried about IPv4 broadcast ARP scalability? With RDMA CM and
MPI all-to-all is that going to be a problem? IPv6 SN is a solution to
that ..

Jason

From surs at cse.ohio-state.edu Sat Dec 2 13:34:56 2006
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sat, 2 Dec 2006 16:34:56 -0500
Subject: [openib-general] RNR_RETRY_EXC_ERR and completion opcode in "send_lat"
Message-ID: <20061202213454.GB31661@cse.ohio-state.edu>

Hi,

I have a question about the "status" field for a completion which is due
to RNR retry exceeded error.
I trivially modified the `send_lat' program (from the Gen2 perftest
directory) to use SRQ and not post receives after some specified time.
Given that the "rnr_retry" attribute of the QP is not 7 (infinite retry),
I'm expecting the sender to get an erroneous completion with
IBV_WC_RNR_RETRY_EXC_ERR.

So far so good ... however, the completion I pull out of the send_cq
lists the opcode of the completion to be IBV_WC_RECV! Is this expected?

I am using OFED 1.1 on dual Intel Xeon machines with Mellanox DDR HCAs
(two ports) and in MemFree mode. The distribution used is RH AS4 (Nahant
Update 3), with kernel version 2.6.17.7.

If someone could explain this behavior, or suggest a workaround, it'd be
great.

TIA,
Sayantan.

=======
<--Print out at client-->
Send Completion wth error at client:
wc.status 13, IBV_WC_RNR_RETRY_EXC_ERR 13, wc.opcode 128
Failed status 13: wr_id 1
scnt=26, rcnt=25, ccnt=0
<--Print out-->

<--Poll CQ snippet-->
	/* poll on scq */
	do {
		ne = ibv_poll_cq(ctx->scq, 1, &wc);
	} while (!user_param->use_event && ne < 1);

	if (ne < 0) {
		fprintf(stderr, "poll SCQ failed %d\n", ne);
		return 12;
	}
	if (wc.status != IBV_WC_SUCCESS) {
		fprintf(stderr, "Send Completion wth error at %s:\n",
			user_param->servername ? "client" : "server");
		fprintf(stderr, "wc.status %d, IBV_WC_RNR_RETRY_EXC_ERR %d, wc.opcode %d\n",
			wc.status, IBV_WC_RNR_RETRY_EXC_ERR, wc.opcode);
		fprintf(stderr, "Failed status %d: wr_id %d\n",
			wc.status, (int) wc.wr_id);
		fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n",
			scnt, rcnt, ccnt);
		{
		...
<--Poll CQ snippet-->

--
http://www.cse.ohio-state.edu/~surs

From swise at opengridcomputing.com Sat Dec 2 14:49:17 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:17 -0600
Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver
Message-ID: <20061202224917.27014.15424.stgit@dell3.ogc.int>

Version 2 changes:

- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-----

The following series implements the Chelsio T3 iWARP/RDMA Driver to be
considered for inclusion in 2.6.20. It depends on the Chelsio T3
Ethernet Driver which is also under review now for 2.6.20. See:

http://www.mail-archive.com/netdev at vger.kernel.org/msg26619.html

The patches are against 2.6.19.

This patch series can also be pulled from:

http://www.opengridcomputing.com/downloads/iw_cxgb3_patches_v2.tar.bz2

The Chelsio T3 Ethernet Driver patch can be pulled from:

http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.

From swise at opengridcomputing.com Sat Dec 2 14:49:27 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:27 -0600
Subject: [openib-general] [PATCH v2 01/13] Linux RDMA Core Changes
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202224927.27014.24669.stgit@dell3.ogc.int>

Support provider-specific data in ib_uverbs_cmd_req_notify_cq(). The
Chelsio iwarp provider library needs to pass information to the kernel
verb for re-arming the CQ.

Signed-off-by: Steve Wise --- drivers/infiniband/core/uverbs_cmd.c | 9 +++++++-- drivers/infiniband/hw/amso1100/c2.h | 2 +- drivers/infiniband/hw/amso1100/c2_cq.c | 3 ++- drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 ++- drivers/infiniband/hw/ehca/ehca_reqs.c | 3 ++- drivers/infiniband/hw/ipath/ipath_cq.c | 4 +++- drivers/infiniband/hw/ipath/ipath_verbs.h | 3 ++- drivers/infiniband/hw/mthca/mthca_cq.c | 6 ++++-- drivers/infiniband/hw/mthca/mthca_dev.h | 4 ++-- include/rdma/ib_verbs.h | 5 +++-- 10 files changed, 28 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 743247e..5dd1de9 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_udata udata; struct ib_cq *cq; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i if (!cq) return -EINVAL; - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + INIT_UDATA(&udata, buf + sizeof cmd, 0, + in_len - sizeof cmd, 0); + + cq->device->req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP, + &udata); put_cq_read(cq); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h index 1b17dcd..716f9dc 100644 --- a/drivers/infiniband/hw/amso1100/c2.h +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata); /* CM */ extern int c2_llp_connect(struct iw_cm_id *cm_id, diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 05c9154..7ce8bca 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n return npolled; } -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct c2_mq_shared __iomem *shared; struct c2_cq *cq; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 3720e30..566b30c 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata); struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index b46bda1..3ed6992 100644 --- 
a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -634,7 +634,8 @@ poll_cq_exit0: return ret; } -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 87462e0..27ba4db 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq) * ipath_req_notify_cq - change the notification type for a completion queue * @ibcq: the completion queue * @notify: the type of notification to request + * @udata: user data * * Returns 0 for success. * * This may be called from interrupt context. Also called by * ib_req_notify_cq() in the generic verbs code. */ -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); unsigned long flags; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 8039f6e..0d39960 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_ int ipath_destroy_cq(struct ib_cq *ibcq); -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata); int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 149b369..ec7bb79 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -723,7 +723,8 @@ 
repoll: return err == 0 || err == -EAGAIN ? npolled : err; } -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, + struct ib_udata *udata) { __be32 doorbell[2]; @@ -740,7 +741,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, return 0; } -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct mthca_cq *cq = to_mcq(ibcq); __be32 doorbell[2]; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index fe5cecf..6b9ccf6 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 8eacc35..e3e1a2c 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -941,7 +941,8 @@ struct ib_device { struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); int (*req_notify_cq)(struct ib_cq *cq, - enum ib_cq_notify cq_notify); + enum ib_cq_notify cq_notify, + struct ib_udata *udata); int (*req_ncomp_notif)(struct ib_cq *cq, int wc_cnt); struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_ static inline int ib_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) { - return 
cq->device->req_notify_cq(cq, cq_notify); + return cq->device->req_notify_cq(cq, cq_notify, NULL); } /** From swise at opengridcomputing.com Sat Dec 2 14:49:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:49:37 -0600 Subject: [openib-general] [PATCH v2 02/13] Device Discovery and ULLD Linkage In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202224937.27014.951.stgit@dell3.ogc.int> Code to discover all the T3 devices and register them with the T3 RDMA Core and the Linux RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 189 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 175 +++++++++++++++++++++++++++++++++ 2 files changed, 364 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c new file mode 100644 index 0000000..acbe449 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -0,0 +1,189 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define DRV_VERSION "1.1" + +MODULE_AUTHOR("Boyd Faulkner, Steve Wise"); +MODULE_DESCRIPTION("Chelsio T3 RDMA Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; + +static void open_rnic_dev(struct t3cdev *); +static void close_rnic_dev(struct t3cdev *); + +struct cxgb3_client t3c_client = { + .name = "iw_cxgb3", + .add = open_rnic_dev, + .remove = close_rnic_dev, + .handlers = t3c_handlers, + .redirect = iwch_ep_redirect +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(dev_mutex); + +static void rnic_init(struct iwch_dev *rnicp) +{ + PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); + idr_init(&rnicp->cqidr); + idr_init(&rnicp->qpidr); + idr_init(&rnicp->mmidr); + spin_lock_init(&rnicp->lock); + + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL 
<< 24) - 1; + rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev); + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 8; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1; + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + rnicp->attr.cq_overflow_detection = 1; + return; +} + +static void open_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + static int vers_printed; + + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + if (!vers_printed++) + printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + DRV_VERSION); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR MOD "Cannot allocate ib device\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + printk(KERN_ERR MOD "Unable to open CXIO rdev\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + rnic_init(rnicp); + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR MOD "Unable to register device\n"); + close_rnic_dev(tdev); + } + printk(KERN_INFO MOD "Initialized device %s\n", + pci_name(rnicp->rdev.rnic_info.pdev)); + return; +} + +static void close_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + mutex_lock(&dev_mutex); + 
list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + idr_destroy(&dev->cqidr); + idr_destroy(&dev->qpidr); + idr_destroy(&dev->mmidr); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + cxgb3_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + cxgb3_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..411bfcd --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include +#include +#include +#include + +#include + +#include "cxio_hal.h" +#include "cxgb3_offload.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. + * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. 
+ */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct idr cqidr; + struct idr qpidr; + struct idr mmidr; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +static inline int t3b_device(struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3B); +} + +static inline int t3a_device(struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3A); +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid) +{ + return idr_find(&rhp->cqidr, cqid); +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid) +{ + return idr_find(&rhp->qpidr, qpid); +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) +{ + return idr_find(&rhp->mmidr, mmid); +} + +static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, + void *handle, u32 id) +{ + int ret; + u32 newid; + + do { + if (!idr_pre_get(idr, GFP_KERNEL)) { + return -ENOMEM; + } + spin_lock_irq(&rhp->lock); + ret = idr_get_new_above(idr, handle, id, &newid); + BUG_ON(newid != id); + spin_unlock_irq(&rhp->lock); + } while (ret == -EAGAIN); + + return ret; +} + +static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id) +{ + spin_lock_irq(&rhp->lock); + idr_remove(idr, id); + spin_unlock_irq(&rhp->lock); +} + +extern struct cxgb3_client t3c_client; +extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif From swise at opengridcomputing.com Sat Dec 2 14:49:47 
2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:49:47 -0600 Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data Structures In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202224947.27014.59189.stgit@dell3.ogc.int> Provider methods to support the Linux RDMA verbs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_provider.c | 1170 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 362 ++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 68 ++ 3 files changed, 1600 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..4bef081 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1170 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + struct iwch_dev *rhp = to_iwch_dev(context->device); + struct iwch_ucontext *ucontext = to_iwch_ucontext(context); + PDBG("%s context %p\n", __FUNCTION__, context); + cxio_release_ucontext(&rhp->rdev, &ucontext->uctx); + kfree(ucontext); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext 
*context; + struct iwch_dev *rhp = to_iwch_dev(ibdev); + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + cxio_init_ucontext(&rhp->rdev, &context->uctx); + INIT_LIST_HEAD(&context->mmaps); + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq); + chp = to_iwch_cq(ib_cq); + + remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid); + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + if (t3a_device(rhp)) { + + /* + * T3A: Add some fluff to handle extra CQEs inserted + * for various errors. + * Additional CQE possibilities: + * TERMINATE, + * incoming RDMA WRITE Failures + * incoming RDMA READ REQUEST FAILUREs + * NOTE: We cannot ensure the CQ won't overflow. 
+ */ + entries += 16; + } + entries = roundup_pow_of_two(entries); + chp->cq.size_log2 = long_log2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid); + + if (context) { + struct iwch_mm_entry *mm; + + mm = kmalloc(sizeof *mm, GFP_KERNEL); + if (!mm) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-ENOMEM); + } + uresp.cqid = chp->cq.cqid; + uresp.size_log2 = chp->cq.size_log2; + uresp.physaddr = virt_to_phys(chp->cq.queue); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm); + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + mm->addr = uresp.physaddr; + mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * + sizeof (struct t3_cqe)); + insert_mmap(to_iwch_ucontext(context), mm); + } + PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = long_log2(cqe); + + /* Dont allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + unsigned long flag; + struct iwch_req_notify_cq ucmd; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + if (udata && t3b_device(rhp)) { + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + spin_lock_irqsave(&chp->lock, flag); + chp->cq.rptr = ucmd.rptr; + } else + spin_lock_irqsave(&chp->lock, flag); + PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flag); + if (err) + printk(KERN_ERR MOD "Error %d 
rearming CQID 0x%x\n", err, + chp->cq.cqid); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT; + struct cxio_rdev *rdev_p; + int ret = 0; + struct iwch_mm_entry *mm; + struct iwch_ucontext *ucontext; + + PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, + pgaddr, len); + + if (vma->vm_start & (PAGE_SIZE-1)) { + return -EINVAL; + } + + rdev_p = &(to_iwch_dev(context->device)->rdev); + ucontext = to_iwch_ucontext(context); + + mm = remove_mmap(ucontext, pgaddr, len); + if (!mm) + return -EINVAL; + kfree(mm); + + if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && + (pgaddr < (rdev_p->rnic_info.udbell_physbase + + rdev_p->rnic_info.udbell_len))) { + + /* + * Map T3 DB register. + */ + if (vma->vm_flags & VM_READ) { + return -EPERM; + } + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; + vma->vm_flags &= ~VM_MAYREAD; + ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } else { + + /* + * Map WQ or CQ contig dma memory... 
+ */ + ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } + + return ret; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + + php = to_iwch_pd(pd); + rhp = php->rhp; + PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid); + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + u32 mmid; + + PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr); + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mmid = mhp->attr.stag >> 8; + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, + mhp->attr.pbl_addr); + remove_handle(rhp, &rhp->mmidr, mmid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); + kfree(mhp); + return 0; +} + +static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + __be64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; 
+ struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* NOTE: TPT perms are backwards from BIND WR perms! */ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + __be64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd); + + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + 
+ memcpy(&mh, mhp, sizeof *mhp); + + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + __be64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + struct iwch_reg_user_mr_resp uresp; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, &region->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, &region->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= 
(acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + + if (udata && t3b_device(rhp)) { + uresp.pbl_addr = (mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3; + PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, + uresp.pbl_addr); + + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_dereg_mr(&mhp->ibmr); + err = -EFAULT; + goto err; + } + } + + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + + /* + * T3 only supports 32 bits of size. + */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + u32 mmid; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + mmid = (mw->rkey) >> 8; + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + remove_handle(rhp, &rhp->mmidr, mmid); + 
kfree(mhp); + PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + struct iwch_ucontext *ucontext; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) + : NULL; + cxio_destroy_qp(&rhp->rdev, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx); + + PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, + ib_qp, qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + struct iwch_ucontext *ucontext; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize > T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * NOTE: The SQ and 
total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. + */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = long_log2(wqsize); + qhp->wq.rq_size_log2 = long_log2(rqsize); + qhp->wq.sq_size_log2 = long_log2(sqsize); + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - These don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... 
I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid); + + if (udata) { + + struct iwch_mm_entry *mm1, *mm2; + + mm1 = kmalloc(sizeof *mm1, GFP_KERNEL); + if (!mm1) { + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + mm2 = kmalloc(sizeof *mm2, GFP_KERNEL); + if (!mm2) { + kfree(mm1); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + uresp.qpid = qhp->wq.qpid; + uresp.size_log2 = qhp->wq.size_log2; + uresp.sq_size_log2 = qhp->wq.sq_size_log2; + uresp.rq_size_log2 = qhp->wq.rq_size_log2; + uresp.physaddr = virt_to_phys(qhp->wq.queue); + uresp.doorbell = qhp->wq.udb; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm1); + kfree(mm2); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + mm1->addr = uresp.physaddr; + mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr)); + insert_mmap(ucontext, mm1); + mm2->addr = uresp.doorbell & PAGE_MASK; + mm2->len = PAGE_SIZE; + insert_mmap(ucontext, mm2); + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("%s sq_num_entries %d, rq_num_entries %d " + "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n", + __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries, + qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* 
Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? + (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn); + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s ibdev %p, port %d, index %d, gid %p\n", + __FUNCTION__, ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + + dev 
= to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + 
lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << 
IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + 
dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + return 0; +bail2: + ib_unregister_device(&dev->ibdev); +bail1: + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..76616ac --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,362 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 +}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u32 scq; + u32 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. 
*/ + u8 enable_bind; + u8 enable_mmid0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If specify the current state, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u32 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +struct iwch_ucontext { + struct ib_ucontext ibucontext; + struct cxio_ucontext uctx; + struct list_head mmaps; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +struct iwch_mm_entry { + struct list_head entry; + u64 addr; + unsigned len; +}; + +static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext, + u64 addr, unsigned len) +{ + struct list_head *pos, *nxt; + struct iwch_mm_entry *mm; + + mutex_lock(&ucontext->uctx.lock); + list_for_each_safe(pos, nxt, &ucontext->mmaps) { + + mm = list_entry(pos, struct iwch_mm_entry, entry); + if (mm->addr == addr && mm->len == len) { + list_del_init(&mm->entry); + mutex_unlock(&ucontext->uctx.lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, 
+ mm->len); + return mm; + } + } + mutex_unlock(&ucontext->uctx.lock); + return NULL; +} + +static inline void insert_mmap(struct iwch_ucontext *ucontext, + struct iwch_mm_entry *mm) +{ + mutex_lock(&ucontext->uctx.lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len); + list_add_tail(&mm->entry, &ucontext->mmaps); + mutex_unlock(&ucontext->uctx.lock); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + 
IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_mmid_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg); +int iwch_register_device(struct iwch_dev *dev); +void iwch_unregister_device(struct iwch_dev *dev); +int iwch_quiesce_qps(struct iwch_cq *chp); +int iwch_resume_qps(struct iwch_cq *chp); +void stop_read_rep_timer(struct iwch_qp *qhp); +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list); +int iwch_reregister_mem(struct 
iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages); +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list); + + +#define IWCH_NODE_DESC "cxgb3 Chelsio Communications" + +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..4e4b9c9 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u64 physaddr; + __u32 cqid; + __u32 size_log2; +}; + +struct iwch_create_qp_resp { + __u64 physaddr; + __u64 doorbell; + __u32 qpid; + __u32 size_log2; + __u32 sq_size_log2; + __u32 rq_size_log2; +}; + +struct iwch_reg_user_mr_resp { + __u32 pbl_addr; +}; + +struct iwch_req_notify_cq { + __u32 rptr; +}; +#endif From swise at opengridcomputing.com Sat Dec 2 14:49:58 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:49:58 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202224958.27014.65970.stgit@dell3.ogc.int> This code implements the iWARP CM provider methods for the Chelsio driver. The Chelsio ULLD is used to set up and tear down TCP connections, and the T3 RDMA Core is used to move the connections in and out of RDMA mode. 
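For readers who want a feel for the flow before reading the patch, here is a rough userspace sketch of the active-side endpoint life cycle: TCP setup, MPA negotiation, then RDMA (FPDU) mode. It is not the driver code. The names ep_sketch, ep_state_sketch, ep_init, ep_advance and ep_state are made up for illustration, and a C11 atomic compare-exchange stands in for the spinlock-protected state_comp_exch() helper in iwch_cm.c.

```c
/* Userspace illustration only; every name here is hypothetical. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Reduced set of endpoint states, loosely following the states[] table. */
enum ep_state_sketch {
	EP_IDLE,
	EP_CONNECTING,   /* TCP connection is being set up */
	EP_MPA_REQ_SENT, /* MPA request sent, waiting for the reply */
	EP_FPDU_MODE,    /* connection has been moved into RDMA mode */
	EP_CLOSING,
	EP_DEAD
};

struct ep_sketch {
	_Atomic int state;
};

static inline void ep_init(struct ep_sketch *ep)
{
	atomic_init(&ep->state, EP_IDLE);
}

/*
 * Move to 'next' only if the endpoint is currently in 'expected'.
 * Returns true when the transition happened. This assumes an atomic
 * compare-exchange is an acceptable stand-in for taking the ep lock.
 */
static inline bool ep_advance(struct ep_sketch *ep,
			      enum ep_state_sketch expected,
			      enum ep_state_sketch next)
{
	int cur = expected;

	return atomic_compare_exchange_strong(&ep->state, &cur, next);
}

static inline enum ep_state_sketch ep_state(struct ep_sketch *ep)
{
	return (enum ep_state_sketch)atomic_load(&ep->state);
}
```

A caller would step an endpoint with ep_advance(&ep, EP_IDLE, EP_CONNECTING) and so on. A transition requested from the wrong state is refused, which is the property the compare-and-exchange style is meant to give when timers and CPL handlers race on the same endpoint.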
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2059 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 223 ++++ 2 files changed, 2282 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..5c59396 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2059 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "tcb.h" +#include "cxgb3_offload.h" +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. (default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static int snd_win = 512 * 1024; +module_param(snd_win, int, 0444); +MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)"); + +static unsigned int nocong = 1; +module_param(nocong, uint, 0444); +MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)"); + +static void process_work(void *ctx); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work, NULL); + +static struct sk_buff_head rxq; +static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + if 
(timer_pending(&ep->timer)) { + PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + } else + get_ep(&ep->com); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + put_ep(&ep->com); +} + +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) +{ + struct cpl_tid_release *req; + + skb = get_skb(skb, sizeof *req, GFP_KERNEL); + if (!skb) + return; + req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); + skb->priority = CPL_PRIORITY_SETUP; + tdev->send(tdev, skb); + return; +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, 
ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) + ep->emss -= 12; + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +static int state_comp_exch(struct iwch_ep_common *epc, + enum iwch_ep_state comp, + enum iwch_ep_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&epc->lock, flags); + ret = (epc->state == comp); + if (ret) + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return ret; +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], + states[new]); + epc->state = new; + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + kref_init(&epc->kref); + spin_lock_init(&epc->lock); + init_waitqueue_head(&epc->waitq); + } + PDBG("%s alloc ep %p\n", __FUNCTION__, epc); + return (void *) epc; +} + +void __free_ep(struct kref *kref) +{ + struct iwch_ep_common *epc; + epc = container_of(kref, struct iwch_ep_common, kref); + PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]); + kfree(epc); +} + +static void release_ep_resources(struct iwch_ep 
*ep) +{ + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + state_set(&ep->com, DEAD); + cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, ep->hwtid, NULL); + put_ep(&ep->com); +} + +static void process_work(void *ctx) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + put_ep((struct iwch_ep_common *)ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip, + __be32 peer_ip, __be16 local_port, + __be16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + if (ip_route_output_flow(&rt, &fl, NULL, 0)) + return NULL; + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + req->cmd = CPL_ABORT_NO_RST; + cxgb3_ofld_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + 
return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + cxgb3_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) +{ + PDBG("%s ep %p\n", __FILE__, ep); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, GFP_KERNEL); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 
0, sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p tid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, + ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + if (state_read(&ep->parent_ep->com) != DEAD) + 
ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + put_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) + return; + state_set(&ep->com, FPDU_MODE); + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * copy the new data into our accumulation buffer. 
+ */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * if we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 
1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) + goto out; +err: + abort_connection(ep, skb); +out: + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) + return; + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA Header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb); + return; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s Unexpected streaming data." 
+ " ep %p state %d tid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* + * The ep will timeout and inform the ULP of the failure. + * See ep_timeout(). + */ + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us it's just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + + if (credits == 0) + return CPL_RET_BUF_DONE; + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (!ep->com.rpl_err) { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + close_complete_upcall(ep); + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void 
*ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status, + status2errno(rpl->status)); + connect_reply_upcall(ep, status2errno(rpl->status)); + state_set(&ep->com, DEAD); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, GET_TID(rpl), NULL); + cxgb3_free_atid(ep->com.tdev, ep->atid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, + rpl->status, status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, 
sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip, + struct 
sk_buff *skb) +{ + PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, + peer_ip); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(struct cpl_tid_release)); + skb_get(skb); + + if (tdev->type == T3B) + release_tid(tdev, hwtid, skb); + else { + struct cpl_pass_accept_rpl *rpl; + + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, + hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); + } +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + get_ep(&parent_ep->com); + child_ep->parent_ep = parent_ep; + child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, 
void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + int ret; + int abort = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + dst_confirm(ep->dst); + switch (state_read(&ep->com)) { + case MPA_REQ_WAIT: + state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. + */ + state_set(&ep->com, CLOSING); + get_ep(&ep->com); + break; + case MPA_REP_SENT: + state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + state_set(&ep->com, CLOSING); + peer_close_upcall(ep); + attrs.next_state = IWCH_QP_STATE_CLOSING; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD "%s - qp <- closing err!\n", + __FUNCTION__); + abort = 1; + } + break; + case ABORTING: + goto out; + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + goto out; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + goto out; + case DEAD: + goto out; + default: + BUG_ON(1); + } + iwch_ep_disconnect(ep, abort, GFP_KERNEL); +out: + return CPL_RET_BUF_DONE; +} + +/* + 
* Returns whether an ABORT_REQ_RSS message is a negative advice. + */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + get_ep(&ep->com); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + switch (state_read(&ep->com)) { + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + break; + case DEAD: + default: + BUG_ON(1); + break; + } 
+ + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumer consumption. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_rdma_ec_status *rep = cplhdr(skb); + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, + rep->status); + if (rep->status) { + struct iwch_qp_attributes attrs; + + printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n", + __FUNCTION__, ep->hwtid); + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + abort_connection(ep, NULL); + } + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) { + struct sk_buff *skb; + + connect_reply_upcall(ep, -ETIMEDOUT); + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + if 
(state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) { + struct sk_buff *skb; + + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) { + struct sk_buff *skb; + + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + put_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) + abort_connection(ep, NULL); + else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_qp *qp = get_qhp(h, conn_param->qpn); + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + BUG_ON(!qp); + + if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) || + (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) { + abort_connection(ep, NULL); + return -EINVAL; + } + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = qp; + + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird %d ord %d\n", __FUNCTION__, 
__LINE__, ep->ird, ep->ord); + get_ep(&ep->com); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + put_ep(&ep->com); + return err; + } + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + } else { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + put_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, + ep->com.qp, cm_id); + + /* + * Allocate an active TID to initiate a TCP connection. 
+ */ + ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, + ep->dst->neighbour, + ep->dst->neighbour->dev->if_port); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) + goto out; + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + cxgb3_free_atid(ep->com.tdev, ep->atid); +fail2: + put_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + + might_sleep(); + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) + goto fail3; + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + cxgb3_free_stid(ep->com.tdev, ep->stid); +fail2: + put_ep(&ep->com); +fail1: +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + cxgb3_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + put_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret=0; + int state; + + + state = state_read(&ep->com); + PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, + states[state], abrupt); + if (state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + return 0; + } + if (abrupt) { + if (state != ABORTING) { + state_set(&ep->com, ABORTING); + ret = send_abort(ep, NULL, gfp); + } + } else { + + if (state != CLOSING) + state_set(&ep->com, CLOSING); + else { + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + } + + ret = send_halfclose(ep, gfp); + } + return ret; +} + +int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, + l2t); + dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + ep->dst = new; + 
return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + get_ep(epc); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..893f9d0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,223 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include +#include + +#include +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! 
*/ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define put_ep(ep) { \ + PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_put(&((ep)->kref), __free_ep); \ +} + +#define get_ep(ep) { \ + PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_get(&((ep)->kref)); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + __be16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + __be16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_layers_types { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct 
iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + struct kref kref; + spinlock_t lock; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535< References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225008.27014.4428.stgit@dell3.ogc.int> Code to manipulate the QP. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++ 1 files changed, 1007 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..9f6b251 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,1007 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + u32 plen; + + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved[0] = 0; + wqe->send.reserved[1] = 0; + wqe->send.reserved[2] = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = __constant_cpu_to_be32(0); + wqe->send.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 5; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + wqe->send.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int i; + u32 plen; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->write.rdmaop = T3_RDMA_WRITE; + wqe->write.reserved[0] = 0; + wqe->write.reserved[1] = 0; + wqe->write.reserved[2] = 0; + 
wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey); + wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr); + + if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + plen = 4; + wqe->write.sgl[0].stag = wr->imm_data; + wqe->write.sgl[0].len = __constant_cpu_to_be32(0); + wqe->write.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 6; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->write.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->write.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->write.sgl[i].to = + cpu_to_be64(wr->sg_list[i].addr); + } + wqe->write.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 5 + ((wr->num_sge) << 1); + } + wqe->write.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + if (wr->num_sge > 1) + return -EINVAL; + wqe->read.rdmaop = T3_READ_REQ; + wqe->read.reserved[0] = 0; + wqe->read.reserved[1] = 0; + wqe->read.reserved[2] = 0; + wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr); + wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey); + wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length); + wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr); + *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3; + return 0; +} + +/* + * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. 
+ */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (!mhp->attr.state) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (mhp->attr.zbva) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + + if (sg_list[i].addr < mhp->attr.va_fbo) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = ((mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3) + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = 
cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapters address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + sqp = qhp->wq.sq + + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* T3 reads are always signaled */ + err = iwch_build_rdma_read(wqe, wr, 
&t3_wr_flit_cnt); + if (err) + break; + sqp->read_len = wqe->read.local_len; + if (!qhp->wq.oldest_read) + qhp->wq.oldest_read = sqp; + break; + default: + PDBG("%s post of type=%d TBD!\n", __FUNCTION__, + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp->wr_id = wr->wr_id; + sqp->opcode = wr2opcode(t3_wr_opcode); + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED); + + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", + __FUNCTION__, wr->wr_id, idx, + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), + sqp->opcode); + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); 
+ PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x " + "wqe %p \n", __FUNCTION__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + unsigned long flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp = 
qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + sqp->wr_id = mw_bind->wr_id; + sqp->opcode = T3_BIND_MW; + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED); + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + 
*ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + union t3_wr *wqe; + struct terminate_message *term; + int status; + int tagged = 0; + struct sk_buff *skb; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + skb = alloc_skb(40, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (union t3_wr *)skb_put(skb, 40); + memset(wqe, 0, 40); + wqe->send.rdmaop = T3_TERMINATE; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + if (rsp_msg) { + status = CQE_STATUS(rsp_msg->cqe); + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + } else { + status = TPT_ERR_INTERNAL_ERR; + } + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, + qhp->ep->hwtid, 5); + skb->priority = CPL_PRIORITY_DATA; + return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb)); +} + +/* + * Assumes qhp lock is held. + */ +static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + struct iwch_cq *rchp, *schp; + int count; + + rchp = get_chp(qhp->rhp, qhp->attr.rcq); + schp = get_chp(qhp->rhp, qhp->attr.scq); + + PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp); + /* take a ref on the qhp since we must release the lock */ + atomic_inc(&qhp->refcnt); + spin_unlock_irqrestore(&qhp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. */ + spin_lock_irqsave(&rchp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&rchp->cq); + cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); + cxio_flush_rq(&qhp->wq, &rchp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&rchp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. 
*/ + spin_lock_irqsave(&schp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&schp->cq); + cxio_count_scqes(&schp->cq, &qhp->wq, &count); + cxio_flush_sq(&qhp->wq, &schp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&schp->lock, *flag); + + /* deref */ + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); + + spin_lock_irqsave(&qhp->lock, *flag); +} + +static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + if (t3b_device(qhp->rhp)) + cxio_set_wq_in_error(&qhp->wq); + else + __flush_qp(qhp, flag); +} + + +/* + * Return non zero if at least one RECV was pre-posted. + */ +static inline int rqes_posted(struct iwch_qp *qhp) +{ + return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV); +} + +static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs) +{ + struct t3_rdma_init_attr init_attr; + int ret; + + init_attr.tid = qhp->ep->hwtid; + init_attr.qpid = qhp->wq.qpid; + init_attr.pdid = qhp->attr.pd; + init_attr.scqid = qhp->attr.scq; + init_attr.rcqid = qhp->attr.rcq; + init_attr.rq_addr = qhp->wq.rq_addr; + init_attr.rq_size = 1 << qhp->wq.rq_size_log2; + init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | + qhp->attr.mpa_attr.recv_marker_enabled | + (qhp->attr.mpa_attr.xmit_marker_enabled << 1) | + (qhp->attr.mpa_attr.crc_enabled << 2); + + /* + * XXX - The IWCM doesn't quite handle getting these + * attrs set before going into RTS. For now, just turn + * them on always... 
+ */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0; + PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d " + "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, + init_attr.rq_addr, init_attr.rq_size, + init_attr.flags, init_attr.qpcaps); + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + unsigned long flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, + qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? 
attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. 
+ */ + get_ep(&qhp->ep->com); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + put_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + __FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); 
+ + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); + flush_qp(qhp, &flag); +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + put_ep(&ep->com); + + PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) { + resume_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} From swise at opengridcomputing.com Sat Dec 2 14:50:18 2006 From: swise at opengridcomputing.com (Steve 
Wise) Date: Sat, 02 Dec 2006 16:50:18 -0600 Subject: [openib-general] [PATCH v2 06/13] Completion Queues In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225018.27014.78386.stgit@dell3.ogc.int> Functions to manipulate CQs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cq.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..9d82df4 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. + * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (t3a_device(chp->rhp) && credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + wc->vendor_err = CQE_STATUS(cqe); + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x " + "lo 0x%x cookie 0x%llx\n", __FUNCTION__, + CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case 
T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + printk(KERN_ERR MOD "Unexpected opcode %d " + "in the CQE received for QPID=0x%0x\n", + CQE_OPCODE(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + goto out; + } + } + + if (cqe_flushed) + wc->status = IB_WC_WR_FLUSH_ERR; + else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + case TPT_ERR_SWFLUSH: + wc->status = IB_WC_WR_FLUSH_ERR; + break; + default: + printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for " + "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * 
Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. + */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} From swise at opengridcomputing.com Sat Dec 2 14:50:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:50:28 -0600 Subject: [openib-general] [PATCH v2 07/13] Async Event Handler In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225028.27014.27124.stgit@dell3.ogc.int> Code to handle async events coming from the T3 RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 228 +++++++++++++++++++++++++++++++++ 1 files changed, 228 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..bf767b2 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,228 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe), + CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + if ((qhp->attr.state == IWCH_QP_STATE_ERROR) || + (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) { + PDBG("%s AE received after RTS - " + "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, + qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + + attrs.next_state = IWCH_QP_STATE_TERMINATE; + if (send_term && (qhp->attr.state == IWCH_QP_STATE_RTS) && + !iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 1)) + iwch_post_terminate(qhp, rsp_msg); + + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) 
skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + u32 cqid = RSPQ_CQID(rsp_msg); + + rnicp = (struct iwch_dev *) rdev_p->ulp; + spin_lock(&rnicp->lock); + chp = get_chp(rnicp, cqid); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + if (!chp || !qhp) { + printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d " + "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", + cqid, CQE_QPID(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), + CQE_WRID_LOW(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + /* + * 1) completion of our sending a TERMINATE. + * 2) incoming TERMINATE message. + */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s QPID 0x%x ep %p disconnecting\n", + __FUNCTION__, qhp->wq.qpid, qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, + qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", + CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Sat Dec 2 14:50:38 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:50:38 -0600 Subject: [openib-general] [PATCH v2 08/13] Memory Registration 
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225038.27014.90811.stgit@dell3.ogc.int> Functions to register memory regions. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 170 ++++++++++++++++++++++++++++++++ 1 files changed, 170 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..774d11e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list) +{ + u32 stag; + u32 mmid; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages) +{ + u32 stag; + u32 mmid; + + + /* We could support this... 
*/ + if (npages > mhp->attr.pbl_size) + return -ENOMEM; + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ + for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) + return -EINVAL; + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) + return -ENOMEM; + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << 
*shift)); + + PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + + return 0; + +} From swise at opengridcomputing.com Sat Dec 2 14:50:48 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:50:48 -0600 Subject: [openib-general] [PATCH v2 09/13] Core WQE/CQE Types In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225048.27014.69535.stgit@dell3.ogc.int> T3 WQE and CQE structures, defines, etc... Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 685 ++++++++++++++++++++++++++++ 1 files changed, 685 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..45870be --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,685 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + __be32 stag; + __be32 len; + __be64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be32 plen; /* 3 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 stag; /* 2 */ + __be32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh wrh; 
/* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 stag_sink; + __be64 to_sink; /* 3 */ + __be32 plen; /* 4 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be64 rem_to; /* 3 */ + __be32 local_stag; /* 4 */ + __be32 local_len; + __be64 local_to; /* 5 */ +}; + +enum t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u16 reserved; /* 2 */ + u8 type; + u8 perms; + __be32 mr_stag; + __be32 mw_stag; /* 3 */ + __be32 mw_len; + __be64 mw_va; /* 4 */ + __be32 mr_pbl_addr; /* 5 */ + u8 reserved2[3]; + u8 mr_pagesz; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + __be32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + __be32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 flags; /* 2 */ + __be32 quiesce; /* 2 */ + __be32 max_ird; /* 3 */ + __be32 max_ord; /* 3 */ + __be64 sge_cmd; /* 4 */ + __be64 ctx1; /* 5 */ + __be64 ctx0; /* 6 */ +}; + +enum t3_modify_qp_flags { + MODQP_QUIESCE = 0x01, + MODQP_MAX_IRD = 0x02, + MODQP_MAX_ORD = 0x04, + MODQP_WRITE_EC = 0x08, + MODQP_READ_EC = 0x10, +}; + + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + uP_RI_QP_RDMA_READ_ENABLE = 0x01, 
+ uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u32 flags; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 qpid; /* 2 */ + __be32 pdid; + __be32 scqid; /* 3 */ + __be32 rcqid; + __be32 rq_addr; /* 4 */ + __be32 rq_size; + u8 mpaattrs; /* 5 */ + u8 qpcaps; + __be16 ulpdu_size; + __be32 flags; /* bits 31-1 - reserved */ + /* bit 0 - set if RECV posted */ + __be32 ord; /* 6 */ + __be32 ird; + __be64 qp_dma_addr; /* 7 */ + __be32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +struct t3_genbit { + u64 flit[15]; + __be64 genbit; +}; + +enum rdma_init_wr_flags { + RECVS_POSTED = 1, +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + struct t3_genbit genbit; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) +{ + return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags)); +} + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit... 
*/ + ((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + __be32 valid_stag_pdid; + __be32 flags_pagesize_qpid; + + __be32 rsvd_pbl_addr; + __be32 len; + __be32 va_hi; + __be32 va_low_or_fbo; + + __be32 rsvd_bind_cnt_or_pstag; + __be32 rsvd_pbl_size; +}; + +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 +#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << 
S_TPT_MW_BIND_ENABLE) +#define F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + __be32 header; + __be32 len; + union { + struct { + __be32 stag; + __be32 msn; + } rcqe; + struct { + u32 wrid_hi; + u32 wrid_low; + } scqe; + } u; +}; + +#define S_CQE_OOO 31 +#define M_CQE_OOO 0x1 +#define G_CQE_OOO(x) ((((x) >> S_CQE_OOO)) & M_CQE_OOO) +#define V_CEQ_OOO(x) ((x)<> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<queue->flit[13] = 1; +} + +static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if 
(!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + return NULL; +} + +static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +#endif From swise at opengridcomputing.com Sat Dec 2 14:50:58 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:50:58 -0600 Subject: [openib-general] [PATCH v2 10/13] Core HAL In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225058.27014.33454.stgit@dell3.ogc.int> The RDMA Core interfaces with the T3 HW and ULLD providing a low level RDMA interface. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_hal.h | 201 ++++ 2 files changed, 1503 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c new file mode 100644 index 0000000..367c834 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -0,0 +1,1302 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include + +#include +#include +#include +#include + +#include "cxio_resource.h" +#include "cxio_hal.h" +#include "cxgb3_offload.h" +#include "sge_defs.h" + +static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC]; +static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL; + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (!strcmp(rdev_tbl[i]->dev_name, dev_name)) + return rdev_tbl[i]; + return NULL; +} + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev + *tdev) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (rdev_tbl[i]->t3cdev_p == tdev) + return rdev_tbl[i]; + return NULL; +} + +static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (!rdev_tbl[i]) { + rdev_tbl[i] = rdev_p; + break; + } + return (i == T3_MAX_NUM_RNIC); +} + +static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i] == rdev_p) { + rdev_tbl[i] = NULL; + break; + } +} + +int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit) +{ + int ret; + struct t3_cqe *cqe; + u32 rptr; + + struct rdma_cq_op setup; + setup.id = cq->cqid; + setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0; + setup.op = op; + ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup); + + if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) + return ret; + + /* + * If the rearm returned an index other than our current index, + * then there might be CQE's in flight (being DMA'd). We must wait + * here for them to complete or the consumer can miss a notification. + */ + if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) { + int i=0; + + rptr = cq->rptr; + + /* + * Keep the generation correct by bumping rptr until it + * matches the index returned by the rearm - 1. 
+ */ + while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret) + rptr++; + + /* + * Now rptr is the index for the (last) cqe that was + * in-flight at the time the HW rearmed the CQ. We + * spin until that CQE is valid. + */ + cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2); + while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) { + udelay(1); + if (i++ > 1000000) { + BUG_ON(1); + printk(KERN_ERR "%s: stalled rnic\n", + rdev_p->dev_name); + return -EIO; + } + } + } + return 0; +} + +static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid) +{ + struct rdma_cq_setup setup; + setup.id = cqid; + setup.base_addr = 0; /* NULL address */ + setup.size = 0; /* disable the CQ */ + setup.credits = 0; + setup.credit_thres = 0; + setup.ovfl_mode = 0; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) +{ + u64 sge_cmd; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = qpid << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe); + + cq->cqid = cxio_hal_get_cqid(rdev_p->rscp); + if (!cq->cqid) + return -ENOMEM; + cq->sw_queue = kzalloc(size, GFP_KERNEL); + if (!cq->sw_queue) + return -ENOMEM; + cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) * + sizeof(struct t3_cqe), + &(cq->dma_addr), GFP_KERNEL); + if (!cq->queue) { + kfree(cq->sw_queue); + return
-ENOMEM; + } + pci_unmap_addr_set(cq, mapping, cq->dma_addr); + memset(cq->queue, 0, size); + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = 65535; + setup.credit_thres = 1; + if (rdev_p->t3cdev_p->type == T3B) + setup.ovfl_mode = 0; + else + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = setup.size; + setup.credit_thres = setup.size; /* TBD: overflow recovery */ + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + u32 qpid; + int i; + + mutex_lock(&uctx->lock); + if (!list_empty(&uctx->qpids)) { + entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, + entry); + list_del(&entry->entry); + qpid = entry->qpid; + kfree(entry); + } else { + qpid = cxio_hal_get_qpid(rdev_p->rscp); + if (!qpid) + goto out; + for (i = qpid+1; i & rdev_p->qpmask; i++) { + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + break; + entry->qpid = i; + list_add_tail(&entry->entry, &uctx->qpids); + } + } +out: + mutex_unlock(&uctx->lock); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, + struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + return; + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + entry->qpid = qpid; + mutex_lock(&uctx->lock); + list_add_tail(&entry->entry, &uctx->qpids); + mutex_unlock(&uctx->lock); +} + +void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct list_head *pos, *nxt; + 
struct cxio_qpid_list *entry; + + mutex_lock(&uctx->lock); + list_for_each_safe(pos, nxt, &uctx->qpids) { + entry = list_entry(pos, struct cxio_qpid_list, entry); + list_del_init(&entry->entry); + if (!(entry->qpid & rdev_p->qpmask)) + cxio_hal_put_qpid(rdev_p->rscp, entry->qpid); + kfree(entry); + } + mutex_unlock(&uctx->lock); +} + +void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + INIT_LIST_HEAD(&uctx->qpids); + mutex_init(&uctx->lock); +} + +int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain, + struct t3_wq *wq, struct cxio_ucontext *uctx) +{ + int depth = 1UL << wq->size_log2; + int rqsize = 1UL << wq->rq_size_log2; + + wq->qpid = get_qpid(rdev_p, uctx); + if (!wq->qpid) + return -ENOMEM; + + wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL); + if (!wq->rq) + goto err1; + + wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize); + if (!wq->rq_addr) + goto err2; + + wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL); + if (!wq->sq) + goto err3; + + wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + depth * sizeof(union t3_wr), + &(wq->dma_addr), GFP_KERNEL); + if (!wq->queue) + goto err4; + + memset(wq->queue, 0, depth * sizeof(union t3_wr)); + pci_unmap_addr_set(wq, mapping, wq->dma_addr); + wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + if (!kernel_domain) + wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + + (wq->qpid << rdev_p->qpshift); + PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, + wq->qpid, wq->doorbell, wq->udb); + return 0; +err4: + kfree(wq->sq); +err3: + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize); +err2: + kfree(wq->rq); +err1: + put_qpid(rdev_p, wq->qpid, uctx); + return -ENOMEM; +} + +int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + int err; + err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid); + kfree(cq->sw_queue); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) + * sizeof(struct t3_cqe), 
cq->queue, + pci_unmap_addr(cq, mapping)); + cxio_hal_put_cqid(rdev_p->rscp, cq->cqid); + return err; +} + +int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct cxio_ucontext *uctx) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (wq->size_log2)) + * sizeof(union t3_wr), wq->queue, + pci_unmap_addr(wq, mapping)); + kfree(wq->sq); + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2)); + kfree(wq->rq); + put_qpid(rdev_p, wq->qpid, uctx); + return 0; +} + +static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(T3_SEND) | + V_CQE_TYPE(0) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +{ + u32 ptr; + + PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq); + + /* flush RQ */ + PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, + wq->rq_rptr, wq->rq_wptr, count); + ptr = wq->rq_rptr + count; + while (ptr++ != wq->rq_wptr) + insert_recv_cqe(wq, cq); +} + +static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, + struct t3_swsq *sqp) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(sqp->opcode) | + V_CQE_TYPE(1) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + cqe.u.scqe.wrid_hi = sqp->sq_wptr; + + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_sq(struct t3_wq *wq, 
struct t3_cq *cq, int count) +{ + __u32 ptr; + struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); + + ptr = wq->sq_rptr + count; + sqp += count; + while (ptr != wq->sq_wptr) { + insert_sq_cqe(wq, cq, sqp); + sqp++; + ptr++; + } +} + +/* + * Move all CQEs from the HWCQ into the SWCQ. + */ +void cxio_flush_hw_cq(struct t3_cq *cq) +{ + struct t3_cqe *cqe, *swcqe; + + PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid); + cqe = cxio_next_hw_cqe(cq); + while (cqe) { + PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", + __FUNCTION__, cq->rptr, cq->sw_wptr); + swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2); + *swcqe = *cqe; + swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1)); + cq->sw_wptr++; + cq->rptr++; + cqe = cxio_next_hw_cqe(cq); + } +} + +static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq) +{ + if (CQE_OPCODE(*cqe) == T3_TERMINATE) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) + return 0; + + return 1; +} + +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && + (CQE_QPID(*cqe) == wq->qpid)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + PDBG("%s count zero %d\n", __FUNCTION__, *count); + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && + (CQE_QPID(*cqe) == wq->qpid) && 
cqe_completes_wr(cqe, wq)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p) +{ + struct rdma_cq_setup setup; + setup.id = 0; + setup.base_addr = 0; /* NULL address */ + setup.size = 1; /* enable the CQ */ + setup.credits = 0; + + /* force SGE to redirect to RspQ and interrupt */ + setup.credit_thres = 0; + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) +{ + int err; + u64 sge_cmd, ctx0, ctx1; + u64 base_addr; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + + + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + err = cxio_hal_init_ctrl_cq(rdev_p); + if (err) { + PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err); + return err; + } + rdev_p->ctrl_qp.workq = dma_alloc_coherent( + &(rdev_p->rnic_info.pdev->dev), + (1 << T3_CTRL_QP_SIZE_LOG2) * + sizeof(union t3_wr), + &(rdev_p->ctrl_qp.dma_addr), + GFP_KERNEL); + if (!rdev_p->ctrl_qp.workq) { + PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__); + return -ENOMEM; + } + pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, + rdev_p->ctrl_qp.dma_addr); + rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + memset(rdev_p->ctrl_qp.workq, 0, + (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr)); + + init_MUTEX(&rdev_p->ctrl_qp.sem); + init_waitqueue_head(&rdev_p->ctrl_qp.waitq); + + /* update HW Ctrl QP context */ + base_addr = rdev_p->ctrl_qp.dma_addr; + base_addr >>= 12; + ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) | + V_EC_BASE_LO((u32) base_addr & 0xffff)); + ctx0 <<= 32; + ctx0 |= V_EC_CREDITS(FW_WR_NUM); + base_addr >>= 16; + ctx1 = (u32) base_addr; + base_addr >>= 32; + ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) | + V_EC_TYPE(0) | V_EC_GEN(1) | + V_EC_UP_TOKEN(T3_CTL_QP_TID) | 
+ F_EC_VALID)) << 32; + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1, + T3_CTL_QP_TID, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + wqe->ctx1 = cpu_to_be64(ctx1); + wqe->ctx0 = cpu_to_be64(ctx0); + PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n", + (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, + 1 << T3_CTRL_QP_SIZE_LOG2); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << T3_CTRL_QP_SIZE_LOG2) + * sizeof(union t3_wr), rdev_p->ctrl_qp.workq, + pci_unmap_addr(&rdev_p->ctrl_qp, mapping)); + return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID); +} + +/* write len bytes of data into addr (32B aligned address) + * If data is NULL, clear len bytes of memory to zero. + * caller acquires the sem before the call + */ +static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, + u32 len, void *data, int completion) +{ + u32 i, nr_wqe, copy_len; + u8 *copy_data; + u8 wr_len, utx_len; /* length in 8 byte flit */ + enum t3_wr_flags flag; + __be64 *wqe; + u64 utx_cmd; + addr &= 0x7FFFFFF; + nr_wqe = len % 96 ?
len / 96 + 1 : len / 96; /* 96B max per WQE */ + PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n", + __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, + nr_wqe, data, addr); + utx_len = 3; /* in 32B unit */ + for (i = 0; i < nr_wqe; i++) { + if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2)) { + PDBG("%s ctrl_qp full wtpr 0x%0x rptr 0x%0x, " + "wait for more space i %d\n", __FUNCTION__, + rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i); + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + !Q_FULL(rdev_p->ctrl_qp.rptr, + rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2))) { + PDBG("%s ctrl_qp workq interrupted\n", + __FUNCTION__); + return -ERESTARTSYS; + } + PDBG("%s ctrl_qp wakeup, continue posting work request " + "i %d\n", __FUNCTION__, i); + } + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + flag = 0; + if (i == (nr_wqe - 1)) { + /* last WQE */ + flag = completion ? T3_COMPLETION_FLAG : 0; + if (len % 32) + utx_len = len / 32 + 1; + else + utx_len = len / 32; + } + + /* + * Force a CQE to return the credit to the workq in case + * we posted more than half the max QP size of WRs + */ + if ((i != 0) && + (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) { + flag = T3_COMPLETION_FLAG; + PDBG("%s force completion at i %d\n", __FUNCTION__, i); + } + + /* build the utx mem command */ + wqe += (sizeof(struct t3_bypass_wr) >> 3); + utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3); + utx_cmd <<= 32; + utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1); + *wqe = cpu_to_be64(utx_cmd); + wqe++; + copy_data = (u8 *) data + i * 96; + copy_len = len > 96 ? 
96 : len; + + /* clear memory content if data is NULL */ + if (data) + memcpy(wqe, copy_data, copy_len); + else + memset(wqe, 0, copy_len); + if (copy_len % 32) + memset(((u8 *) wqe) + copy_len, 0, + 32 - (copy_len % 32)); + wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + + (utx_len << 2); + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + + /* wptr in the WRID[31:0] */ + ((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr; + + /* + * This must be the last write with a memory barrier + * for the genbit + */ + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, + Q_GENBIT(rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, + wr_len); + if (flag == T3_COMPLETION_FLAG) + ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); + len -= 96; + rdev_p->ctrl_qp.wptr++; + } + return 0; +} + +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size + * OUT: stag index, actual pbl_size, pbl_addr allocated. + * TBD: shared memory region support + */ +static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, + u32 *stag, u8 stag_state, u32 pdid, + enum tpt_mem_type type, enum tpt_mem_perm perm, + u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl, + u32 *pbl_size, u32 *pbl_addr) +{ + int err; + struct tpt_entry tpt; + u32 stag_idx; + u32 wptr; + int rereg = (*stag != T3_STAG_UNSET); + + stag_state = stag_state > 0; + stag_idx = (*stag) >> 8; + + if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) { + stag_idx = cxio_hal_get_stag(rdev_p->rscp); + if (!stag_idx) + return -ENOMEM; + *stag = (stag_idx << 8) | ((*stag) & 0xFF); + } + PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", + __FUNCTION__, stag_state, type, pdid, stag_idx); + + if (reset_tpt_entry) + cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3); + else if (!rereg) { + *pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3); + if (!*pbl_addr) { + return -ENOMEM; + } + } + + 
+ down_interruptible(&rdev_p->ctrl_qp.sem); + + /* write PBL first if any - update pbl only if pbl list exists */ + if (pbl) { + + PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", + __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, + *pbl_size); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + (*pbl_addr >> 5), + (*pbl_size << 3), pbl, 0); + if (err) + goto ret; + } + + /* write TPT entry */ + if (reset_tpt_entry) + memset(&tpt, 0, sizeof(tpt)); + else { + tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID | + V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) | + V_TPT_STAG_STATE(stag_state) | + V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid)); + BUG_ON(page_size >= 28); + tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | + F_TPT_MW_BIND_ENABLE | + V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | + V_TPT_PAGE_SIZE(page_size)); + tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3)); + tpt.len = cpu_to_be32(len); + tpt.va_hi = cpu_to_be32((u32) (to >> 32)); + tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); + tpt.rsvd_bind_cnt_or_pstag = 0; + tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)); + } + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + stag_idx + + (rdev_p->rnic_info.tpt_base >> 5), + sizeof(tpt), &tpt, 1); + + /* release the stag index to free pool */ + if (reset_tpt_entry) + cxio_hal_put_stag(rdev_p->rscp, stag_idx); +ret: + wptr = rdev_p->ctrl_qp.wptr; + up(&rdev_p->ctrl_qp.sem); + if (!err) + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + return err; +} + +/* IN : stag key, pdid, pbl_size + * Out: stag index, actual pbl_size, and pbl_addr allocated.
+ */ +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr)); +} + +int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, + u32 pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + &pbl_size, &pbl_addr); +} + +int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + u32 pbl_size = 0; + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, + NULL, &pbl_size, NULL); +} + +int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) +{ + struct t3_rdma_init_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p); + wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe)); + wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT)); + wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) | + 
V_FW_RIWR_LEN(sizeof(*wqe) >> 3)); + wqe->wrid.id1 = 0; + wqe->qpid = cpu_to_be32(attr->qpid); + wqe->pdid = cpu_to_be32(attr->pdid); + wqe->scqid = cpu_to_be32(attr->scqid); + wqe->rcqid = cpu_to_be32(attr->rcqid); + wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base); + wqe->rq_size = cpu_to_be32(attr->rq_size); + wqe->mpaattrs = attr->mpaattrs; + wqe->qpcaps = attr->qpcaps; + wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss); + wqe->flags = cpu_to_be32(attr->flags); + wqe->ord = cpu_to_be32(attr->ord); + wqe->ird = cpu_to_be32(attr->ird); + wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); + wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); + wqe->rsvd = 0; + skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = ev_cb; +} + +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = NULL; +} + +static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb) +{ + static int cnt; + struct cxio_rdev *rdev_p = NULL; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x" + " se %0x notify %0x cqbranch %0x creditth %0x\n", + cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg), + RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg), + RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg), + RSPQ_CREDIT_THRESH(rsp_msg)); + PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d " + "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + rdev_p = (struct cxio_rdev *)t3cdev_p->ulp; + if (!rdev_p) { + PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__, + t3cdev_p); + 
return 0; + } + if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) { + rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1; + wake_up_interruptible(&rdev_p->ctrl_qp.waitq); + dev_kfree_skb_irq(skb); + } else if (CQE_QPID(rsp_msg->cqe) == 0xfff8) + dev_kfree_skb_irq(skb); + else if (cxio_ev_cb) + (*cxio_ev_cb) (rdev_p, skb); + else + dev_kfree_skb_irq(skb); + cnt++; + return 0; +} + +/* Caller takes care of locking if needed */ +int cxio_rdev_open(struct cxio_rdev *rdev_p) +{ + struct net_device *netdev_p = NULL; + int err = 0; + if (strlen(rdev_p->dev_name)) { + if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) { + return -EBUSY; + } + netdev_p = dev_get_by_name(rdev_p->dev_name); + if (!netdev_p) { + return -EINVAL; + } + dev_put(netdev_p); + } else if (rdev_p->t3cdev_p) { + if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) { + return -EBUSY; + } + netdev_p = rdev_p->t3cdev_p->lldev; + strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name, + T3_MAX_DEV_NAME_LEN); + } else { + PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__); + return -EINVAL; + } + + if (cxio_hal_add_rdev(rdev_p)) + return -ENOMEM; + + PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name); + memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp)); + if (!rdev_p->t3cdev_p) + rdev_p->t3cdev_p = T3CDEV(netdev_p); + rdev_p->t3cdev_p->ulp = (void *) rdev_p; + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS, + &(rdev_p->rnic_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS, + &(rdev_p->port_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + + /* + * qpshift is the number of bits to shift the qpid left in order + * to get the correct address of the doorbell for that qp. 
+ */ + cxio_init_ucontext(rdev_p, &rdev_p->uctx); + rdev_p->qpshift = PAGE_SHIFT - + long_log2(65536 >> + long_log2(rdev_p->rnic_info.udbell_len >> + PAGE_SHIFT)); + rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT; + rdev_p->qpmask = (65536 >> long_log2(rdev_p->qpnr)) - 1; + PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d " + "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", + __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, + rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), + rdev_p->rnic_info.pbl_base, + rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base, + rdev_p->rnic_info.rqt_top); + PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu " + "qpnr %d qpmask 0x%x\n", + rdev_p->rnic_info.udbell_len, + rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr, + rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask); + + err = cxio_hal_init_ctrl_qp(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", + __FUNCTION__, err); + goto err1; + } + err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0, + 0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ, + T3_MAX_NUM_PD); + if (err) { + printk(KERN_ERR "%s error %d initializing hal resources.\n", + __FUNCTION__, err); + goto err2; + } + err = cxio_hal_pblpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing pbl mem pool.\n", + __FUNCTION__, err); + goto err3; + } + err = cxio_hal_rqtpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing rqt mem pool.\n", + __FUNCTION__, err); + goto err4; + } + return 0; +err4: + cxio_hal_pblpool_destroy(rdev_p); +err3: + cxio_hal_destroy_resource(rdev_p->rscp); +err2: + cxio_hal_destroy_ctrl_qp(rdev_p); +err1: + cxio_hal_delete_rdev(rdev_p); + return err; +} + +void cxio_rdev_close(struct cxio_rdev *rdev_p) +{ + if (rdev_p) { + cxio_hal_pblpool_destroy(rdev_p); + cxio_hal_rqtpool_destroy(rdev_p); + cxio_hal_delete_rdev(rdev_p); + rdev_p->t3cdev_p->ulp = 
NULL; + cxio_hal_destroy_ctrl_qp(rdev_p); + cxio_hal_destroy_resource(rdev_p->rscp); + } +} + +int __init cxio_hal_init(void) +{ + if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI)) + return -ENOMEM; + memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *)); + t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler); + return 0; +} + +void __exit cxio_hal_exit(void) +{ + int i; + t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL); + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + cxio_rdev_close(rdev_tbl[i]); + cxio_hal_destroy_rhdl_resource(); +} + +static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_swsq *sqp; + __u32 ptr = wq->sq_rptr; + int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr); + + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + while (count--) + if (!sqp->signaled) { + ptr++; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + } else if (sqp->complete) { + + /* + * Insert this completed cqe into the swcq. + */ + PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n", + __FUNCTION__, Q_PTR2IDX(ptr, wq->sq_size_log2), + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)); + sqp->cqe.header |= htonl(V_CQE_SWCQE(1)); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) + = sqp->cqe; + cq->sw_wptr++; + sqp->signaled = 0; + break; + } else + break; +} + +static inline void create_read_req_cqe(struct t3_wq *wq, + struct t3_cqe *hw_cqe, + struct t3_cqe *read_cqe) +{ + read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr; + read_cqe->len = wq->oldest_read->read_len; + read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) | + V_CQE_SWCQE(SW_CQE(*hw_cqe)) | + V_CQE_OPCODE(T3_READ_REQ) | + V_CQE_TYPE(1)); +} + +/* + * Return a ptr to the next read wr in the SWSQ or NULL. 
+ */ +static inline void advance_oldest_read(struct t3_wq *wq) +{ + + u32 rptr = wq->oldest_read - wq->sq + 1; + u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2); + + while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) { + wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2); + + if (wq->oldest_read->opcode == T3_READ_REQ) + return; + rptr++; + } + wq->oldest_read = NULL; +} + +/* + * cxio_poll_cq + * + * Caller must: + * check the validity of the first CQE, + * supply the wq associated with the qpid. + * + * credit: cq credit to return to sge. + * cqe_flushed: 1 iff the CQE is flushed. + * cqe: copy of the polled CQE. + * + * return value: + * 0 CQE returned, + * -1 CQE skipped, try again. + */ +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit) +{ + int ret = 0; + struct t3_cqe *hw_cqe, read_cqe; + + *cqe_flushed = 0; + *credit = 0; + hw_cqe = cxio_next_cqe(cq); + + PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x" + " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), + CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), + CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), + CQE_WRID_LOW(*hw_cqe)); + + /* + * skip cqe's not affiliated with a QP. + */ + if (wq == NULL) { + ret = -1; + goto skip_cqe; + } + + /* + * Gotta tweak READ completions: + * 1) the cqe doesn't contain the sq_wptr from the wr. + * 2) opcode not reflected from the wr. + * 3) read_len not reflected from the wr. + * 4) cq_type is RQ_TYPE not SQ_TYPE. + */ + if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) { + + /* + * Don't write to the HWCQ, so create a new read req CQE + * in local memory. + */ + create_read_req_cqe(wq, hw_cqe, &read_cqe); + hw_cqe = &read_cqe; + advance_oldest_read(wq); + } + + /* + * T3A: Discard TERMINATE CQEs.
+ */ + if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) { + ret = -1; + wq->error = 1; + goto skip_cqe; + } + + if (CQE_STATUS(*hw_cqe) || wq->error) { + *cqe_flushed = wq->error; + wq->error = 1; + + /* + * T3A inserts errors into the CQE. We cannot return + * these as work completions. + */ + /* incoming write failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) + && RQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + /* incoming read request failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + + /* incoming SEND with no receive posted failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + ret = -1; + goto skip_cqe; + } + goto proc_cqe; + } + + /* + * RECV completion. + */ + if (RQ_TYPE(*hw_cqe)) { + + /* + * HW only validates 4 bits of MSN. So we must validate that + * the MSN in the SEND is the next expected MSN. If its not, + * then we complete this with TPT_ERR_MSN and mark the wq in + * error. + */ + if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) { + wq->error = 1; + hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN)); + goto proc_cqe; + } + goto proc_cqe; + } + + /* + * If we get here its a send completion. + * + * Handle out of order completion. These get stuffed + * in the SW SQ. Then the SW SQ is walked to move any + * now in-order completions into the SW CQ. This handles + * 2 cases: + * 1) reaping unsignaled WRs when the first subsequent + * signaled WR is completed. + * 2) out of order read completions. 
+ */ + if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) { + struct t3_swsq *sqp; + + PDBG("%s out of order completion going in swsq at idx %ld\n", + __FUNCTION__, + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2)); + sqp = wq->sq + + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2); + sqp->cqe = *hw_cqe; + sqp->complete = 1; + ret = -1; + goto flush_wq; + } + +proc_cqe: + *cqe = *hw_cqe; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*hw_cqe)) { + wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe); + PDBG("%s completing sq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2)); + *cookie = (wq->sq + + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id; + wq->sq_rptr++; + } else { + PDBG("%s completing rq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + *cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + wq->rq_rptr++; + } + +flush_wq: + /* + * Flush any completed cqes that are now in-order. + */ + flush_completed_wrs(wq, cq); + +skip_cqe: + if (SW_CQE(*hw_cqe)) { + PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->sw_rptr); + ++cq->sw_rptr; + } else { + PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->rptr); + ++cq->rptr; + + /* + * T3A: compute credits. + */ + if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + return ret; +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h new file mode 100644 index 0000000..bde5cfb --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_HAL_H__ +#define __CXIO_HAL_H__ + +#include +#include + +#include "t3_cpl.h" +#include "t3cdev.h" +#include "cxgb3_ctl_defs.h" +#include "cxio_wr.h" + +#define T3_CTRL_QP_ID FW_RI_SGEEC_START +#define T3_CTL_QP_TID FW_RI_TID_START +#define T3_CTRL_QP_SIZE_LOG2 8 +#define T3_CTRL_CQ_ID 0 + +/* TBD */ +#define T3_MAX_NUM_RNIC 8 +#define T3_MAX_NUM_RI (1<<15) +#define T3_MAX_NUM_QP (1<<15) +#define T3_MAX_NUM_CQ (1<<15) +#define T3_MAX_NUM_PD (1<<15) +#define T3_MAX_PBL_SIZE 256 +#define T3_MAX_RQ_SIZE 1024 +#define T3_MAX_NUM_STAG (1<<15) + +#define T3_STAG_UNSET 0xffffffff + +#define T3_MAX_DEV_NAME_LEN 32 + +struct cxio_hal_ctrl_qp { + u32 wptr; + u32 rptr; + struct semaphore sem; /* for the wtpr, can sleep */ + wait_queue_head_t waitq; /* wait for RspQ/CQE msg */ + union t3_wr *workq; /* the work request queue */ + dma_addr_t dma_addr; /* pci bus address of the workq */ + DECLARE_PCI_UNMAP_ADDR(mapping) + void __iomem *doorbell; +}; + +struct cxio_hal_resource { + struct kfifo *tpt_fifo; + spinlock_t tpt_fifo_lock; + struct kfifo *qpid_fifo; + spinlock_t qpid_fifo_lock; + struct kfifo *cqid_fifo; + spinlock_t cqid_fifo_lock; + struct kfifo *pdid_fifo; + spinlock_t pdid_fifo_lock; +}; + +struct cxio_qpid_list { + struct list_head entry; + u32 qpid; +}; + +struct cxio_ucontext { + struct list_head qpids; + struct mutex lock; +}; + +struct cxio_rdev { + char dev_name[T3_MAX_DEV_NAME_LEN]; + struct t3cdev *t3cdev_p; + struct rdma_info rnic_info; + struct adap_ports port_info; + struct cxio_hal_resource *rscp; + struct cxio_hal_ctrl_qp ctrl_qp; + void *ulp; + unsigned long qpshift; + u32 qpnr; + u32 qpmask; + struct cxio_ucontext uctx; + struct gen_pool *pbl_pool; + struct gen_pool *rqt_pool; +}; + +static inline int cxio_num_stags(struct cxio_rdev *rdev_p) +{ + return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5)); +} + +typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p, + 
struct sk_buff * skb); + +#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff) +#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff) +#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1) +#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1) +#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1) +#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1) +#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1) +#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1) +#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1) + +struct respQ_msg_t { + __be32 flags; /* flit 0 */ + __be32 cq_ptrid; + __be64 rsvd; /* flit 1 */ + struct t3_cqe cqe; /* flits 2-3 */ +}; + +enum t3_cq_opcode { + CQ_ARM_AN = 0x2, + CQ_ARM_SE = 0x6, + CQ_FORCE_AN = 0x3, + CQ_CREDIT_UPDATE = 0x7 +}; + +int cxio_rdev_open(struct cxio_rdev *rdev); +void cxio_rdev_close(struct cxio_rdev *rdev); +int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit); +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid); +int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr); +int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 
page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, + u32 pbl_addr); +int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); +int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +u32 cxio_hal_get_rhdl(void); +void cxio_hal_put_rhdl(u32 rhdl); +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); +int __init cxio_hal_init(void); +void __exit cxio_hal_exit(void); +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_flush_hw_cq(struct t3_cq *cq); +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit); + +#define MOD "iw_cxgb3: " +#define PDBG(fmt, args...) 
pr_debug(MOD fmt, ## args) + +#ifdef DEBUG +void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag); +void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift); +void cxio_dump_wqe(union t3_wr *wqe); +void cxio_dump_wce(struct t3_cqe *wce); +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents); +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid); +#endif + +#endif From swise at opengridcomputing.com Sat Dec 2 14:51:09 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:51:09 -0600 Subject: [openib-general] [PATCH v2 11/13] Core Resource Allocation In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225108.27014.11770.stgit@dell3.ogc.int> Core functions to carve up adapter memory, stag, qp, and cq IDs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 331 ++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_resource.h | 70 +++++ 2 files changed, 401 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..444df15 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,331 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_resource.h" +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + random_bytes = random32(); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = random32(); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p) +{ + u32 i; + + spin_lock_init(&rdev_p->rscp->qpid_fifo_lock); + + 
rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), + GFP_KERNEL, + &rdev_p->rscp->qpid_fifo_lock); + if (IS_ERR(rdev_p->rscp->qpid_fifo)) + return -ENOMEM; + + for (i = 16; i < T3_MAX_NUM_QP; i++) + if (!(i & rdev_p->qpmask)) + __kfifo_put(rdev_p->rscp->qpid_fifo, + (unsigned char *) &i, sizeof(u32)); + return 0; +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be power of 2 */ +int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) + return -ENOMEM; + rdev_p->rscp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_qpid_fifo(rdev_p); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo emptry */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void cxio_hal_put_rhdl(u32 rhdl) +{ + 
cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} + +/* + * PBL Memory Manager. Uses Linux generic allocator. 
+ */ + +#define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ +#define PBL_CHUNK 2*1024*1024 + +u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size); + return (u32)addr; +} + +void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size); + gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size); +} + +int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); + if (rdev_p->pbl_pool) + for (i = rdev_p->rnic_info.pbl_base; + i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; + i += PBL_CHUNK) + gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); + return rdev_p->pbl_pool ? 0 : -ENOMEM; +} + +void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->pbl_pool); +} + +/* + * RQT Memory Manager. Uses Linux generic allocator. + */ + +#define MIN_RQT_SHIFT 10 /* 1KB == mini RQT size (16 entries) */ +#define RQT_CHUNK 2*1024*1024 + +u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6); + return (u32)addr; +} + +void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6); + gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6); +} + +int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1); + if (rdev_p->rqt_pool) + for (i = rdev_p->rnic_info.rqt_base; + i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; + i += RQT_CHUNK) + gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1); + return rdev_p->rqt_pool ? 
0 : -ENOMEM; +} + +void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->rqt_pool); +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h new file mode 100644 index 0000000..a6bbe83 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_RESOURCE_H__ +#define __CXIO_RESOURCE_H__ + +#include +#include +#include +#include +#include +#include +#include +#include "cxio_hal.h" + +extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl); +extern void cxio_hal_destroy_rhdl_resource(void); +extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, + u32 nr_pdid); +extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag); +extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid); +extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid); +extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp); + +#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base ) +extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); + +#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base ) +extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); +#endif From swise at opengridcomputing.com Sat Dec 2 14:51:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:51:19 -0600 Subject: [openib-general] [PATCH v2 12/13] Core Debug functions In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202225119.27014.65672.stgit@dell3.ogc.int> 
Debug code to dump various data structs, some of which are in adapter memory. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 205 +++++++++++++++++++++++++++ 1 files changed, 205 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..22f4f75 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,205 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = pbl_addr; + m->len = size; + PDBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + __be64 *data = (__be64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + if (size == 0) + size = 8; + while (size > 0) { + PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) 
+{ + __be64 *data = (__be64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +#endif From swise at opengridcomputing.com Sat Dec 2 14:51:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 02 Dec 2006 16:51:29 -0600 Subject: [openib-general] [PATCH v2 13/13] Kconfig/Makefile In-Reply-To: 
<20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225129.27014.42302.stgit@dell3.ogc.int>

Signed-off-by: Steve Wise
---

 drivers/infiniband/Kconfig              |    1 +
 drivers/infiniband/Makefile             |    1 +
 drivers/infiniband/hw/cxgb3/Kconfig     |   27 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/Makefile    |   12 ++++++++++++
 drivers/infiniband/hw/cxgb3/locking.txt |   25 +++++++++++++++++++++++++
 5 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 59b3932..06453ab 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon
 source "drivers/infiniband/hw/ipath/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/amso1100/Kconfig"
+source "drivers/infiniband/hw/cxgb3/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index 570b30a..69bdd55 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mt
 obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/
 obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/
 obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/
+obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/
diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig
new file mode 100644
index 0000000..84f0f6e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Kconfig
@@ -0,0 +1,27 @@
+config INFINIBAND_CXGB3
+	tristate "Chelsio RDMA Driver"
+	depends on CHELSIO_T3 && INFINIBAND
+	select GENERIC_ALLOCATOR
+	---help---
+	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
+	  10GbE adapters.
+
+	  For general information about Chelsio and our products, visit
+	  our website at .
+
+	  For customer support, please visit our customer support page at
+	  .
+
+	  Please send feedback to .
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_CXGB3
+	default n
+	---help---
+	  This option causes the Chelsio RDMA driver to produce copious
+	  amounts of debug messages.  Select this if you are developing
+	  the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 0000000..0df2b3d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+		-I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+	      iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -O1 -g
+iw_cxgb3-y += core/cxio_dbg.o
+endif
diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt
new file mode 100644
index 0000000..e5e9991
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/locking.txt
@@ -0,0 +1,25 @@
+cq lock:
+	- spin lock
+	- used to synchronize the t3_cq
+
+qp lock:
+	- spin lock
+	- used to synchronize updates to the qp state, attrs, and the t3_wq.
+	- touched on interrupt and process context
+
+rnicp lock:
+	- spin lock
+	- touched on interrupt and process context
+	- used around lookup tables mapping CQID and QPID to a structure.
+	- used also to bump the refcnt atomically with the lookup.
+
+poll:
+	lock+disable on cq lock
+	lock qp lock for each cqe that is polled around the call
+	to cxio_poll_cq().
+
+post:
+	lock+disable qp lock
+
+global mutex iwch_mutex:
+	used to maintain global device list.
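
The poll rule in locking.txt above (cq lock taken once as the outer lock, qp lock taken for each polled cqe around the call to cxio_poll_cq()) can be sketched in plain userspace C. This is only an illustration under stated assumptions: toy_lock, toy_cq, toy_qp, toy_poll_one() and toy_poll_cq() are hypothetical names that do not appear in the patch, and the toy lock merely records whether it is held rather than spinning or disabling interrupts.

```c
#include <assert.h>

/* Hypothetical stand-in for a spinlock: it only records whether it is held. */
struct toy_lock {
	int held;
};

/* Hypothetical stand-ins for the driver's cq and qp objects. */
struct toy_cq {
	struct toy_lock lock;
};

struct toy_qp {
	struct toy_lock lock;
	int polled;		/* number of cqes polled for this qp */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a spinlock must not be taken twice */
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Stand-in for cxio_poll_cq(): expects both the cq and the qp lock held. */
static int toy_poll_one(struct toy_cq *cq, struct toy_qp *qp)
{
	assert(cq->lock.held && qp->lock.held);
	qp->polled++;
	return 0;
}

/*
 * Poll n cqes, where owners[i] is the qp that owns the i-th cqe.
 * Lock order: cq lock outermost, qp lock nested per cqe.
 * Returns the number of cqes polled.
 */
static int toy_poll_cq(struct toy_cq *cq, struct toy_qp **owners, int n)
{
	int i, npolled = 0;

	toy_lock_acquire(&cq->lock);
	for (i = 0; i < n; i++) {
		toy_lock_acquire(&owners[i]->lock);
		if (toy_poll_one(cq, owners[i]) == 0)
			npolled++;
		toy_lock_release(&owners[i]->lock);
	}
	toy_lock_release(&cq->lock);
	return npolled;
}
```

Because the qp lock is always nested inside the cq lock and released before the next cqe is examined, a poller following this pattern never holds two qp locks at once. Whether the real driver needs anything beyond this ordering (for example the rnicp lock for the QPID lookup) is not shown here.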
From romieu at fr.zoreil.com Sat Dec 2 15:13:30 2006 From: romieu at fr.zoreil.com (Francois Romieu) Date: Sun, 3 Dec 2006 00:13:30 +0100 Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> Message-ID: <20061202231329.GA10719@electric-eye.fr.zoreil.com> Steve Wise : [...] > Version 2 changes: > > - Make code sparse endian clean > - Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays > - Clean up confusing bitfields > - Use random32() instead of local random function > - Use krefs to track endpoint reference counts > - Misc nits > > ----- > > The following series implements the Chelsio T3 iWARP/RDMA Driver to > be considered for inclusion in 2.6.20. It depends on the Chelsio T3 > Ethernet Driver which is also under review now for 2.6.20. See: I understood that Stephen expressed some doubts regarding the inclusion of TOE enabled features. Was his point addressed ? -- Ueimor From shemminger at osdl.org Sat Dec 2 16:24:47 2006 From: shemminger at osdl.org (Stephen Hemminger) Date: Sat, 02 Dec 2006 16:24:47 -0800 Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver In-Reply-To: <20061202231329.GA10719@electric-eye.fr.zoreil.com> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202231329.GA10719@electric-eye.fr.zoreil.com> Message-ID: <4572194F.8060309@osdl.org> Francois Romieu wrote: > Steve Wise : > [...] > >> Version 2 changes: >> >> - Make code sparse endian clean >> - Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays >> - Clean up confusing bitfields >> - Use random32() instead of local random function >> - Use krefs to track endpoint reference counts >> - Misc nits >> >> ----- >> >> The following series implements the Chelsio T3 iWARP/RDMA Driver to >> be considered for inclusion in 2.6.20. 
It depends on the Chelsio T3 >> Ethernet Driver which is also under review now for 2.6.20. See: >> > > I understood that Stephen expressed some doubts regarding the inclusion > of TOE enabled features. > > Was his point addressed ? > > My comments were about different hardware. From dotanb at dev.mellanox.co.il Sat Dec 2 22:34:32 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 03 Dec 2006 08:34:32 +0200 Subject: [openib-general] RNR_RETRY_EXC_ERR and completion opcode in "send_lat" In-Reply-To: <20061202213454.GB31661@cse.ohio-state.edu> References: <20061202213454.GB31661@cse.ohio-state.edu> Message-ID: <45726FF8.3000807@dev.mellanox.co.il> Hi Sayantan. Sayantan Sur wrote: >Hi, > >I have a question about the "status" field for a completion which is due >to RNR retry exceeded error. I trivially modified the `send_lat' program >(from the Gen2 perftest directory) to use SRQ and not post receives >after some specified time. Given the "rnr_retry" attribute of the QP not >to be 7 (infinite retry), I'm expecting the sender to get an erroneous >completion with IBV_WC_RNR_RETRY_EXC_ERR. > >So far so good ... however, the completion I pull out of the send_cq, >lists the opcode of the completion to be IBV_WC_RECV! Is this expected? > >I am using OFED 1.1 on dual Intel Xeon machines with Mellanox DDR HCAs >(two ports) and in MemFree mode. The distribution used is RH AS4 (Nahant >Update 3), with kernel version 2.6.17.7. > >If someone could explain this behavior, or suggest a workaround, it'd be >great. > >TIA, >Sayantan. > > I took the following text from the man pages that I wrote for libibverbs: "Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid: wr_id, status, qp_num, and vendor_err." In other words, the opcode is not valid when the completion has an error status. Thanks Dotan From mst at mellanox.co.il Sat Dec 2 23:12:55 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 3 Dec 2006 09:12:55 +0200 Subject: [openib-general] userspace git conversion status/cut over In-Reply-To: References: <1164904025.11808.133123.camel@hal.voltaire.com> Message-ID: <20061203071255.GA4377@mellanox.co.il> > Michael, I was reluctant to answer you, because I could remember you wrote > once in a thread asking each maintainer to create/define branch names > in a format so that ofed build script can pick the code properly. > Unfortunately I could not find that anymore. Can you pls restate that? > Currently libehca has only 1.0 and 1.1 (which is ofed-1.1.1). > Thanks > Nam you only need 2 tags: OFED 1.1 should be refs/tags/vofed-1.1 Assuming you have code for 1.0 (which I don't recall ehca having), tag it with refs/tags/vofed-1.0 otherwise remove the tag -- MST From ogerlitz at voltaire.com Sat Dec 2 23:45:54 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 03 Dec 2006 09:45:54 +0200 Subject: [openib-general] NFS/RDMA for Linux: client and server update release 7 In-Reply-To: References: Message-ID: <457280B2.3030709@voltaire.com> Talpey, Thomas wrote: > At 06:12 PM 12/1/2006, Roland Dreier wrote: >> What is the status of moving this code towards merging to the upstream kernel? > > For the client there are two main prerequisites, both in the RPC layer > and both in progress. One is the completion of the RPC transport switch > merge, mainly the ability to load as modules. The second is a new mount > syscall api, to allow transport-specific arguments to be passed in. We > have a temporary solution for that at the moment. When these two are > in place, the client is ready to consider merging. When these two are ready for merging, note that you don't have to wait for them to be merged in rX and then push the client for rX+1, you can push them all together. Moreover, if the rnfs client is the only user of these features you might not be able to push them without it being pushed as well. 
> Bottom line, we can put it on the table soon. As was stated over this list few times in the past, as your code is an rdma driver which was never send out to this list for RFC (sending a pointer to some tgz does not count as its not the common practice in the linux kernel open source dev cycle) you better put it on the table sooner then later. Since 2.6.20 has been open, its seems the correct time if you consider pushing it for 2.6.21 . Or. From ogerlitz at voltaire.com Sun Dec 3 00:31:43 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 03 Dec 2006 10:31:43 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> Message-ID: <45728B6F.6040905@voltaire.com> Ralph Campbell wrote: >> On 11/30/06, Ralph Campbell wrote: >>> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote: >>>> So what did you change since v1? How do you deal with fitting 64-bit >>>> addresses into an sg list entry that has a 32-bit dma_addr_t? > Although the driver compiles on 32-bit kernels, it is unsupported > and never been tested. All known 64-bit systems don't define > CONFIG_HIGHMEM. In spite of previous emails suggesting that > page_address() can return NULL without CONFIG_HIGHMEM defined, > the code in include/linux/mm.h doesn't allow it (assuming the > page pointer is valid and not some random address). > I verified this with Andrew Morton. Can you provide the quote from include/linux/mm.h of the code that disallows it? looking there i don't see the enforcement. mmm, your consulting with Andrew Morton was not over this thread... 
Well, Christoph Hellwig's comment on the V1 thread tells a different story: Only for GFP_KERNEL allocations you can assume page_address is valid, and the scatterlist passed to a SCSI LLDD can contain any type of pages. Currently on all 64bit architectures page_address works on all pages, but that's an implementation detail that could change any time and that you should not rely on. See http://www.mail-archive.com/openib-general at openib.org/msg27132.html As I have mentioned in the past, this (no kvaddr for a page) comes into play when a SCSI LLD (e.g. iSER, SRP) gets DIRECT I/O or AIO (SDP) pages from user space. Or. From ogerlitz at voltaire.com Sun Dec 3 00:36:35 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 03 Dec 2006 10:36:35 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> Message-ID: <45728C93.6020504@voltaire.com> Ralph Campbell wrote: > Basically, use a hash table to store the kmap result. > See attached for 90% of the code.
> static u64 ipath_dma_map_page(struct ib_device *dev, > struct page *page, > unsigned long offset, > size_t size, > enum dma_data_direction direction) > { > u64 addr; > > BUG_ON(!valid_dma_direction(direction)); > > if (offset + size > PAGE_SIZE) { > addr = BAD_DMA_ADDRESS; > goto done; > } > > #ifdef CONFIG_HIGHMEM > /* handle highmem pages */ > if (PageHighMem(page)) { > void *v = kmap(page); another comment we have got on iser, is that this code can be called context that requires kmap_atomic (and xxx_dma_unmap_page in a context that requires kunmap_atomic). This imposes another problem, since the kmap_atomic slots are somehow limited and with this patch the ipath driver would hold those mapping for relatively long time (ie it does not kmap/copy/kunmap). > > if (!v) > addr = BAD_DMA_ADDRESS; > else { > addr = (u64) v + offset; > hash_insert(dev, v + offset, page); > } > goto done; > } > #endif > addr = (u64) page_address(page); > if (addr) > addr += offset; > > done: > return addr; > } From ogerlitz at voltaire.com Sun Dec 3 00:42:55 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 03 Dec 2006 10:42:55 +0200 Subject: [openib-general] [PATCH v2 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <1164911024.14800.74.camel@brick.pathscale.com> References: <1164911024.14800.74.camel@brick.pathscale.com> Message-ID: <45728E0F.9020106@voltaire.com> Ralph Campbell wrote: > This patch implements the interposing DMA mapping functions to allow > support for IOMMUs and remove the dependence on phys_to_virt(). 
> --- /dev/null Thu Jan 01 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Wed Nov 29 13:55:07 2006 -0800 > +/** > + * ipath_dma_map_single - Map a kernel virtual address to DMA address > + * @device: The device for which the dma_addr is to be created > + * @cpu_addr: The kernel virtual address > + * @size: The size of the region in bytes > + * @direction: The direction of the DMA > + */ > +static u64 ipath_dma_map_single(struct ib_device *dev, > + void *cpu_addr, size_t size, > + enum dma_data_direction direction) > +{ > + BUG_ON(!valid_dma_direction(direction)); > + return (u64) cpu_addr; > +} if ipath_dma_map_single is a NO OP > +/** > + * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU > + * @device: The device for which the DMA address was created > + * @addr: The DMA address > + * @size: The size of the region in bytes > + * @dir: The direction of the DMA > + */ > +static void ipath_sync_single_for_cpu(struct ib_device *dev, > + u64 addr, > + size_t size, > + enum dma_data_direction dir) > +{ > + dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); > +} then why ipath_sync_single_for_cpu does something? am i just pointing on a cleanup or there's something more deep here? Or. From tziporet at dev.mellanox.co.il Sun Dec 3 02:19:04 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 03 Dec 2006 12:19:04 +0200 Subject: [openib-general] reminder: OFED 1.2 meeting next Monday In-Reply-To: <1164895558.11808.128480.camel@hal.voltaire.com> References: <456EE52E.2060208@dev.mellanox.co.il> <1164895558.11808.128480.camel@hal.voltaire.com> Message-ID: <4572A498.4090201@dev.mellanox.co.il> Hal Rosenstock wrote: > On Thu, 2006-11-30 at 09:05, Tziporet Koren wrote: > >> Hi All, >> I wish to remind all that we have the EWG meeting on Monday 4-Dec at >> 9am-10am. >> > > Which tz ? 
> > 9-10am PST Meeting details (if you don't have them): ______________________________________________________________________________ Jeffrey Squyres has invited you to a Cisco MeetingPlace Conference Date/Time: DEC 4, 2006 at 12:00PM America/New_York Length: 60 Frequency: 10 Meeting ID: 2106670 Meeting Password: Global Access Numbers: http://cisco.com/en/US/about/doing_business/conferencing/index.html US/Canada: +1.866.432.9903 United Kingdom: +44.20.8824.0117 India: +91.80.4103.3979 Germany: +49.619.6773.9002 Japan: +81.3.5763.9394 China: +86.10.8515.5666 From arjan at infradead.org Sun Dec 3 04:07:18 2006 From: arjan at infradead.org (Arjan van de Ven) Date: Sun, 03 Dec 2006 13:07:18 +0100 Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data Structures In-Reply-To: <20061202224947.27014.59189.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224947.27014.59189.stgit@dell3.ogc.int> Message-ID: <1165147639.3233.211.camel@laptopd505.fenrus.org> On Sat, 2006-12-02 at 16:49 -0600, Steve Wise wrote: > + > +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, > + struct ib_ah_attr *ah_attr) > +{ > + return ERR_PTR(-ENOSYS); > +} -ENOSYS is just about ALWAYS a bug in that it's guaranteed to be the wrong error code ;) From mst at mellanox.co.il Sun Dec 3 04:47:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 3 Dec 2006 14:47:06 +0200 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: <20061203124623.GA15614@mellanox.co.il> References: <20061203124623.GA15614@mellanox.co.il> Message-ID: <20061203124706.GB15614@mellanox.co.il> > > Quoting r. 
Roland Dreier : > > Subject: [GIT PULL] please pull infiniband.git > > > > Linus, please pull from > > > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > > > This tree is also available from kernel.org mirrors at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > > > ... > > > > > IB/ucm: Fix deadlock in cleanup > > Can this go into -stable for 2.6.18.x? Sorry, that should have been 2.6.19.y. -- MST From mst at mellanox.co.il Sun Dec 3 04:42:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 3 Dec 2006 14:42:43 +0200 Subject: [openib-general] [CM] what happen if the path in the REQ packet (primary or alternate) is not reversible? In-Reply-To: <456C74C5.5070007@ichips.intel.com> References: <456C74C5.5070007@ichips.intel.com> Message-ID: <20061203124243.GE4296@mellanox.co.il> > The reversible bit needs to be set as well. Like this, then? --- SRP must set IB_SA_PATH_REC_REVERSIBLE since that's the only kind of path CM currently supports. Untested. Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4b09147..df98754 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -266,6 +266,7 @@ static void srp_path_rec_completion(int static int srp_lookup_path(struct srp_target_port *target) { target->path.numb_path = 1; + target->path.reversible = 1; init_completion(&target->done); @@ -276,6 +277,7 @@ static int srp_lookup_path(struct srp_ta IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_REVERSIBLE | IB_SA_PATH_REC_PKEY, SRP_PATH_REC_TIMEOUT_MS, GFP_KERNEL, -- MST From mst at mellanox.co.il Sun Dec 3 04:46:23 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 3 Dec 2006 14:46:23 +0200 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <20061203124623.GA15614@mellanox.co.il> > Quoting r. Roland Dreier : > Subject: [GIT PULL] please pull infiniband.git > > Linus, please pull from > > master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > > This tree is also available from kernel.org mirrors at: > > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus > ... > > IB/ucm: Fix deadlock in cleanup Can this go into -stable for 2.6.18.x? -- MST From tziporet at dev.mellanox.co.il Sun Dec 3 05:49:55 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 03 Dec 2006 15:49:55 +0200 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test In-Reply-To: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com> Message-ID: <4572D603.6080101@dev.mellanox.co.il> Boris Shpolyansky wrote: > Hi David, > > If you are using OFED-1.1 stack and OSU MVAPICH provided with the > OFED-1.1 package as your MPI layer, > the attached patch should solve your problem. > > Please, let me know if that helped. > > Regards, > Boris, Please add this to OFED 1.1 support page Thanks, Tziporet From jengelh at linux01.gwdg.de Sun Dec 3 08:03:35 2006 From: jengelh at linux01.gwdg.de (Jan Engelhardt) Date: Sun, 3 Dec 2006 17:03:35 +0100 (MET) Subject: [openib-general] [PATCH v2 02/13] Device Discovery and ULLD Linkage In-Reply-To: <20061202224937.27014.951.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224937.27014.951.stgit@dell3.ogc.int> Message-ID: Hi, Some questions,suggestions,: >+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; Can it be static'ified? (I suppose not.) 
>+struct cxgb3_client t3c_client = { >+ .name = "iw_cxgb3", >+ .add = open_rnic_dev, >+ .remove = close_rnic_dev, >+ .handlers = t3c_handlers, >+ .redirect = iwch_ep_redirect >+}; Can it be const'ified? >+static void rnic_init(struct iwch_dev *rnicp) >+{ >+ PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); >+ idr_init(&rnicp->cqidr); >+ idr_init(&rnicp->qpidr); >+ idr_init(&rnicp->mmidr); >+ spin_lock_init(&rnicp->lock); >+ >+ rnicp->attr.vendor_id = 0x168; >+ rnicp->attr.vendor_part_id = 7; Sugg.: typeof(rnicp->attr) *a = &rnicp->attr; // replace typeof with proper thing a->vendor_id = 0x168; a->vendor_part_id = 7; shortens the lines a bit. >+ rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; >+ rnicp->attr.max_wrs = (1UL << 24) - 1; >+ rnicp->attr.max_sge_per_wr = T3_MAX_SGE; >+ rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; >+ rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; >+ rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1; >+ rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev); >+ rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; >+ rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; >+ rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ >+ rnicp->attr.can_resize_wq = 0; >+ rnicp->attr.max_rdma_reads_per_qp = 8; >+ rnicp->attr.max_rdma_read_resources = >+ rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; >+ rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */ >+ rnicp->attr.max_rdma_read_depth = >+ rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; >+ rnicp->attr.rq_overflow_handled = 0; >+ rnicp->attr.can_modify_ird = 0; >+ rnicp->attr.can_modify_ord = 0; >+ rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1; >+ rnicp->attr.stag0_value = 1; >+ rnicp->attr.zbva_support = 1; >+ rnicp->attr.local_invalidate_fence = 1; >+ rnicp->attr.cq_overflow_detection = 1; >+ return; >+} >+ >--- /dev/null >+++ b/drivers/infiniband/hw/cxgb3/iwch.h >+static inline int t3b_device(struct iwch_dev *rhp) >+{ >+ return (rhp->rdev.t3cdev_p->type == T3B); >+} >+ >+static inline 
int t3a_device(struct iwch_dev *rhp) >+{ >+ return (rhp->rdev.t3cdev_p->type == T3A); >+} These two can be constified for sure: static inline int t3a_device(const struct iwch_dev *rhp) >+ >+static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid) >+{ >+ return idr_find(&rhp->cqidr, cqid); >+} >+ >+static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid) >+{ >+ return idr_find(&rhp->qpidr, qpid); >+} >+ >+static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) >+{ >+ return idr_find(&rhp->mmidr, mmid); >+} Here I am not sure. -`J' -- From surs at cse.ohio-state.edu Sun Dec 3 11:57:41 2006 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sun, 03 Dec 2006 14:57:41 -0500 Subject: [openib-general] RNR_RETRY_EXC_ERR and completion opcode in "send_lat" In-Reply-To: <45726FF8.3000807@dev.mellanox.co.il> References: <20061202213454.GB31661@cse.ohio-state.edu> <45726FF8.3000807@dev.mellanox.co.il> Message-ID: <45732C35.5060107@cse.ohio-state.edu> Hi Dotan, Thanks a lot for this information. Sayantan. Dotan Barak wrote: > Hi Sayantan. > Sayantan Sur wrote: > >> Hi, >> >> I have a question about the "status" field for a completion which is due >> to RNR retry exceeded error. I trivially modified the `send_lat' program >> (from the Gen2 perftest directory) to use SRQ and not post receives >> after some specified time. Given the "rnr_retry" attribute of the QP not >> to be 7 (infinite retry), I'm expecting the sender to get an erroneous >> completion with IBV_WC_RNR_RETRY_EXC_ERR. >> >> So far so good ... however, the completion I pull out of the send_cq, >> lists the opcode of the completion to be IBV_WC_RECV! Is this expected? >> >> I am using OFED 1.1 on dual Intel Xeon machines with Mellanox DDR HCAs >> (two ports) and in MemFree mode. The distribution used is RH AS4 (Nahant >> Update 3), with kernel version 2.6.17.7. >> >> If someone could explain this behavior, or suggest a workaround, it'd be >> great. >> >> TIA, >> Sayantan. 
>> >> > I toke the following text from the man pages that i wrote to the > libibverbs: > "Not all wc attributes are always valid. If the completion status is > other than IBV_WC_SUCCESS, only the following attributes are > valid: > wr_id, status, qp_num, and vendor_err." > > In other words, the opcode is not valid if you have a completion with > error. > > Thanks > Dotan -- http://www.cse.ohio-state.edu/~surs From krkumar2 at in.ibm.com Sun Dec 3 19:44:57 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Mon, 04 Dec 2006 09:14:57 +0530 Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in c2_qp_modify. Message-ID: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com> vq_req is leaked in error cases. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/hw/amso1100/c2_qp.c new/drivers/infiniband/hw/amso1100/c2_qp.c --- org/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-15 12:40:04.000000000 +0530 +++ new/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-16 18:10:03.000000000 +0530 @@ -161,8 +161,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s if (attr_mask & IB_QP_STATE) { /* Ensure the state is valid */ - if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) - return -EINVAL; + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) { + err = -EINVAL; + goto bail0; + } wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); @@ -184,9 +186,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s if (attr->cur_qp_state != IB_QPS_RTR && attr->cur_qp_state != IB_QPS_RTS && attr->cur_qp_state != IB_QPS_SQD && - attr->cur_qp_state != IB_QPS_SQE) - return -EINVAL; - else + attr->cur_qp_state != IB_QPS_SQE) { + err = -EINVAL; + goto bail0; + } else wr.next_qp_state = cpu_to_be32(to_c2_state(attr->cur_qp_state)); From mst at mellanox.co.il Mon Dec 4 00:59:45 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 4 Dec 2006 10:59:45 +0200 Subject: [openib-general] CMA issue: SDP login compliancy Message-ID: <20061204085945.GC20943@mellanox.co.il> Hi! The SDP compliance statement *requires* that a consumer checks the Responder Resources field in the connection Request/Response, verifying that it is > 0. This is part of CA 4-41 in the spec. However, the Responder Resources field does not seem to be exposed by the CMA API. I think knowing this value (at least in REQ, but preferably in REP as well) is also important for any ULP that does RDMA reads. Should/can CMA/UCMA be extended to pass this to the user? This might be something we need to address before the UCMA merge to avoid ABI breakage later. -- MST From monis at voltaire.com Mon Dec 4 01:11:33 2006 From: monis at voltaire.com (Moni Shoua) Date: Mon, 04 Dec 2006 11:11:33 +0200 Subject: [openib-general] [PATCH v2] IB_mthca HCA profile module parameters In-Reply-To: References: <456336AC.2070803@voltaire.com> Message-ID: <4573E645.8050806@voltaire.com> Roland Dreier wrote: > OK, getting better, but still not there: > > > + if (mthca_is_memfree(mdev)) { > > + mthca_check_profile_and_warn(num_udav,default_profile.num_udav, > > + MTHCA_DEFAULT_NUM_UDAV); > > + mthca_check_profile_and_warn(fmr_reserved_mtts,default_profile.fmr_reserved_mtts, > > + MTHCA_DEFAULT_NUM_RESERVED_MTTS); > > fmr_reserved_mtts is actually only used for NON-memfree HCAs. > > Also, coding style is spaces after commas, so you should do > > foo(x, y, z); > > instead of > > foo(x,y,z); > > I think it will be easier if I just try to fix this patch up myself > instead of iterating like this... > Hi, It is fine with me if you make the necessary changes in this patch to make it acceptable. I would appreciate it if you publish it after that, so that I can learn about the changes you made. Thanks, and sorry for the delay in replying. MoniS From mst at mellanox.co.il Mon Dec 4 02:01:32 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 4 Dec 2006 12:01:32 +0200 Subject: [openib-general] QUESTION: git-tag howto In-Reply-To: References: Message-ID: <20061204100132.GE20943@mellanox.co.il> Good question. There are different kinds of tags. For OFED we have used lightweight tags so far. Linus uses signed tags for Linux releases. What *should* we use for OFED long term? Comments welcome. Summary below. ---- A lightweight tag is just a reference, not a real object. As such, lightweight tags do not have a name and are not immutable. You create them with git tag Tag objects have a name and so are immutable. These are created with git tag -a for unsigned tags and git tag -s/-u for signed tags. Quoting r. Hoang-Nam Nguyen : Subject: QUESTION: git-tag howto Hi Michael! Can you please give me some advices how to git-tag? Do I really need to generate a gpg key for that? It would be great if you can describe me briefly your git-tag procedure. Thanks! Nam -- MST From ogerlitz at voltaire.com Mon Dec 4 02:55:23 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 04 Dec 2006 12:55:23 +0200 Subject: [openib-general] Local SA caching - why we need it In-Reply-To: References: Message-ID: <4573FE9B.20400@voltaire.com> Woodruff, Robert J wrote: > This really is not an issue with the Intel MPI connection establishment > design, rather, any application (or set of applications) that needs to > establish lots of connections will have the same issue. This is an oversimplification: you mentioned that in your testing the SA scaled to 15K queries/second; let's set that limit to 10K. **If** your (anyone's) MPI makes sure it does not impose on the SA a load of more than 10K queries per second, it would take 100 seconds for the SA to provide 1M paths to a 1K process job with 1K paths/for/each rank.
This somewhat simple design change reduces your job start time from today's infinite to 100 seconds plus the time it takes to do all the other work (IP2GID resolution && QP create/modify-init-rtr-rts && CM exchanges && your-mpi-etcs). The local-sa shrinks the paths-fetching-time from 100 seconds to zero and your startup code time would reduce to the "other" time. So now you are either very happy, or you just hit the next roadblock, since the "other" time is not negligible. At the devcon and elsewhere on this thread I was trying to say this and mention some ideas re the next roadblock; I am not sure why you did not want to hear it. ================================================== When a closed source product sets requirements on open source software they should be willing to discuss some/of/the actual design and implementation of their SW. Specifically, when competing SW products (specifically open source ones) are claimed to need the exact or similar set of functionalities, you should be willing to have a discussion. They might even get some good advice for free... The local SA was not developed for Intel MPI needs, but rather in the framework of the path-forward project, for future open MPI usage and/or other requirements of the labs (routing algorithms/visualization etc). With this at hand, and your instant request to include it in OFED 1.2, the group here thinks that a local/distributed SA can be quite a good solution for the roadblock you are hitting and does not object to including it in OFED 1.2 in the non-disturbing form you mentioned. But, given what seems to be a trend of more MPIs which are now in, or soon to be in, a transition towards using the RDMA CM for their job start, I would ask to hold off with a kernel push, to first have the local sa solution tested and moreover see what problems are seen by other (and your) MPIs when attempting to scale over Infiniband. Or.
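[Editorial sketch, not part of the original message.] The start-up estimate in the message above is plain arithmetic: 1K ranks times 1K paths per rank is 1M path queries, and at a capped rate of 10K queries per second that is 100 seconds of path fetching. The helper names below (sa_fetch_seconds, job_start_seconds) are hypothetical and exist only to make the estimate explicit; the "other" time is whatever the remaining connection setup costs, which the message does not quantify.

```c
#include <assert.h>

/* Seconds a rate-limited SA needs to answer ranks * paths_per_rank queries. */
double sa_fetch_seconds(unsigned long ranks, unsigned long paths_per_rank,
			double sa_queries_per_sec)
{
	assert(sa_queries_per_sec > 0.0);
	return ((double)ranks * (double)paths_per_rank) / sa_queries_per_sec;
}

/*
 * Job start estimate: path fetching plus all other work (address
 * resolution, QP setup, CM exchanges). With a local SA cache the
 * path-fetching term is taken as zero, as the message argues.
 */
double job_start_seconds(double fetch_seconds, double other_seconds,
			 int local_sa_cache)
{
	return (local_sa_cache ? 0.0 : fetch_seconds) + other_seconds;
}
```

With the numbers from the message, sa_fetch_seconds(1000, 1000, 10000.0) evaluates to 100 seconds. A local SA cache removes that term but leaves the "other" time untouched, which is the next roadblock the message refers to.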
From johnpol at 2ka.mipt.ru Mon Dec 4 03:08:26 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Mon, 4 Dec 2006 14:08:26 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061202224958.27014.65970.stgit@dell3.ogc.int> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> Message-ID: <20061204110825.GA26251@2ka.mipt.ru> On Sat, Dec 02, 2006 at 04:49:58PM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) > +{ > + struct cpl_close_con_req *req; > + struct sk_buff *skb; > + > + PDBG("%s ep %p\n", __FUNCTION__, ep); > + skb = get_skb(NULL, sizeof(*req), gfp); > + if (!skb) { > + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); > + return -ENOMEM; > + } > + skb->priority = CPL_PRIORITY_DATA; > + set_arp_failure_handler(skb, arp_failure_discard); > + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); > + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); > + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); > + l2t_send(ep->com.tdev, skb, ep->l2t); > + return 0; > +} > + > +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) > +{ > + struct cpl_abort_req *req; > + > + PDBG("%s ep %p\n", __FUNCTION__, ep); > + skb = get_skb(skb, sizeof(*req), gfp); > + if (!skb) { > + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", > + __FUNCTION__); > + return -ENOMEM; > + } > + skb->priority = CPL_PRIORITY_DATA; > + set_arp_failure_handler(skb, abort_arp_failure); > + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); > + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ)); > + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); > + req->cmd = CPL_ABORT_SEND_RST; > + l2t_send(ep->com.tdev, skb, ep->l2t); > + 
return 0; > +} > + > +static int send_connect(struct iwch_ep *ep) > +{ > + struct cpl_act_open_req *req; > + struct sk_buff *skb; > + u32 opt0h, opt0l, opt2; > + unsigned int mtu_idx; > + int wscale; > + > + PDBG("%s ep %p\n", __FUNCTION__, ep); > + > + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); > + if (!skb) { > + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", > + __FUNCTION__); > + return -ENOMEM; > + } > + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); > + wscale = compute_wscale(rcv_win); > + opt0h = V_NAGLE(0) | > + V_NO_CONG(nocong) | > + V_KEEP_ALIVE(1) | > + F_TCAM_BYPASS | > + V_WND_SCALE(wscale) | > + V_MSS_IDX(mtu_idx) | > + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); > + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); > + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); > + skb->priority = CPL_PRIORITY_SETUP; > + set_arp_failure_handler(skb, act_open_req_arp_failure); > + > + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); > + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); > + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); > + req->local_port = ep->com.local_addr.sin_port; > + req->peer_port = ep->com.remote_addr.sin_port; > + req->local_ip = ep->com.local_addr.sin_addr.s_addr; > + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; > + req->opt0h = htonl(opt0h); > + req->opt0l = htonl(opt0l); > + req->params = 0; > + req->opt2 = htonl(opt2); > + l2t_send(ep->com.tdev, skb, ep->l2t); > + return 0; > +} ... 
> +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) > +{ > + struct iwch_ep *ep = ctx; > + struct cpl_act_establish *req = cplhdr(skb); > + unsigned int tid = GET_TID(req); > + > + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid); > + > + dst_confirm(ep->dst); > + > + /* setup the hwtid for this connection */ > + ep->hwtid = tid; > + cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); > + > + ep->snd_seq = ntohl(req->snd_isn); > + > + set_emss(ep, ntohs(req->tcp_opt)); > + > + /* dealloc the atid */ > + cxgb3_free_atid(ep->com.tdev, ep->atid); > + > + /* start MPA negotiation */ > + send_mpa_req(ep, skb); > + > + return 0; > +} > + > +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) > +{ > + PDBG("%s ep %p\n", __FILE__, ep); > + state_set(&ep->com, ABORTING); > + send_abort(ep, skb, GFP_KERNEL); > +} Could you convince network core developers that it is not own TCP implementation which will mess with existing one? This and a lot of other changes in this driver definitely says you implement your own stack of protocols on top of infiniband hardware. -- Evgeniy Polyakov From mst at mellanox.co.il Mon Dec 4 04:43:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 14:43:04 +0200 Subject: [openib-general] Local SA caching - why we need it In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611913E075@EPEXCH2.qlogic.org> References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E075@EPEXCH2.qlogic.org> Message-ID: <20061204124304.GB31314@mellanox.co.il> > A well designed SA cache/replica can use the assorted InformInfo notices > from the SM to detect when GIDs come and go and hence properly update > the relevant subset of its replica. OK, but its not clear that keeping local sa cache per node is the answer. 
Specifically it seems to be good for some workloads/topologies, but worst case numbers do not look good to me: It seems that (especially at startup time), the number of notifications sent would be linear with the network size N. Since all N nodes need to be notified, we get O(N^2) notifications instead of O(N^2) queries, which does not seem to be a win - especially if you consider that queries can be on demand while notifications aren't. For example, a design using SA redirection with O(\sqrt N) SA replicas would give you O(\sqrt N) notifications per replica and this way we would get O(N \sqrt N) notifications and O(N \sqrt N) queries. This would also move the code out from kernel to userspace SA. I also note that the local_sa design from here: https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/core/local_sa.c does not seem to implement any InformInfo notices, which seems to guarantee query storms with low cache timeout values, and connection timeouts on topology changes with high cache timeout values. -- MST From mst at mellanox.co.il Mon Dec 4 06:22:14 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 16:22:14 +0200 Subject: [openib-general] oops with multicast patches In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> Message-ID: <20061204142214.GA5426@mellanox.co.il> OK, I got back to this finally. First, I reproduced the crash again, with spinlock debugger enabled. It seems we are looking at some use-after-free. Next, I'll try adding the debugging patch Sean posted, and see what this gives. > When running the test ib_mcast_full, both of the hosts (10.4.10.136-137 ) got kernel oops (see below). > This test first restart the driver, and after that it attached to the max available multicast groups. 
BUG: spinlock bad magic on CPU#1, ib_mad2/15709 Unable to handle kernel paging request at 00000001003e0107 RIP: {spin_bug+116} PGD 75f1a067 PUD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: mst_pciconf mst_pci ib_mthca ib_umad ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod button battery ac ohci_hcd i2c_amd8111 i2c_amd756 i2c_core tg3 floppy ext3 jbd Pid: 15709, comm: ib_mad2 Not tainted 2.6.17.9-smp #1 RIP: 0010:[] {spin_bug+116} RSP: 0018:ffff810043b43ca8 EFLAGS: 00010006 RAX: 0000000000000000 RBX: 00000001003e0003 RCX: ffffffff80438e07 RDX: ffffffff80480e18 RSI: 0000000000000046 RDI: ffffffff80480e00 RBP: ffff81007a1cd840 R08: 00000000ffffffff R09: 0000000000000004 R10: 0000000100000000 R11: 0000000000000046 R12: ffff81007a1cd838 R13: 0000000000000293 R14: 0000000000000000 R15: ffffffff880732ce FS: 00002b76b6c7e4e0(0000) GS:ffff81007df2eac0(0000) knlGS:00000000f7fd78e0 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000001003e0107 CR3: 0000000075d1e000 CR4: 00000000000006e0 Process ib_mad2 (pid: 15709, threadinfo ffff810043b42000, task ffff81007d047880) Stack: 0000000000000003 ffff81007a1cd840 ffff81007a1cd840 ffffffff802e02cd ffff81007cfe9600 ffff81007a1cd840 ffff81007a1cd838 ffffffff8040ee4b 0000000000000246 ffffffff8807beff Call Trace: {_raw_spin_lock+28} {_spin_lock_irqsave+11} {:ib_sa:release_group+27} {:ib_sa:mcast_work_handler+1280} {find_busiest_group+304} {:ib_mad:timeout_sends+0} {:ib_sa:ib_sa_mcmember_rec_callback+64} {_spin_unlock_irq+7} {thread_return+100} {:ib_sa:send_handler+74} {:ib_mad:timeout_sends+397} {run_workqueue+161} {worker_thread+0} {keventd_create_kthread+0} {worker_thread+261} {default_wake_function+0} {keventd_create_kthread+0} {default_wake_function+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Code: 44 8b 83 04 01 00 00 48 8d 8b a0 02 00 00 8b 55 04 41 89 c1 RIP {spin_bug+116} RSP CR2: 
00000001003e0107 <3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43 in_atomic():0, irqs_disabled():1 Call Trace: {__might_sleep+190} {blocking_notifier_call_chain+31} {do_exit+34} {_spin_lock_irqsave+11} {vgacon_set_cursor_size+51} {do_page_fault+1852} {:ib_core:ib_ud_header_pack+135} {:ib_mthca:build_mlx_header+464} {:ib_mad:timeout_sends+0} {:ib_mad:timeout_sends+0} {error_exit+0} {:ib_mad:timeout_sends+0} {spin_bug+116} {spin_bug+97} {_raw_spin_lock+28} {_spin_lock_irqsave+11} {:ib_sa:release_group+27} {:ib_sa:mcast_work_handler+1280} {find_busiest_group+304} {:ib_mad:timeout_sends+0} {:ib_sa:ib_sa_mcmember_rec_callback+64} {_spin_unlock_irq+7} {thread_return+100} {:ib_sa:send_handler+74} {:ib_mad:timeout_sends+397} {run_workqueue+161} {worker_thread+0} {keventd_create_kthread+0} {worker_thread+261} {default_wake_function+0} {keventd_create_kthread+0} {default_wake_function+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} Triggering sysrq show that this was during module removal: modprobe D ffff810055b1be48 0 22388 1 4048 (NOTLB) ffff810055b1be48 0000000155b1bdc0 ffff81007a1c9880 000000000000050e 000002b14226ffda ffff81007c1888c0 0000000000000001 0000000155b1be88 ffff81007a1c9880 0000000000001707 Call Trace: {wait_for_completion+229} {wait_for_completion+165} {default_wake_functio n+0} {default_wake_function+0} {:ib_sa:mcast_cleanup +25} {:ib_sa:ib_sa_cleanup+6} {sys_delete_module+411 } {__up_write+20} {sys_munmap+91} {system_call+126} -- Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd. 
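For readers following the trace: "spinlock bad magic" is what the spinlock debugger prints when a lock's magic field no longer holds the value written at init time. That is consistent with the use-after-free suspicion above, where a late work item touches a group that has already been torn down. Below is a minimal userspace sketch of that check, under stated assumptions: every demo_* name and the DEMO_LOCK_MAGIC constant are hypothetical, the "free" is simulated with memset so the example stays well-defined, and none of this is the ib_sa code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant for this sketch; not taken from the thread. */
#define DEMO_LOCK_MAGIC 0xdead4eadu

struct demo_lock {
    uint32_t magic;  /* written once at init, checked on every acquire */
    int locked;
};

/* Stand-in for an object that embeds a lock, e.g. a multicast group. */
struct demo_group {
    struct demo_lock lock;
    int refcount;
};

static void demo_group_init(struct demo_group *g)
{
    g->lock.magic = DEMO_LOCK_MAGIC;
    g->lock.locked = 0;
    g->refcount = 1;
}

/* Returns 0 on success, -1 if the lock fails the magic check. */
static int demo_lock_acquire(struct demo_lock *l)
{
    if (l->magic != DEMO_LOCK_MAGIC)
        return -1;  /* a lock debugger would report "bad magic" here */
    l->locked = 1;
    return 0;
}

static void demo_lock_release(struct demo_lock *l)
{
    assert(l->magic == DEMO_LOCK_MAGIC);
    l->locked = 0;
}

/* Simulated teardown: scrub the object the way reused memory might look. */
static void demo_group_destroy(struct demo_group *g)
{
    memset(g, 0, sizeof(*g));
}

/* Returns 0 if a live object passes the check and a destroyed one fails it. */
int demo_run(void)
{
    struct demo_group g;

    demo_group_init(&g);
    if (demo_lock_acquire(&g.lock) != 0)
        return 1;
    demo_lock_release(&g.lock);

    demo_group_destroy(&g);
    /* A late work item using the stale pointer now trips the check. */
    if (demo_lock_acquire(&g.lock) != -1)
        return 2;
    return 0;
}
```

A zeroed magic field is only one way stale memory can look. The check catches the symptom, not the missing reference or uncancelled work item that caused it.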
From jlentini at netapp.com Mon Dec 4 06:50:09 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 4 Dec 2006 09:50:09 -0500 (EST) Subject: [openib-general] NFS/RDMA for Linux: client and server update release 7 In-Reply-To: References: Message-ID: At 06:12 PM 12/1/2006, Roland Dreier wrote: > What is the status of moving this code towards merging to the > upstream kernel? I covered this last month as part of my OFA Summit presentation. The slides are available here: http://openfabrics.org/conference/nov2006sc/ofa_summit_nfs_rdma.pdf From halr at voltaire.com Mon Dec 4 06:50:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Dec 2006 09:50:04 -0500 Subject: [openib-general] IPoIB and MC Group leaving Message-ID: <1165243803.25587.5906.camel@hal.voltaire.com> Roland, Currently, the IPoIB code issues what I would term a "preemptive" leave to the SA in a number of cases: ulp/ipoib/ipoib_multicast.c:ipoib_mcast_leave ... /* * Just make one shot at leaving and don't wait for a reply; * if we fail, too bad. */ ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE, 0, GFP_ATOMIC, NULL, mcast, &mcast->query); This is to make sure node is not registered in any groups. This leave may not be successful. Failure is "normal" when the subnet is starting up "fresh". There are other cases where the failure is indeed a failure. However, it is "unsafe" to issue a subsequent join until the leave has been responded to as that is the only "reliability" guarantee that the SA has received the request and processed it. I know the comment says that the result of the leave is irrelevant. However, the fact that it has been processed or not is needed for the subsequent (related) join to be issued. Pipelining of joins/leaves can only occur if they are unrelated. I'm not sure the IBA spec is clear on this. Am I wrong about this ? 
-- Hal From eitan at mellanox.co.il Mon Dec 4 07:02:38 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 04 Dec 2006 17:02:38 +0200 Subject: [openib-general] IPoIB and MC Group leaving In-Reply-To: <1165243803.25587.5906.camel@hal.voltaire.com> References: <1165243803.25587.5906.camel@hal.voltaire.com> Message-ID: <4574388E.8020207@mellanox.co.il> Actually I do not see the point in leaving all groups and immediately joining them again. Hal Rosenstock wrote: > Roland, > > Currently, the IPoIB code issues what I would term a "preemptive" leave > to the SA in a number of cases: > > ulp/ipoib/ipoib_multicast.c:ipoib_mcast_leave > ... > /* > * Just make one shot at leaving and don't wait for a reply; > * if we fail, too bad. > */ > ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec, > IB_SA_MCMEMBER_REC_MGID | > IB_SA_MCMEMBER_REC_PORT_GID | > IB_SA_MCMEMBER_REC_PKEY | > IB_SA_MCMEMBER_REC_JOIN_STATE, > 0, GFP_ATOMIC, NULL, > mcast, &mcast->query); > > This is to make sure node is not registered in any groups. This leave > may not be successful. Failure is "normal" when the subnet is starting > up "fresh". There are other cases where the failure is indeed a failure. > > However, it is "unsafe" to issue a subsequent join until the leave has > been responded to as that is the only "reliability" guarantee that the > SA has received the request and processed it. I know the comment says > that the result of the leave is irrelevant. However, the fact that it > has been processed or not is needed for the subsequent (related) join to > be issued. Pipelining of joins/leaves can only occur if they are > unrelated. I'm not sure the IBA spec is clear on this. Am I wrong about > this ? 
> > -- Hal From mst at mellanox.co.il Mon Dec 4 07:26:24 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 17:26:24 +0200 Subject: [openib-general] oops with multicast patches In-Reply-To: <20061204142214.GA5426@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> Message-ID: <20061204152624.GA8269@mellanox.co.il> > OK, I got back to this finally. First, I reproduced the crash again, > with spinlock debugger enabled. It seems we are looking at some use-after-free. > Next, I'll try adding the debugging patch Sean posted, and see what this gives. Sean, Yohad here tried adding your debugging patch and reproduced the crash. Unfortunately, none of the BUG_ON errors got triggered.
Here's the trace from the last crash: BUG: spinlock bad magic on CPU#1, ib_mad2/17805 lock: ffff810079fc4140, .magic: 00000000, .owner: /-32512, .owner_cpu: 2039181760 Call Trace: {_raw_spin_lock+28} {_spin_lock_irqsave+11} {:ib_sa:release_group+27} {:ib_sa:mcast_work_handler+1345} {:ib_mad:ib_mad_post_receive_mads+268} {_spin_unlock_irq+7} {:ib_mad:timeout_sends+0} {:ib_sa:ib_sa_mcmember_rec_callback+64} {_spin_unlock_irq+7} {thread_return+100} {:ib_sa:send_handler+74} {:ib_mad:timeout_sends+397} {run_workqueue+161} {worker_thread+0} {keventd_create_kthread+0} {worker_thread+261} {default_wake_function+0} {keventd_create_kthread+0} {default_wake_function+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} -- MST From rdreier at cisco.com Mon Dec 4 07:45:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 07:45:52 -0800 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061204110825.GA26251@2ka.mipt.ru> (Evgeniy Polyakov's message of "Mon, 4 Dec 2006 14:08:26 +0300") References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> Message-ID: > Could you convince network core developers that it is not own TCP > implementation which will mess with existing one? I'm not qualified to comment on this... > This and a lot of other changes in this driver definitely says you > implement your own stack of protocols on top of infiniband hardware. ...but I do know this driver is for 10-gig ethernet HW. - R. 
From rdreier at cisco.com Mon Dec 4 07:49:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 07:49:20 -0800 Subject: [openib-general] IPoIB and MC Group leaving In-Reply-To: <1165243803.25587.5906.camel@hal.voltaire.com> (Hal Rosenstock's message of "04 Dec 2006 09:50:04 -0500") References: <1165243803.25587.5906.camel@hal.voltaire.com> Message-ID: > This is to make sure node is not registered in any groups. This leave > may not be successful. Failure is "normal" when the subnet is starting > up "fresh". There are other cases where the failure is indeed a failure. As far as I know, IPoIB will not leave a group unless it thinks it has joined the group. What is the code path for a "preemptive" leave? - R. From rdreier at cisco.com Mon Dec 4 07:51:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 07:51:26 -0800 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: <20061203124623.GA15614@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 3 Dec 2006 14:46:23 +0200") References: <20061203124623.GA15614@mellanox.co.il> Message-ID: > > IB/ucm: Fix deadlock in cleanup > Can this go into -stable for 2.6.18.x? Yes. If you can send to stable@ that would be great. From halr at voltaire.com Mon Dec 4 08:01:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 04 Dec 2006 11:01:28 -0500 Subject: [openib-general] IPoIB and MC Group leaving In-Reply-To: References: <1165243803.25587.5906.camel@hal.voltaire.com> Message-ID: <1165248082.25587.8839.camel@hal.voltaire.com> On Mon, 2006-12-04 at 10:49, Roland Dreier wrote: > > This is to make sure node is not registered in any groups. This leave > > may not be successful. Failure is "normal" when the subnet is starting > > up "fresh". There are other cases where the failure is indeed a failure. > > As far as I know, IPoIB will not leave a group unless it thinks it has > joined the group. What is the code path for a "preemptive" leave? 
OK, maybe I have that part wrong, but what about the other part: the fact that a join is issued without waiting for the response to the leave? I think there is a race condition here, perhaps triggered by client reregistration. -- Hal > - R. From swise at opengridcomputing.com Mon Dec 4 08:20:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 04 Dec 2006 10:20:51 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> Message-ID: <1165249251.32724.26.camel@stevo-desktop> On Mon, 2006-12-04 at 07:45 -0800, Roland Dreier wrote: > > Could you convince network core developers that it is not own TCP > > implementation which will mess with existing one? > > I'm not qualified to comment on this... > I don't understand your question. > > This and a lot of other changes in this driver definitely says you > > implement your own stack of protocols on top of infiniband hardware. > > ...but I do know this driver is for 10-gig ethernet HW. > There is no SW TCP stack in this driver. The HW implements RDMA over TCP/IP/10GbE, and this is required for zero-copy RDMA over Ethernet (aka iWARP). The device is a 10 GbE device, not InfiniBand. The Ethernet driver, upon which the RDMA driver depends, acts both as a traditional Ethernet NIC for the Linux stack and as a TCP offload device for the RDMA driver, allowing establishment of RDMA connections. The Connection Manager (patch 04/13) sends messages to and receives messages from the Ethernet driver to set up HW TCP connections for doing RDMA. While this is indeed implementing TCP offload, it does _not_ integrate with the sockets layer or the Linux stack, and it does not offload sockets connections. It only supports offload connections for the RDMA driver to do iWARP. The Ammasso device is another example of this (drivers/infiniband/hw/amso1100).
Deep iSCSI adapters are another example of this. Steve. From swise at opengridcomputing.com Mon Dec 4 08:24:34 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 04 Dec 2006 10:24:34 -0600 Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver In-Reply-To: <4572194F.8060309@osdl.org> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202231329.GA10719@electric-eye.fr.zoreil.com> <4572194F.8060309@osdl.org> Message-ID: <1165249474.32724.30.camel@stevo-desktop> > >> > > > > I understood that Stephen expressed some doubts regarding the inclusion > > of TOE enabled features. > > > > Was his point addressed ? > > > > > > My comments were about different hardware. Just to clarify: Stephen is working on the Chelsio T2 HW driver. The drivers Divy and I are submitting are for the new Chelsio T3 hardware. Two drivers are being submitted: The Ethernet driver (submitted by Divy) and the RDMA driver (submitted by me) which requires the Ethernet driver. The RDMA driver will live in drivers/infiniband/hw/cxgb3 and the Ethernet driver will live in drivers/net/cxgb3. Steve. 
From swise at opengridcomputing.com Mon Dec 4 08:28:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 04 Dec 2006 10:28:25 -0600 Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data Structures In-Reply-To: <1165147639.3233.211.camel@laptopd505.fenrus.org> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224947.27014.59189.stgit@dell3.ogc.int> <1165147639.3233.211.camel@laptopd505.fenrus.org> Message-ID: <1165249706.32724.35.camel@stevo-desktop> On Sun, 2006-12-03 at 13:07 +0100, Arjan van de Ven wrote: > On Sat, 2006-12-02 at 16:49 -0600, Steve Wise wrote: > > > + > > +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, > > + struct ib_ah_attr *ah_attr) > > +{ > > + return ERR_PTR(-ENOSYS); > > +} > > > -ENOSYS is just about ALWAYS a bug in that it's guaranteed to be the > wrong error code ;) This is a method that is not supported by the iWARP transport. ENOSYS indicates this. I _think_ this is SOP for the InfiniBand subsystem. Roland, I think at one time we were talking about changing the Core to better handle this? Either with attributes/capabilities that the low level driver can set, or by setting these method ptrs to NULL so that the core handles it in the wrapper function... Steve. From rdreier at cisco.com Mon Dec 4 08:45:30 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 08:45:30 -0800 Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data Structures In-Reply-To: <1165249706.32724.35.camel@stevo-desktop> (Steve Wise's message of "Mon, 04 Dec 2006 10:28:25 -0600") References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224947.27014.59189.stgit@dell3.ogc.int> <1165147639.3233.211.camel@laptopd505.fenrus.org> <1165249706.32724.35.camel@stevo-desktop> Message-ID: > Roland, I think at one time we were talking about changing the Core to > better handle this?
Either with attributes/capabilities that the low > level driver can set, or by set these method ptrs to NULL and the core > should handle it in the wrapper function... Yes, it would make sense to change the midlayer so we have different sets of mandatory functions for IB and iWARP drivers. For example, the iwcm functions probably should be mandatory for iWARP devices, right? - R. From mst at mellanox.co.il Mon Dec 4 08:44:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 18:44:48 +0200 Subject: [openib-general] [PATCH -stable] IB/ucm: Fix deadlock in cleanup In-Reply-To: <20060403154741.GB14808@mellanox.co.il> References: <20060403154741.GB14808@mellanox.co.il> Message-ID: <20061204164448.GA15375@mellanox.co.il> ib_ucm_cleanup_events() holds file_mutex while calling ib_destroy_cm_id(). This can deadlock since ib_destroy_cm_id() flushes event handlers, and ib_ucm_event_handler() needs file_mutex, too. Therefore, drop the file_mutex during the call to ib_destroy_cm_id(). Signed-off-by: Michael S. Tsirkin Signed-off-by: Roland Dreier Acked-by: Sean Hefty --- Hello, -stable team! This patch backports commit f469b2626f48829c06e40ac799c1edf62b12048e to 2.6.19. Please consider it for 2.6.19.y - this fixes a deadlock reproduced here at Mellanox. diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 1f4f2d2..f15220a 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -161,12 +161,14 @@ static void ib_ucm_cleanup_events(struct ib_ucm_context *ctx) struct ib_ucm_event, ctx_list); list_del(&uevent->file_list); list_del(&uevent->ctx_list); + mutex_unlock(&ctx->file->file_mutex); /* clear incoming connections. 
*/ if (ib_ucm_new_cm_id(uevent->resp.event)) ib_destroy_cm_id(uevent->cm_id); kfree(uevent); + mutex_lock(&ctx->file->file_mutex); } mutex_unlock(&ctx->file->file_mutex); } -- MST From swise at opengridcomputing.com Mon Dec 4 08:50:48 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 04 Dec 2006 10:50:48 -0600 Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data Structures In-Reply-To: References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224947.27014.59189.stgit@dell3.ogc.int> <1165147639.3233.211.camel@laptopd505.fenrus.org> <1165249706.32724.35.camel@stevo-desktop> Message-ID: <1165251048.32724.37.camel@stevo-desktop> On Mon, 2006-12-04 at 08:45 -0800, Roland Dreier wrote: > > Roland, I think at one time we were talking about changing the Core to > > better handle this? Either with attributes/capabilities that the low > > level driver can set, or by set these method ptrs to NULL and the core > > should handle it in the wrapper function... > > Yes, it would make sense to change the midlayer so we have different > sets of mandatory functions for IB and iWARP drivers. For example, > the iwcm functions probably should be mandatory for iWARP devices, right? > Yes. The iWARP devices must all support the iwcm methods for sure. From mst at mellanox.co.il Mon Dec 4 08:57:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 18:57:12 +0200 Subject: [openib-general] CMA issue: SDP login compliancy In-Reply-To: <20061204085945.GC20943@mellanox.co.il> References: <20061204085945.GC20943@mellanox.co.il> Message-ID: <20061204165712.GC15375@mellanox.co.il> > Subject: CMA issue: SDP login compliancy > > Hi! > SDP compliance statement *requires* that a consumer checks the > Responder Resources field in the connection Request/Response, > verifying that it is > 0. This is part of CA 4-41 in the spec. > > However Responder Resources field does not seem to be exposed by the CMA API. 
I > think knowing this value (at least in REQ, but preferably in REP is well) is > also important for any ULP that does RDMA reads. > > Should/can CMA/UCMA be extended to pass this to the user? This might be > something we need to address before UCMA merge to avoid ABI breakage later. Steve, could you please comment on the iWarp side of things? Does iwarp connection get the number of RDMA read requests remote side can support during connection setup? Or is this IB-specific? -- MST From swise at opengridcomputing.com Mon Dec 4 09:14:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 04 Dec 2006 11:14:43 -0600 Subject: [openib-general] CMA issue: SDP login compliancy In-Reply-To: <20061204165712.GC15375@mellanox.co.il> References: <20061204085945.GC20943@mellanox.co.il> <20061204165712.GC15375@mellanox.co.il> Message-ID: <1165252483.32724.44.camel@stevo-desktop> On Mon, 2006-12-04 at 18:57 +0200, Michael S. Tsirkin wrote: > > Subject: CMA issue: SDP login compliancy > > > > Hi! > > SDP compliance statement *requires* that a consumer checks the > > Responder Resources field in the connection Request/Response, > > verifying that it is > 0. This is part of CA 4-41 in the spec. > > > > However Responder Resources field does not seem to be exposed by the CMA API. I > > think knowing this value (at least in REQ, but preferably in REP is well) is > > also important for any ULP that does RDMA reads. > > > > Should/can CMA/UCMA be extended to pass this to the user? This might be > > something we need to address before UCMA merge to avoid ABI breakage later. > > Steve, could you please comment on the iWarp side of things? > Does iwarp connection get the number of RDMA read requests remote side > can support during connection setup? > Or is this IB-specific? > I believe Sean's latest CMA patches under consideration for 2.6.20 support this from a CMA perspective. 
See http://thread.gmane.org/gmane.linux.drivers.openib/33576/focus=33580 iWARP (MPA protocol) currently doesn't exchange this information across the wire at connection setup, but there are proposals in the works to support this (It requires a wire protocol change). So eventually, iWARP will provide the remote peer's responder resources in the connection events. Steve. From mst at mellanox.co.il Mon Dec 4 09:37:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 19:37:35 +0200 Subject: [openib-general] CMA issue: SDP login compliancy In-Reply-To: <1165252483.32724.44.camel@stevo-desktop> References: <1165252483.32724.44.camel@stevo-desktop> Message-ID: <20061204173735.GD15375@mellanox.co.il> > I believe Sean's latest CMA patches under consideration for 2.6.20 > support this from a CMA perspective. > > See http://thread.gmane.org/gmane.linux.drivers.openib/33576/focus=33580 Right, looks like it's covered there. Good, thanks. -- MST From mst at mellanox.co.il Mon Dec 4 09:41:47 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 19:41:47 +0200 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <20061204174147.GE15375@mellanox.co.il> > > > IB/ucm: Fix deadlock in cleanup > > > Can this go into -stable for 2.6.18.x? > > Yes. If you can send to stable@ that would be great. I sent it for inclusion in 2.6.19.y. I don't remember what is the timeframe for 2.6.18.x, exactly. Is it still maintained now that 2.6.19 is out? -- MST From parks at lanl.gov Mon Dec 4 09:58:47 2006 From: parks at lanl.gov (Parks Fields) Date: Mon, 04 Dec 2006 10:58:47 -0700 Subject: [openib-general] Nvivia vs Serverworks chip set. 
In-Reply-To: <20061204174147.GE15375@mellanox.co.il> References: <20061204174147.GE15375@mellanox.co.il> Message-ID: <7.0.1.0.2.20061204105133.02877c88@lanl.gov> Hello all, Has anyone done any comparisons of the Mellanox MHEA28-XTC card on a motherboard with the ServerWorks chipset vs. one with the Nvidia chipset? I am most concerned with latency and IPoIB bandwidth. Also, how does a standard RH EL/ES 4.3 (2.6.9) kernel compare with a 2.6.17 or 2.6.18 kernel on the above chipsets? Thanks for any insight. parks From mshefty at ichips.intel.com Mon Dec 4 10:05:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 04 Dec 2006 10:05:41 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: <20061204152624.GA8269@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> Message-ID: <45746375.5010107@ichips.intel.com> Michael S. Tsirkin wrote: > Sean, Yohad here tried adding your debugging patch and reproduced the crash. > Unfortunately, none of the BUG_ON errors got triggered. > Here's the trace from the last crash: Okay... this will be difficult. The state of the groups all appear to be correct then. Can you provide more information on your setup? I cannot reproduce this with my systems at all, even with kernel debugging enabled. Do you apply any patches to ipoib beyond the patches that I posted?
- Sean From mshefty at ichips.intel.com Mon Dec 4 10:08:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 04 Dec 2006 10:08:18 -0800 Subject: [openib-general] IPoIB and MC Group leaving In-Reply-To: <1165248082.25587.8839.camel@hal.voltaire.com> References: <1165243803.25587.5906.camel@hal.voltaire.com> <1165248082.25587.8839.camel@hal.voltaire.com> Message-ID: <45746412.6080302@ichips.intel.com> Hal Rosenstock wrote: > The fact that a leave doesn't wait for the response and then a join is > issued. I think there is a race condition here perhaps triggered by > client reregistration. Note that the multicast interface changes correct for this. We only need to understand why Mellanox sees crashes in that code... - Sean From mst at mellanox.co.il Mon Dec 4 10:16:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 4 Dec 2006 20:16:41 +0200 Subject: [openib-general] oops with multicast patches In-Reply-To: <45746375.5010107@ichips.intel.com> References: <45746375.5010107@ichips.intel.com> Message-ID: <20061204181641.GF15375@mellanox.co.il> > > Sean, Yohad here tried adding your debugging patch and reproduced the crash. > > Unfortunately, none of the BUG_ON errors got triggered. > > Here's the trace from the last crash: > > Okay... this will be difficult. The state of the groups all appear to be > correct then. Can you provide more information on your setup? I cannot > reproduce this with my systems at all, even with kernel debugging enabled. Do > you apply any patches to ipoib beyond the patches that I posted? This seems to be reproducible with or without additional patches. Not sure what's different with our setup here. I think what Yohad was running in this instance was 2.6.17 with ipoib from 2.6.19 + your patches, but there seems to be no trouble triggering this on other kernels as well.
-- MST From ralph.campbell at qlogic.com Mon Dec 4 10:30:44 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 04 Dec 2006 10:30:44 -0800 Subject: [openib-general] [PATCH v2 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <45728E0F.9020106@voltaire.com> References: <1164911024.14800.74.camel@brick.pathscale.com> <45728E0F.9020106@voltaire.com> Message-ID: <1165257044.14800.112.camel@brick.pathscale.com> On Sun, 2006-12-03 at 10:42 +0200, Or Gerlitz wrote: > Ralph Campbell wrote: > > This patch implements the interposing DMA mapping functions to allow > > support for IOMMUs and remove the dependence on phys_to_virt(). > > > --- /dev/null Thu Jan 01 00:00:00 1970 +0000 > > +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Wed Nov 29 13:55:07 2006 -0800 > > +/** > > + * ipath_dma_map_single - Map a kernel virtual address to DMA address > > + * @device: The device for which the dma_addr is to be created > > + * @cpu_addr: The kernel virtual address > > + * @size: The size of the region in bytes > > + * @direction: The direction of the DMA > > + */ > > +static u64 ipath_dma_map_single(struct ib_device *dev, > > + void *cpu_addr, size_t size, > > + enum dma_data_direction direction) > > +{ > > + BUG_ON(!valid_dma_direction(direction)); > > + return (u64) cpu_addr; > > +} > > if ipath_dma_map_single is a NO OP > > > +/** > > + * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU > > + * @device: The device for which the DMA address was created > > + * @addr: The DMA address > > + * @size: The size of the region in bytes > > + * @dir: The direction of the DMA > > + */ > > +static void ipath_sync_single_for_cpu(struct ib_device *dev, > > + u64 addr, > > + size_t size, > > + enum dma_data_direction dir) > > +{ > > + dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); > > +} > > then why ipath_sync_single_for_cpu does something? am i just pointing on > a cleanup or there's something more deep here? > > Or. 
Good catch. There is nothing going on here. The dma_sync_single_* should be NOPs. From jsquyres at cisco.com Mon Dec 4 11:00:56 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 4 Dec 2006 14:00:56 -0500 Subject: [openib-general] .openfabrics.org names In-Reply-To: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> Message-ID: <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> Who controls the DNS for openfabrics.org? Could we get these names created? Or -- are there any objections to creating / using such names? Thanks! On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote: > The name "staging.openfabrics.org" was really intended to be > temporary until the old openfabrics.org was taken offline and > replaced with the new one. > > My $0.02 is that we should stop using staging.openfabrics.org as > soon as possible and create / start using some new names for the > server to allow for potential transparent service relocation someday. 
> > Here are some new name suggestions that could be done immediately > (with appropriate changes to DNS, apache config, ...and potentially > others): > > * git.openfabrics.org: for all git activity > * wiki.openfabrics.org: a top-level name for the wiki rather than > burying it under several layers of links on the web site > * trac.openfabrics.org: if someone creates this name, I volunteer > to finally get off my butt and install trac to see if people like it > > These are the old names and would need to be changed in DNS only > when the old server is taken offline / we're ready to move to the > new server: > > * openfabrics.org: redirect to www.openfabrics.org, and for mail > traffic > * www.openfabrics.org: main web site > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From ralph.campbell at qlogic.com Mon Dec 4 11:17:24 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 04 Dec 2006 11:17:24 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <45728B6F.6040905@voltaire.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <45728B6F.6040905@voltaire.com> Message-ID: <1165259844.14800.134.camel@brick.pathscale.com> On Sun, 2006-12-03 at 10:31 +0200, Or Gerlitz wrote: > Ralph Campbell wrote: > >> On 11/30/06, Ralph Campbell wrote: > >>> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote: > > >>>> So what did you change since v1? How do you deal with fitting 64-bit > >>>> addresses into an sg list entry that has a 32-bit dma_addr_t? > > > Although the driver compiles on 32-bit kernels, it is unsupported > > and never been tested. All known 64-bit systems don't define > > CONFIG_HIGHMEM. 
In spite of previous emails suggesting that > > page_address() can return NULL without CONFIG_HIGHMEM defined, > > the code in include/linux/mm.h doesn't allow it (assuming the > > page pointer is valid and not some random address). > > I verified this with Andrew Morton. > > Can you provide the quote from include/linux/mm.h of the code that > disallows it? looking there i don't see the enforcement. > > mmm, your consulting with Andrew Morton was not over this thread... well > Christoph Hellwig comment on the V1 thread tells a different story: > > Only for GFP_KERNEL allocations you can assume page_address is valid, > and the scatterlist passed to a SCSI LLDD can contain any type of pages. > Currently on all 64bit architectures page_address works on all pages, > but that's an implementation detail that could change any time and that > you should not rely on. > > see http://www.mail-archive.com/openib-general at openib.org/msg27132.html > > As i have mentioned in the past, this (no kvaddr for a page) comes into > play when a SCSI LLD (eg iSER, SRP) gets DIRECT I/O or AIO (SDP) pages > from user space. > > Or. I appreciate your pointing out the potential problems. I agree that future kernel changes could certainly break existing drivers. That happens frequently even when following the guarantees. I still don't understand how a valid struct page * (regardless of whether it is mapped into user space or not) can not have a valid kernel address when CONFIG_HIGHMEM is not defined for the current source base. In include/linux/mm.h, page_address() is defined as lowmem_page_address() which is defined as __va(page_to_pfn(page) << PAGE_SHIFT) which can only fail if there isn't a valid PFN for the page. I don't see how that can happen. If I am wrong, I would like to understand why. If you have suggestions for fixing these issues, please let me know. 
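Editorial sketch of the address arithmetic Ralph describes above, written as a small userspace model rather than kernel code. It assumes a flat mem_map and no CONFIG_HIGHMEM; every name prefixed with model_ or MODEL_ is hypothetical, and the constants are illustrative values, not those of any real architecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants only; real values are architecture specific. */
#define MODEL_PAGE_SHIFT  12
#define MODEL_PAGE_OFFSET UINT64_C(0xffff810000000000)
#define MODEL_NR_PAGES    16

/* Stand-in for struct page: one descriptor per physical page frame. */
struct model_page {
    unsigned long flags;
};

/* Stand-in for the flat mem_map array of page descriptors. */
static struct model_page model_mem_map[MODEL_NR_PAGES];

/* Models page_to_pfn(): the descriptor's index in mem_map is its PFN. */
static uint64_t model_page_to_pfn(const struct model_page *page)
{
    return (uint64_t)(page - model_mem_map);
}

/* Models __va(): physical address plus the direct-map base offset. */
static uint64_t model_va(uint64_t phys)
{
    return phys + MODEL_PAGE_OFFSET;
}

/* Models lowmem_page_address(): __va(page_to_pfn(page) << PAGE_SHIFT). */
static uint64_t model_lowmem_page_address(const struct model_page *page)
{
    return model_va(model_page_to_pfn(page) << MODEL_PAGE_SHIFT);
}
```

In this model the result is pure arithmetic on the descriptor's index, so it is non-zero for any descriptor inside model_mem_map. That matches the argument in the message, and it also shows why the guarantee depends on the flat, lowmem-only layout that Christoph Hellwig's quoted comment calls an implementation detail.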
From krause at cup.hp.com Mon Dec 4 11:32:22 2006 From: krause at cup.hp.com (Michael Krause) Date: Mon, 04 Dec 2006 11:32:22 -0800 Subject: [openib-general] CMA issue: SDP login compliancy In-Reply-To: <1165252483.32724.44.camel@stevo-desktop> References: <20061204085945.GC20943@mellanox.co.il> <20061204165712.GC15375@mellanox.co.il> <1165252483.32724.44.camel@stevo-desktop> Message-ID: <6.2.0.14.2.20061204112936.083de870@esmail.cup.hp.com> At 09:14 AM 12/4/2006, Steve Wise wrote: >On Mon, 2006-12-04 at 18:57 +0200, Michael S. Tsirkin wrote: > > > Subject: CMA issue: SDP login compliancy > > > > > > Hi! > > > SDP compliance statement *requires* that a consumer checks the > > > Responder Resources field in the connection Request/Response, > > > verifying that it is > 0. This is part of CA 4-41 in the spec. > > > > > > However Responder Resources field does not seem to be exposed by the > CMA API. I > > > think knowing this value (at least in REQ, but preferably in REP is > well) is > > > also important for any ULP that does RDMA reads. > > > > > > Should/can CMA/UCMA be extended to pass this to the user? This might be > > > something we need to address before UCMA merge to avoid ABI breakage > later. > > > > Steve, could you please comment on the iWarp side of things? > > Does iwarp connection get the number of RDMA read requests remote side > > can support during connection setup? > > Or is this IB-specific? > > > >I believe Sean's latest CMA patches under consideration for 2.6.20 >support this from a CMA perspective. > >See http://thread.gmane.org/gmane.linux.drivers.openib/33576/focus=33580 > >iWARP (MPA protocol) currently doesn't exchange this information across >the wire at connection setup, but there are proposals in the works to >support this (It requires a wire protocol change). So eventually, iWARP >will provide the remote peer's responder resources in the connection >events. 
SDP Hello exchanges the number of SrcAvail for each side of the communication in addition to other resource information - this provides the RDMA Read Request depth information. I am not aware of any request to modify MPA which just completed last call in November. The same type of information is exchanged during iSCSI login. The consensus was since each ULP exchanges this information during their initial ULP-level communication, there was no reason to replicate this within MPA. Mike From boris at mellanox.com Mon Dec 4 14:30:26 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Mon, 4 Dec 2006 14:30:26 -0800 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Message-ID: <1E3DCD1C63492545881FACB6063A57C16E40E8@mtiexch01.mti.com> I guess we need to have all our recent MPI fixes to be added to the support page. Pasha should keep track of those, including the one I sent to Sun. By the way, where is this support page exactly - on our web site ? Boris. -----Original Message----- From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] Sent: Sunday, December 03, 2006 5:50 AM To: Boris Shpolyansky Cc: David Costa; openib-general at openib.org; Robert Houk; Anthony Vinciguerra; Thomas Babbit Subject: Re: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Boris Shpolyansky wrote: > Hi David, > > If you are using OFED-1.1 stack and OSU MVAPICH provided with the > OFED-1.1 package as your MPI layer, > the attached patch should solve your problem. > > Please, let me know if that helped. > > Regards, > Boris, Please add this to OFED 1.1 support page Thanks, Tziporet 
From rdreier at cisco.com Mon Dec 4 20:12:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 20:12:28 -0800 Subject: [openib-general] [PATCH/RFC] busted request IRQ for PCIe ipath HCAs Message-ID: I think commit 51f65ebc (fix HT IRQ setting on HT HCAs) busted ipath on PCIe HCAs, since ipath_irq is set before pci_enable_msi(), which means it gets some value unrelated to the actual IRQ that is assigned. I needed the patch below to make 2.6.19 work with my PCIe HCAs. Bryan/anyone at Qlogic, does this look right? It worked for me, so if this is what was intended, I will queue the patch for 2.6.20 and submit to stable at kernel.org for 2.6.19.x. - R. diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 6af8968..498b596 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -851,8 +851,8 @@ static int ipath_setup_pe_config(struct int pos, ret; dd->ipath_msi_lo = 0; /* used as a flag during reset processing */ - dd->ipath_irq = pdev->irq; ret = pci_enable_msi(dd->pcidev); + dd->ipath_irq = pdev->irq; if (ret) ipath_dev_err(dd, "pci_enable_msi failed: %d, " "interrupts may not work\n", ret); From johnpol at 2ka.mipt.ru Mon Dec 4 21:07:25 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 08:07:25 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> Message-ID: <20061205050725.GA26033@2ka.mipt.ru> On Mon, Dec 04, 2006 at 07:45:52AM -0800, Roland Dreier (rdreier at cisco.com) wrote: > > This and a lot of other changes in this driver definitely says you > > implement your own stack of protocols on top of infiniband hardware. > > ...but I do know this driver is for 10-gig ethernet HW. It is for iwarp/rdma from description. 
If it is 10ge, then why does it parse incomping packet headers and implements initial tcp state machine? > - R. -- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Mon Dec 4 21:13:57 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 08:13:57 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165249251.32724.26.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <1165249251.32724.26.camel@stevo-desktop> Message-ID: <20061205051356.GA26845@2ka.mipt.ru> On Mon, Dec 04, 2006 at 10:20:51AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > This and a lot of other changes in this driver definitely says you > > > implement your own stack of protocols on top of infiniband hardware. > > > > ...but I do know this driver is for 10-gig ethernet HW. > > > > There is no SW TCP stack in this driver. The HW supports RDMA over > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet > (aka iWARP). The device is a 10 GbE device, not Infiniband. The > Ethernet driver, upon which the rdma driver depends, acts both like a > traditional Ethernet NIC for the Linux stack as well as a TCP offload > device for the RDMA driver allowing establishment of RDMA connections. > The Connection Manager (patch 04/13) sends/receives messages from the > Ethernet driver that sets up HW TCP connections for doing RDMA. While > this is indeed implementing TCP offload, it is _not_ integrating it with > the sockets layer nor the linux stack and offloading sockets > connections. Its only supporting offload connections for the RDMA > driver to do iWARP. The Ammasso device is another example of this > (drivers/infiniband/hw/amso1100). Deep iSCSI adapters are another > example of this. So what will happen when application will create a socket, bind it to that NIC, and then try to establish a TCP connection? 
How NIC will decide that received packets are from socket but not for internal TCP state machine handled by that device? As a side note, does all iwarp devices _require_ to have very limited TCP engine implemented it in its hardware, or it is possible to work with external SW stack? > Steve. -- Evgeniy Polyakov From rdreier at cisco.com Mon Dec 4 21:13:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 21:13:59 -0800 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205050725.GA26033@2ka.mipt.ru> (Evgeniy Polyakov's message of "Tue, 5 Dec 2006 08:07:25 +0300") References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> Message-ID: > It is for iwarp/rdma from description. Yes, iWARP on top of 10G ethernet. > If it is 10ge, then why does it parse incomping packet headers and > implements initial tcp state machine? To establish connections to run RDMA over, I guess. iWARP is RDMA over TCP. - R. From johnpol at 2ka.mipt.ru Mon Dec 4 21:16:58 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 08:16:58 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> Message-ID: <20061205051657.GB26845@2ka.mipt.ru> On Mon, Dec 04, 2006 at 09:13:59PM -0800, Roland Dreier (rdreier at cisco.com) wrote: > > It is for iwarp/rdma from description. > > Yes, iWARP on top of 10G ethernet. > > > If it is 10ge, then why does it parse incomping packet headers and > > implements initial tcp state machine? > > To establish connections to run RDMA over, I guess. iWARP is RDMA > over TCP. So will each new NIC implement some parts of TCP stack in theirs drivers? > - R. 
-- Evgeniy Polyakov From rdreier at cisco.com Mon Dec 4 21:27:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 04 Dec 2006 21:27:09 -0800 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205051657.GB26845@2ka.mipt.ru> (Evgeniy Polyakov's message of "Tue, 5 Dec 2006 08:16:58 +0300") References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <20061205051657.GB26845@2ka.mipt.ru> Message-ID: > So will each new NIC implement some parts of TCP stack in theirs drivers? I hope not. The driver we merged (amso1100) did it completely in FW, with a separate MAC and IP interface for the RDMA connections. I think we better understand the Chelsio driver pretty well and think it over carefully before we merge it. Thanks for pointing this stuff out. - R. From ogerlitz at voltaire.com Tue Dec 5 02:31:46 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 05 Dec 2006 12:31:46 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mappingfunctions to allow device drivers to interpose In-Reply-To: <1165259844.14800.134.camel@brick.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <45728B6F.6040905@voltaire.com> <1165259844.14800.134.camel@brick.pathscale.com> Message-ID: <45754A92.50102@voltaire.com> Ralph Campbell wrote: > I appreciate your pointing out the potential problems. I agree that > future kernel changes could certainly break existing drivers. That > happens frequently even when following the guarantees. 
Assuming "an implementation detail that could change any time and that you should not rely on" is too much for my taste, so it's left to the IB maintainer to decide whether to push it and to the kernel maintainer whether to accept it. While discussing it with the group here, a suggestion was made that a possible solution for this problem would be, on top of the suggested change, to call kmap_atomic/kunmap_atomic in the ipath low level code before/after you memcpy to/from a page provided to you by the IB consumer. But I am not sure if it solves the problem of ib_dma_map_sg for an sg provided later to the FMR code. > I still don't understand how a valid struct page * (regardless of > whether it is mapped into user space or not) can not have a valid > kernel address when CONFIG_HIGHMEM is not defined for the current > source base. In include/linux/mm.h, page_address() is defined as > lowmem_page_address() which is defined as > __va(page_to_pfn(page) << PAGE_SHIFT) > which can only fail if there isn't a valid PFN for the page. > I don't see how that can happen. Looking at the matter again, I agree it cannot fail for low memory with current kernel code. Or. From Brice.Goglin at ens-lyon.org Tue Dec 5 02:45:55 2006 From: Brice.Goglin at ens-lyon.org (Brice Goglin) Date: Tue, 05 Dec 2006 11:45:55 +0100 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165249251.32724.26.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <1165249251.32724.26.camel@stevo-desktop> Message-ID: <45754DE3.1020505@ens-lyon.org> Steve Wise wrote: > There is no SW TCP stack in this driver. The HW supports RDMA over > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet > (aka iWARP). The device is a 10 GbE device, not Infiniband. 
Then, I wonder why the driver goes in drivers/infiniband/ :) Is there really no way to only keep the actual hw infiniband there, move iwarp/rdma drivers in drivers/net/something/ and the core stuff in net/something/ ? Brice From tziporet at dev.mellanox.co.il Tue Dec 5 03:54:25 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 05 Dec 2006 13:54:25 +0200 Subject: [openib-general] .openfabrics.org names In-Reply-To: <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> Message-ID: <45755DF1.5080208@dev.mellanox.co.il> Jeff Squyres wrote: > Who controls the DNS for openfabrics.org? Could we get these names > created? Or -- are there any objections to creating / using such names? > > Thanks! > > > > If I understand correctly Johann from Qlogic is responsible for the stage server setting. Johan - can you drive this? Thanks, Tziporet From tziporet at dev.mellanox.co.il Tue Dec 5 04:03:14 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 05 Dec 2006 14:03:14 +0200 Subject: [openib-general] oops with multicast patches In-Reply-To: <45746375.5010107@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> Message-ID: <45756002.3030806@dev.mellanox.co.il> Sean Hefty wrote: > > Okay... this will be difficult. The state of the groups all appear to be > correct then. Can you provide more information on your setup? I cannot > reproduce this with my systems at all, even with kernel debugging enabled. Do > you apply any patches to ipoib beyond the patches that I posted? > > - Sean > > Dotan will try to isolate the test that cause this failure and sent it to you, so you can debug it yourself. 
Tziporet

From eeb at bartonsoftware.com Tue Dec 5 04:22:13 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Tue, 5 Dec 2006 12:22:13 GMT
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
Message-ID: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>

Hi,

We'd dearly like some help to understand why we seem to be having
performance issues with OFED. When we run a lustre network bandwidth
benchmark, we find significant performance degradation on OFED versus
Voltaire...

              Premap (256 RDMA frags)    Map on demand (1 RDMA frag)
              Voltaire  OFED  Ratio      Voltaire  OFED  Ratio
Writes MB/s   682       567   83 %       577       436   75 %
Reads MB/s    658       554   84 %       555       432   77 %

These tests measure the bandwidth of 1MByte transfers pipelined 8 deep.
All hardware/software was the same, apart from the IB stack and the
lustre network driver.

The architecture of the lustre network drivers for OFED and Voltaire is
almost identical. Both use RC QPs with the same control message protocol
to set up bulk data transfers using RDMA WRITE. Control messages use a
credit flow protocol to ensure that they are only sent when buffers are
posted to receive them. Concurrent transfers over the same QP are
supported so that lustre can pipeline bulk I/O.

The only difference between the lustre network drivers is that the
Voltaire driver has a single global CQ and the OFED driver has 1 CQ per
QP. However the measurements above are for a single pair of nodes - in
this case both implementations use a single CQ.

By default, the drivers pre-map all of physical memory so each RDMA
consists of page fragments. However, we can also compile both drivers to
map on demand using FMR so that RDMA is not fragmented. The results
above compare both methods and although both drivers perform worse when
mapping, the OFED driver takes the bigger hit.

We'd be delighted if anyone can shed any light or can suggest any steps
we should take to discover the reason.
We're also very willing to provide assistance if any of the OpenFabrics developers wants to duplicate the setup. -- Cheers, Eric From bugzilla-daemon at openib.org Tue Dec 5 05:45:21 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 5 Dec 2006 05:45:21 -0800 (PST) Subject: [openib-general] [Bug 306] New: Run IPOIB high availability when primary I/F == secondary I/F does not return an error Message-ID: <20061205134521.559522283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=306 Summary: Run IPOIB high availability when primary I/F == secondary I/F does not return an error Product: OpenFabrics Linux Version: gen2 Platform: All OS/Version: Other Status: NEW Severity: normal Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: yohadd at mellanox.co.il When configuring the IPOIB high availability with primary I/F == secondary I/F, the high availability script (ipoib_ha.pl) doesn't return an error. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Tue Dec 5 05:56:11 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 5 Dec 2006 05:56:11 -0800 (PST) Subject: [openib-general] [Bug 307] New: Configuring IPOIB HA with invalid I/F does not return an error Message-ID: <20061205135611.30F022283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=307 Summary: Configuring IPOIB HA with invalid I/F does not return an error Product: OpenFabrics Linux Version: gen2 Platform: All OS/Version: Other Status: NEW Severity: normal Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: yohadd at mellanox.co.il When configuring the IPOIB HA with invalid I/F, the HA script (ipoib_ha.pl) notify about the wrong I/F, but it continue to run with the wrong configuration (does not exit with an error). 
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From dotanb at dev.mellanox.co.il Tue Dec 5 06:32:08 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 05 Dec 2006 16:32:08 +0200
Subject: [openib-general] oops with multicast patches
In-Reply-To: <45756002.3030806@dev.mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il>
Message-ID: <457582E8.8030705@dev.mellanox.co.il>

Hi Sean.

Tziporet Koren wrote:
>Dotan will try to isolate the test that cause this failure and sent it
>to you, so you can debug it yourself.
>
>Tziporet
>
>

We got a machine crash on a machine with the following attributes:

*************************************************************
Host Architecture : x86_64
Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel Version : 2.6.9-34.ELsmp
GCC Version : gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
Memory size : 2055996 kB
HCA ID(s) : mthca0
HCA model(s) : 23108
Board(s) : MT_0030000001
*************************************************************

I attached the test to this email.

This test does the following scenario:
    restart the driver
    start a user level application that allocate N multicast groups (it is being executed in the background)
    sleep for a while (to let the later application get the mcgs)
    start the SM (in the background)
    sleep for a while
    kill the SM
    wait until the user level application will ends

We do it in a loop for the following values of N: max_mcast -1, max_mcast -2, max_mcast -3

I executed the following command (only one side is needed):
# ./ib_mcast_full.bs --server

The test need to be executed when the driver was loaded and the opensm isn't executed in the background.
The user level application uses the VL library which can be found in: https://openib.org/svn/trunk/contrib/mellanox/ibtp/common/tools/vl I hope that this will help you .... Dotan -------------- next part -------------- A non-text attachment was scrubbed... Name: ib_mcast_full.tar.gz Type: application/x-gzip Size: 4407 bytes Desc: not available URL: From swise at opengridcomputing.com Tue Dec 5 07:02:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 09:02:05 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205050725.GA26033@2ka.mipt.ru> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> Message-ID: <1165330925.16087.13.camel@stevo-desktop> On Tue, 2006-12-05 at 08:07 +0300, Evgeniy Polyakov wrote: > On Mon, Dec 04, 2006 at 07:45:52AM -0800, Roland Dreier (rdreier at cisco.com) wrote: > > > This and a lot of other changes in this driver definitely says you > > > implement your own stack of protocols on top of infiniband hardware. > > > > ...but I do know this driver is for 10-gig ethernet HW. > > It is for iwarp/rdma from description. > If it is 10ge, then why does it parse incomping packet headers and > implements initial tcp state machine? > Its not implementing the TCP state machine at all. Its implementing the MPA state machine (see the iWARP internet drafts). These packets are TCP payload. MPA is used to negotiate RDMA mode on a TCP connection. This entails an exchange of 2 messages on the TCP connection. Once this is exchanged and both side agree, the connection is bound to an RDMA QP and the connection moved into RDMA mode. From that point on, all IO is done via the post_send() and post_recv(). Steve. 
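Editorial sketch of the connection lifecycle Steve describes above: a TCP connection carries a two-message MPA exchange and, once both sides agree, moves into RDMA mode. This is a toy model of the initiator side only. All names are hypothetical and none come from the driver under review; the reject path is an assumption, since the message only covers the case where both sides agree.

```c
#include <stdbool.h>

/* Hypothetical connection states for the initiator side. */
enum model_conn_state {
    MODEL_TCP_ESTABLISHED, /* plain TCP byte stream, no MPA traffic yet */
    MODEL_MPA_REQ_SENT,    /* first of the two MPA messages is on the wire */
    MODEL_RDMA_MODE,       /* both sides agreed; connection is bound to a QP */
    MODEL_CLOSED           /* negotiation failed or an unexpected event arrived */
};

/* Hypothetical events driving the negotiation. */
enum model_conn_event {
    MODEL_EV_SEND_MPA_REQ,
    MODEL_EV_RECV_MPA_REPLY_OK,
    MODEL_EV_RECV_MPA_REPLY_REJECT /* assumed; not described in the thread */
};

/* Advance the model by one event; anything unexpected closes the connection. */
static enum model_conn_state
model_conn_step(enum model_conn_state state, enum model_conn_event ev)
{
    switch (state) {
    case MODEL_TCP_ESTABLISHED:
        if (ev == MODEL_EV_SEND_MPA_REQ)
            return MODEL_MPA_REQ_SENT;
        break;
    case MODEL_MPA_REQ_SENT:
        if (ev == MODEL_EV_RECV_MPA_REPLY_OK)
            return MODEL_RDMA_MODE;
        break;
    case MODEL_RDMA_MODE:
    case MODEL_CLOSED:
        break;
    }
    return MODEL_CLOSED;
}

/* Work requests (post_send/post_recv) only make sense once in RDMA mode. */
static bool model_can_post_work_requests(enum model_conn_state state)
{
    return state == MODEL_RDMA_MODE;
}
```

The point of the model is that the host side only tracks the MPA negotiation carried as TCP payload; nothing in it touches sequence numbers, retransmission, or any other TCP state, which the messages above say is handled by the hardware.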
From swise at opengridcomputing.com Tue Dec 5 07:03:35 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 09:03:35 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> Message-ID: <1165331015.16087.16.camel@stevo-desktop> On Mon, 2006-12-04 at 21:13 -0800, Roland Dreier wrote: > > It is for iwarp/rdma from description. > > Yes, iWARP on top of 10G ethernet. > > > If it is 10ge, then why does it parse incomping packet headers and > > implements initial tcp state machine? > > To establish connections to run RDMA over, I guess. iWARP is RDMA > over TCP. > The driver uses messages exchanged to and from the HW via the Ethernet driver to setup TCP connections. No TCP processing is done in the host. The hardware does all the TCP processing. Steve. From halr at voltaire.com Tue Dec 5 07:01:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 10:01:36 -0500 Subject: [openib-general] {PATCH 0/2] OpenSM and osmtest: Add support for SA InformInfoRecord Message-ID: <1165330881.25587.66892.camel@hal.voltaire.com> OpenSM and osmtest: Add support for SA InformInfoRecord The following patch series adds initial SA InformInfoRecord support into OpenSM and also adds some tests for this and InformInfo into osmtest. There will also be subsequent patches for enhancements to the SA InformInfo and InformInfoRecord support. 
Signed-off-by: Hal Rosenstock From swise at opengridcomputing.com Tue Dec 5 07:07:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 09:07:33 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205051356.GA26845@2ka.mipt.ru> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <1165249251.32724.26.camel@stevo-desktop> <20061205051356.GA26845@2ka.mipt.ru> Message-ID: <1165331253.16087.21.camel@stevo-desktop> On Tue, 2006-12-05 at 08:13 +0300, Evgeniy Polyakov wrote: > On Mon, Dec 04, 2006 at 10:20:51AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > > This and a lot of other changes in this driver definitely says you > > > > implement your own stack of protocols on top of infiniband hardware. > > > > > > ...but I do know this driver is for 10-gig ethernet HW. > > > > > > > There is no SW TCP stack in this driver. The HW supports RDMA over > > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet > > (aka iWARP). The device is a 10 GbE device, not Infiniband. The > > Ethernet driver, upon which the rdma driver depends, acts both like a > > traditional Ethernet NIC for the Linux stack as well as a TCP offload > > device for the RDMA driver allowing establishment of RDMA connections. > > The Connection Manager (patch 04/13) sends/receives messages from the > > Ethernet driver that sets up HW TCP connections for doing RDMA. While > > this is indeed implementing TCP offload, it is _not_ integrating it with > > the sockets layer nor the linux stack and offloading sockets > > connections. Its only supporting offload connections for the RDMA > > driver to do iWARP. The Ammasso device is another example of this > > (drivers/infiniband/hw/amso1100). Deep iSCSI adapters are another > > example of this. 
> > So what will happen when application will create a socket, bind it to > that NIC, and then try to establish a TCP connection? How NIC will > decide that received packets are from socket but not for internal TCP > state machine handled by that device? The HW knows which TCP connections are offloaded by virtue of the fact that they were setup via the RDMA subsystem. Any other TCP traffic (and all other non TCP traffic) gets passed to the host stack. > > As a side note, does all iwarp devices _require_ to have very > limited TCP engine implemented it in its hardware, or it is possible > to work with external SW stack? It is possible, but not very interesting. One could implement an all-software iWARP stack. The iWARP protocols are just TCP payload and _could_ be implemented in user mode on top of a socket. However, this isn't very interesting: the goal of iWARP (and RDMA for that matter) is to allow direct placement of data into user memory with 0 copies done by the host CPU. low latency. Steve. From halr at voltaire.com Tue Dec 5 07:02:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 10:02:43 -0500 Subject: [openib-general] [PATCH 2/2]: osmtest/osmtest.c: Add tests for SA InformInfoRecord and InformInfo Message-ID: <1165330909.25587.66946.camel@hal.voltaire.com> osmtest/osmtest.c: Add tests for SA InformInfoRecord and InformInfo The following patch adds some tests for SA InformInfoRecord and InformInfo into osmtest. 
Signed-off-by: Hal Rosenstock diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index a21e8ca..b3f2bb4 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -69,6 +69,18 @@ #define POOL_MIN_ITEMS 64 #define GUID_ARRAY_SIZE 64 +typedef struct _osmtest_inform_info +{ + boolean_t subscribe; + ib_net32_t qpn; +} osmtest_inform_info_t; + +typedef struct _osmtest_inform_info_rec +{ + ib_gid_t subscriber_gid; + ib_net16_t subscriber_enum; +} osmtest_inform_info_rec_t; + typedef enum _osmtest_token_val { OSMTEST_TOKEN_COMMENT = 0, @@ -4814,6 +4826,119 @@ osmtest_sminfo_record_request( OSM_LOG_EXIT( &p_osmt->log ); return ( status ); } + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osmtest_informinfo_request( + IN osmtest_t * const p_osmt, + IN ib_net16_t attr_id, + IN uint8_t method, + IN void *p_options, + IN OUT osmtest_req_context_t * const p_context ) +{ + ib_api_status_t status = IB_SUCCESS; + osmv_user_query_t user; + osmv_query_req_t req; + ib_inform_info_t rec; + ib_inform_info_record_t record; + ib_mad_t *p_mad; + osmtest_inform_info_t *p_inform_info_opt; + osmtest_inform_info_rec_t *p_inform_info_rec_opt; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_informinfo_request ); + + /* + * Do a blocking query for these records in the subnet. + * The result is returned in the result field of the caller's + * context structure. + * + * The query structures are locals. 
+ */ + memset( &req, 0, sizeof( req ) ); + memset( &user, 0, sizeof( user ) ); + memset( &rec, 0, sizeof( rec ) ); + memset( &record, 0, sizeof( record ) ); + + p_context->p_osmt = p_osmt; + user.attr_id = attr_id; + if (attr_id == IB_MAD_ATTR_INFORM_INFO_RECORD) + { + user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) ); + p_inform_info_rec_opt = p_options; + if (p_inform_info_rec_opt->subscriber_gid.unicast.prefix != 0 && + p_inform_info_rec_opt->subscriber_gid.unicast.interface_id != 0) + { + record.subscriber_gid = p_inform_info_rec_opt->subscriber_gid; + user.comp_mask = IB_IIR_COMPMASK_SUBSCRIBERGID; + } + record.subscriber_enum = cl_hton16(p_inform_info_rec_opt->subscriber_enum); + user.comp_mask |= IB_IIR_COMPMASK_ENUM; + user.p_attr = &record; + } + else + { + user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( rec ) >> 3 ) ); + /* comp mask bits below are for InformInfoRecord rather than InformInfo */ + /* as currently no comp mask bits defined for InformInfo!!! 
*/ + user.comp_mask = IB_IIR_COMPMASK_SUBSCRIBE; + p_inform_info_opt = p_options; + rec.subscribe = p_inform_info_opt->subscribe; + if (p_inform_info_opt->qpn) + { + rec.g_or_v.generic.qpn_resp_time_val = cl_hton32(p_inform_info_opt->qpn) >> 8; + user.comp_mask |= IB_IIR_COMPMASK_QPN; + } + user.p_attr = &rec; + } + user.method = method; + + req.query_type = OSMV_QUERY_USER_DEFINED; + req.timeout_ms = p_osmt->opt.transaction_timeout; + req.retry_cnt = p_osmt->opt.retry_count; + + req.flags = OSM_SA_FLAGS_SYNC; + req.query_context = p_context; + req.pfn_query_cb = osmtest_query_res_cb; + req.p_query_input = &user; + req.sm_key = 0; + + status = osmv_query_sa( p_osmt->h_bind, &req ); + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_informinfo_request: ERR 008E: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + goto Exit; + } + + status = p_context->result.status; + + if( status != IB_SUCCESS ) + { + if (status != IB_INVALID_PARAMETER) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_informinfo_request: ERR 008F: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + } + if( status == IB_REMOTE_ERROR ) + { + p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw ); + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_informinfo_request: " + "Remote error = %s\n", + ib_get_mad_status_str( p_mad )); + + status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK ); + } + goto Exit; + } + + Exit: + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} #endif /********************************************************************** @@ -5421,6 +5546,8 @@ osmtest_validate_against_db( IN osmtest_ { ib_api_status_t status = IB_SUCCESS; ib_gid_t portgid, mgid; + osmtest_inform_info_t inform_info_opt; + osmtest_inform_info_rec_t inform_info_rec_opt; #ifdef VENDOR_RMPP_SUPPORT ib_net64_t sm_key; ib_net16_t test_lid; @@ -5684,6 +5811,121 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* 
InformInfoRecord tests */ + memset( &inform_info_opt, 0, sizeof( inform_info_opt ) ); + memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_SET, &inform_info_rec_opt, &context ); + if ( status == IB_SUCCESS ) + goto Exit; + else + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_informinfo_request: InformInfoRecord " + "IS EXPECTED ERROR ^^^^\n"); + } + + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* InformInfo tests */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_GET, &inform_info_opt, &context ); + if ( status == IB_SUCCESS ) + goto Exit; + else + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_informinfo_request: InformInfo " + "IS EXPECTED ERROR ^^^^\n"); + } + + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, &inform_info_opt, &context ); + if ( status == IB_SUCCESS ) + goto Exit; + else + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_informinfo_request: InformInfo UnSubscribe " + "IS EXPECTED ERROR ^^^^\n"); + } + + /* Now subscribe */ + inform_info_opt.subscribe = TRUE; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, &inform_info_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Now unsubscribe (QPN needs to be 1 to work) */ + inform_info_opt.subscribe = FALSE; + inform_info_opt.qpn = 1; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, 
&inform_info_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Now subscribe again */ + inform_info_opt.subscribe = TRUE; + inform_info_opt.qpn = 1; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, &inform_info_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Subscribe over existing subscription */ + inform_info_opt.qpn = 0; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, &inform_info_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* More InformInfoRecord tests */ + /* RID lookup */ + ib_gid_set_default( &inform_info_rec_opt.subscriber_gid, + p_osmt->local_port.port_guid ); + inform_info_rec_opt.subscriber_enum = 1; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + inform_info_rec_opt.subscriber_enum = 0; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Get all InformInfoRecords */ + memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Cleanup subscriptions before further testing */ + inform_info_opt.subscribe = FALSE; + inform_info_opt.qpn = 1; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, &inform_info_opt, &context ); + if ( status != IB_SUCCESS ) + 
goto Exit; + if (lmc != 0) { test_lid = cl_ntoh16( p_osmt->local_port.lid + 1 ); From halr at voltaire.com Tue Dec 5 07:02:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 10:02:03 -0500 Subject: [openib-general] [PATCH 1/2] OpenSM: Add support for SA InformInfoRecord Message-ID: <1165330893.25587.66894.camel@hal.voltaire.com> OpenSM: Add support for SA InformInfoRecord The following patch adds initial SA InformInfoRecord support into OpenSM. Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h index 40fec93..0bc8810 100644 --- a/osm/include/opensm/osm_inform.h +++ b/osm/include/opensm/osm_inform.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -254,6 +254,72 @@ osm_infr_get_by_rid( * Inform Record, osm_infr_construct, osm_infr_destroy *********/ +/****f* OpenSM: Inform Record/osm_infr_get_by_gid +* NAME +* osm_infr_get_by_gid +* +* DESCRIPTION +* Find a matching osm_infr_t in the subnet DB by inform_info_record +* subscriber GID +* +* SYNOPSIS +*/ +osm_infr_t* +osm_infr_get_by_gid( + IN osm_subn_t const *p_subn, + IN osm_log_t *p_log, + IN ib_inform_info_record_t* const p_inf_rec ); +/* +* PARAMETERS +* p_subn +* [in] Pointer to the subnet object +* +* p_log +* [in] Pointer to the log object +* +* p_inf_rec +* [in] Pointer to an inform_info record with the search +* subscriber GID +* +* RETURN +* The matching osm_infr_t +* SEE ALSO +* Inform Record, osm_infr_construct, osm_infr_destroy +*********/ + +/****f* OpenSM: Inform Record/osm_infr_get_by_enum +* NAME +* osm_infr_get_by_enum +* +* DESCRIPTION +* Find a matching osm_infr_t in the subnet DB by inform_info_record +* subscriber enum +* +* SYNOPSIS +*/ +osm_infr_t* +osm_infr_get_by_enum( + IN 
osm_subn_t const *p_subn, + IN osm_log_t *p_log, + IN ib_inform_info_record_t* const p_inf_rec ); +/* +* PARAMETERS +* p_subn +* [in] Pointer to the subnet object +* +* p_log +* [in] Pointer to the log object +* +* p_inf_rec +* [in] Pointer to an inform_info record with the search +* subscriber enum +* +* RETURN +* The matching osm_infr_t +* SEE ALSO +* Inform Record, osm_infr_construct, osm_infr_destroy +*********/ + /****f* OpenSM: Inform Record/osm_infr_get_by_rec * NAME * osm_infr_get_by_rec diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h index 4439339..73af838 100644 --- a/osm/include/opensm/osm_msgdef.h +++ b/osm/include/opensm/osm_msgdef.h @@ -191,6 +191,7 @@ enum OSM_MSG_MAD_VL_ARB, OSM_MSG_MAD_SLVL, OSM_MSG_MAD_GUIDINFO_RECORD, + OSM_MSG_MAD_INFORM_INFO_RECORD, #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) OSM_MSG_MAD_MULTIPATH_RECORD, #endif diff --git a/osm/include/opensm/osm_sa_informinfo.h b/osm/include/opensm/osm_sa_informinfo.h index 2e57f43..c22c1eb 100644 --- a/osm/include/opensm/osm_sa_informinfo.h +++ b/osm/include/opensm/osm_sa_informinfo.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Declaration of osm_infr_rcv_t. @@ -108,6 +107,7 @@ typedef struct _osm_infr_rcv osm_mad_pool_t *p_mad_pool; osm_log_t *p_log; cl_plock_t *p_lock; + cl_qlock_pool_t pool; } osm_infr_rcv_t; /* * FIELDS @@ -123,6 +123,10 @@ typedef struct _osm_infr_rcv * p_lock * Pointer to the serializing lock. * +* pool +* Pool of linkable InformInfo Record objects used to +* generate the query response. 
+* * SEE ALSO * InformInfo Receiver object *********/ @@ -262,6 +266,34 @@ osm_infr_rcv_process( * InformInfo Receiver *********/ +/****f* OpenSM: InformInfo Record Receiver/osm_infir_rcv_process +* NAME +* osm_infir_rcv_process +* +* DESCRIPTION +* Process the InformInfo Record request. +* +* SYNOPSIS +*/ +void +osm_infir_rcv_process( + IN osm_infr_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_infr_rcv_t object. +* +* p_madw +* [in] Pointer to the MAD Wrapper containing the MAD +* that contains the node's InformInfo Record attribute. +* NOTES +* This function processes a InformInfo Record attribute. +* +* SEE ALSO +* InformInfo Receiver +*********/ + END_C_DECLS #endif /* _OSM_SA_INFR_H_ */ diff --git a/osm/include/opensm/osm_sa_informinfo_ctrl.h b/osm/include/opensm/osm_sa_informinfo_ctrl.h index 21dd0a7..a14c5b4 100644 --- a/osm/include/opensm/osm_sa_informinfo_ctrl.h +++ b/osm/include/opensm/osm_sa_informinfo_ctrl.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * @@ -103,6 +103,7 @@ typedef struct _osm_infr_rcv_ctrl osm_log_t *p_log; cl_dispatcher_t *p_disp; cl_disp_reg_handle_t h_disp; + cl_disp_reg_handle_t h_disp2; } osm_infr_rcv_ctrl_t; /* * FIELDS diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c index 178dba2..92647ef 100644 --- a/osm/opensm/osm_inform.c +++ b/osm/opensm/osm_inform.c @@ -94,7 +94,7 @@ osm_infr_init( /* what else do we need in the inform_record ??? 
*/ /* copy the contents of the provided informinfo */ - memcpy(p_infr,p_infr_rec, sizeof(osm_infr_t)); + memcpy(p_infr, p_infr_rec, sizeof(osm_infr_t)); } /********************************************************************** @@ -143,6 +143,54 @@ __match_rid_of_inf_rec( } /********************************************************************** + * Match an infr by the subscriber GID of the stored inform_info_record + **********************************************************************/ +static +cl_status_t +__match_gid_of_inf_rec( + IN const cl_list_item_t* const p_list_item, + IN void* context ) +{ + ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t *)context; + osm_infr_t* p_infr = (osm_infr_t*)p_list_item; + int32_t count; + + count = memcmp( + &p_infr->inform_record, + p_infr_rec, + sizeof(p_infr_rec->subscriber_gid) ); + + if(count == 0) + return CL_SUCCESS; + else + return CL_NOT_FOUND; +} + +/********************************************************************** + * Match an infr by the subscriber enum of the stored inform_info_record + **********************************************************************/ +static +cl_status_t +__match_enum_of_inf_rec( + IN const cl_list_item_t* const p_list_item, + IN void* context ) +{ + ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t *)context; + osm_infr_t* p_infr = (osm_infr_t*)p_list_item; + int32_t count; + + count = memcmp( + &p_infr->inform_record.subscriber_enum, + &p_infr_rec->subscriber_enum, + sizeof(p_infr_rec->subscriber_enum) ); + + if(count == 0) + return CL_SUCCESS; + else + return CL_NOT_FOUND; +} + +/********************************************************************** **********************************************************************/ osm_infr_t* osm_infr_get_by_rid( @@ -168,6 +216,54 @@ osm_infr_get_by_rid( /********************************************************************** **********************************************************************/ +osm_infr_t* 
+osm_infr_get_by_gid( + IN osm_subn_t const *p_subn, + IN osm_log_t *p_log, + IN ib_inform_info_record_t* const p_infr_rec ) +{ + cl_list_item_t* p_list_item; + + OSM_LOG_ENTER( p_log, osm_infr_get_by_gid ); + + p_list_item = cl_qlist_find_from_head( + &p_subn->sa_infr_list, + __match_gid_of_inf_rec, + p_infr_rec ); + + if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) ) + p_list_item = NULL; + + OSM_LOG_EXIT( p_log ); + return (osm_infr_t*)p_list_item; +} + +/********************************************************************** + **********************************************************************/ +osm_infr_t* +osm_infr_get_by_enum( + IN osm_subn_t const *p_subn, + IN osm_log_t *p_log, + IN ib_inform_info_record_t* const p_infr_rec ) +{ + cl_list_item_t* p_list_item; + + OSM_LOG_ENTER( p_log, osm_infr_get_by_enum ); + + p_list_item = cl_qlist_find_from_head( + &p_subn->sa_infr_list, + __match_enum_of_inf_rec, + p_infr_rec ); + + if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) ) + p_list_item = NULL; + + OSM_LOG_EXIT( p_log ); + return (osm_infr_t*)p_list_item; +} + +/********************************************************************** + **********************************************************************/ void __dump_all_informs( IN osm_subn_t const *p_subn, diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index c979365..2667e49 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Implementation of osm_infr_rcv_t. 
@@ -67,6 +66,26 @@ #include #include +#define OSM_IIR_RCV_POOL_MIN_SIZE 32 +#define OSM_IIR_RCV_POOL_GROW_SIZE 32 + +typedef struct _osm_iir_item +{ + cl_pool_item_t pool_item; + ib_inform_info_record_t rec; +} osm_iir_item_t; + +typedef struct _osm_iir_search_ctxt +{ + const ib_inform_info_record_t* p_rcvd_rec; + ib_net64_t comp_mask; + cl_qlist_t* p_list; + ib_gid_t subscriber_gid; + ib_net16_t subscriber_enum; + osm_infr_rcv_t* p_rcv; + osm_physp_t* p_req_physp; +} osm_iir_search_ctxt_t; + /********************************************************************** **********************************************************************/ void @@ -74,6 +93,7 @@ osm_infr_rcv_construct( IN osm_infr_rcv_t* const p_rcv ) { memset( p_rcv, 0, sizeof(*p_rcv) ); + cl_qlock_pool_construct( &p_rcv->pool ); } /********************************************************************** @@ -85,7 +105,7 @@ osm_infr_rcv_destroy( CL_ASSERT( p_rcv ); OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_destroy ); - + cl_qlock_pool_destroy( &p_rcv->pool ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -112,7 +132,12 @@ osm_infr_rcv_init( p_rcv->p_resp = p_resp; p_rcv->p_mad_pool = p_mad_pool; - status = IB_SUCCESS; + status = cl_qlock_pool_init( &p_rcv->pool, + OSM_IIR_RCV_POOL_MIN_SIZE, + 0, + OSM_IIR_RCV_POOL_GROW_SIZE, + sizeof(osm_iir_item_t), + NULL, NULL, NULL ); OSM_LOG_EXIT( p_rcv->p_log ); return( status ); @@ -333,6 +358,339 @@ __osm_infr_rcv_respond( } /********************************************************************** + **********************************************************************/ +static void +__osm_sa_inform_info_rec_by_comp_mask( + IN osm_infr_rcv_t* const p_rcv, + IN const osm_infr_t* const p_infr, + osm_iir_search_ctxt_t* const p_ctxt ) +{ + const ib_inform_info_record_t* p_rcvd_rec = NULL; + ib_net64_t comp_mask; + ib_net64_t portguid; + osm_port_t * p_subscriber_port; + osm_physp_t * p_subscriber_physp; + const osm_physp_t* p_req_physp; + osm_infr_t* p_infr_rec = NULL; + 
ib_inform_info_record_t inform_info_rec; + osm_iir_item_t* p_rec_item; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_inform_info_rec_by_comp_mask ); + + p_rcvd_rec = p_ctxt->p_rcvd_rec; + comp_mask = p_ctxt->comp_mask; + p_req_physp = p_ctxt->p_req_physp; + + /* Both subscriber GID and enum specified */ + if ((comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) && + (comp_mask & IB_IIR_COMPMASK_ENUM)) + { + inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid; + inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum; + p_infr_rec = osm_infr_get_by_rid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec); + goto Done; + } + + if (comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) + { + inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid; + p_infr_rec = osm_infr_get_by_gid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec); + goto Done; + } + + if (comp_mask & IB_IIR_COMPMASK_ENUM) + { + inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum; + p_infr_rec = osm_infr_get_by_enum(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec); + goto Done; + } + + /* Implement any other needed search cases */ + +Done: + if (p_infr_rec) + { + /* Ensure pkey is shared before returning any records */ + portguid = p_infr_rec->inform_record.subscriber_gid.unicast.interface_id; + p_subscriber_port = osm_get_port_by_guid( p_rcv->p_subn, portguid); + if ( p_subscriber_port == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sa_inform_info_rec_by_comp_mask: ERR 430D: " + "Invalid subscriber port guid: 0x%016" PRIx64 "\n", + cl_ntoh64(portguid) ); + goto Exit; + } + + /* get the subscriber InformInfo physical port */ + p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port); + /* make sure that the requester and subscriber port can access each other + according to the current partitioning. */ + if (! 
osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp)) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_inform_info_rec_by_comp_mask: " + "requester and subscriber ports don't share pkey\n" ); + goto Exit; + } + + p_rec_item = (osm_iir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sa_inform_info_rec_by_comp_mask: ERR 430E: " + "cl_qlock_pool_get failed\n" ); + goto Exit; + } + + memcpy((void *)&p_rec_item->rec, (void *)&p_infr_rec->inform_record, sizeof(ib_inform_info_record_t)); + cl_qlist_insert_tail( p_ctxt->p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + } + +Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +static void +__osm_sa_inform_info_rec_by_comp_mask_cb( + IN cl_list_item_t* const p_list_item, + IN void* context ) +{ + const osm_infr_t* const p_infr = (osm_infr_t *)p_list_item; + osm_iir_search_ctxt_t* const p_ctxt = (osm_iir_search_ctxt_t *)context; + + __osm_sa_inform_info_rec_by_comp_mask( p_ctxt->p_rcv, p_infr, p_ctxt ); +} + +/********************************************************************** +Received a Get(InformInfoRecord) or GetTable(InformInfoRecord) MAD +**********************************************************************/ +static void +osm_infr_rcv_process_get_method( + IN osm_infr_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) +{ + ib_sa_mad_t* p_rcvd_mad; + const ib_inform_info_record_t* p_rcvd_rec; + ib_inform_info_record_t* p_resp_rec; + cl_qlist_t rec_list; + osm_madw_t* p_resp_madw; + ib_sa_mad_t* p_resp_sa_mad; + uint32_t num_rec, pre_trim_num_rec; +#ifndef VENDOR_RMPP_SUPPORT + uint32_t trim_num_rec; +#endif + uint32_t i, j; + osm_iir_search_ctxt_t context; + osm_iir_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + osm_physp_t* p_req_physp; + + OSM_LOG_ENTER( 
p_rcv->p_log, osm_infr_rcv_process_get_method ); + + CL_ASSERT( p_madw ); + p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_rcvd_rec = + (ib_inform_info_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad ); + + /* update the requester physical port. */ + p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log, + p_rcv->p_subn, + osm_madw_get_mad_addr_ptr(p_madw) ); + if (p_req_physp == NULL) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_get_method: ERR 4309: " + "Cannot find requester physical port\n" ); + goto Exit; + } + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + osm_dump_inform_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); + + cl_qlist_init( &rec_list ); + + context.p_rcvd_rec = p_rcvd_rec; + context.p_list = &rec_list; + context.comp_mask = p_rcvd_mad->comp_mask; + context.subscriber_gid = p_rcvd_rec->subscriber_gid; + context.subscriber_enum = p_rcvd_rec->subscriber_enum; + context.p_rcv = p_rcv; + context.p_req_physp = p_req_physp; + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_infr_rcv_process_get_method: " + "Query Subscriber GID:0x%016" PRIx64 " : 0x%016" PRIx64 "(%02X) Enum:0x%X(%02X)\n", + cl_ntoh64(p_rcvd_rec->subscriber_gid.unicast.prefix), + cl_ntoh64(p_rcvd_rec->subscriber_gid.unicast.interface_id), + (p_rcvd_mad->comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) != 0, + cl_ntoh16(p_rcvd_rec->subscriber_enum), + (p_rcvd_mad->comp_mask & IB_IIR_COMPMASK_ENUM) != 0 ); + + /* Only Enum 0 is supported currently!!! 
*/ + if (((p_rcvd_mad->comp_mask & IB_IIR_COMPMASK_ENUM) == 0) || (p_rcvd_rec->subscriber_enum == 0)) + { + cl_plock_acquire( p_rcv->p_lock ); + + cl_qlist_apply_func( &p_rcv->p_subn->sa_infr_list, + __osm_sa_inform_info_rec_by_comp_mask_cb, + &context ); + + cl_plock_release( p_rcv->p_lock ); + } + else + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_infr_rcv_process_get_method: " + "Non-zero Enum is not currently supported\n" ); + } + + num_rec = cl_qlist_count( &rec_list ); + + /* + * C15-0.1.30: + * If we do a SubnAdmGet and got more than one record it is an error ! + */ + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_get_method: ERR 430A: " + "More than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); + + /* need to set the mem free ... */ + p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_iir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } + } + + pre_trim_num_rec = num_rec; +#ifndef VENDOR_RMPP_SUPPORT + /* we limit the number of records to a single packet */ + trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_inform_info_record_t); + if (trim_num_rec < num_rec) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "osm_infr_rcv_process_get_method: " + "Number of records:%u trimmed to:%u to fit in one MAD\n", + num_rec, trim_num_rec ); + num_rec = trim_num_rec; + } +#endif + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_infr_rcv_process_get_method: " + "Returning %u records\n", num_rec ); + + /* + * Get a MAD to reply. 
Address of Mad is in the received mad_wrapper + */ + p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, + p_madw->h_bind, + num_rec * sizeof(ib_inform_info_record_t) + IB_SA_MAD_HDR_SIZE, + &p_madw->mad_addr ); + + if( !p_resp_madw ) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_get_method: ERR 430B: " + "osm_mad_pool_get failed\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list ); + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RESOURCES ); + + goto Exit; + } + + p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw ); + + /* + Copy the MAD header back into the response mad. + Set the 'R' bit and the payload length, + Then copy all records from the list into the response payload. + */ + + memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE ); + p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK; + /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */ + p_resp_sa_mad->sm_key = 0; + /* Fill in the offset (paylen will be done by the rmpp SAR) */ + p_resp_sa_mad->attr_offset = + ib_get_attr_offset( sizeof(ib_inform_info_record_t) ); + + p_resp_rec = (ib_inform_info_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); + +#ifndef VENDOR_RMPP_SUPPORT + /* we support only one packet RMPP - so we will set the first and + last flags for gettable */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + { + p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA; + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE; + } +#else + /* forcefully define the packet as RMPP one */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; +#endif + + for( i = 0; i < pre_trim_num_rec; i++ ) + { + p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list ); + /* copy only if not trimmed */ + if (i < num_rec) + { + *p_resp_rec 
= p_rec_item->rec; + /* clear reserved and pad fields in InformInfoRecord */ + for (j = 0; j < 6; j++) + p_resp_rec->reserved[j] = 0; + for (j = 0; j < 4; j++) + p_resp_rec->pad[j] = 0; + } + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_resp_rec++; + } + + CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); + + status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + if (status != IB_SUCCESS) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_get_method: ERR 430C: " + "osm_vendor_send status = %s\n", + ib_get_err_str(status)); + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************* Received a Set(InformInfo) MAD **********************************************************************/ static void @@ -395,6 +753,12 @@ osm_infr_rcv_process_set_method( osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); goto Exit; } +osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_set_method: " + "LID 0x%04X GID 0x%016" PRIx64 " : 0x%016" PRIx64"\n", + cl_ntoh16(p_madw->mad_addr.dest_lid), + cl_ntoh64(inform_info_rec.inform_record.subscriber_gid.unicast.prefix), + cl_ntoh64(inform_info_rec.inform_record.subscriber_gid.unicast.interface_id)); /* * MODIFICATIONS DONE ON INCOMING REQUEST: @@ -472,7 +836,6 @@ osm_infr_rcv_process_set_method( /* Add this new osm_infr_t object to subnet object */ osm_infr_insert_to_db( p_rcv->p_subn, p_rcv->p_log, p_infr ); - } else { @@ -513,6 +876,8 @@ osm_infr_rcv_process_set_method( OSM_LOG_EXIT( p_rcv->p_log ); } +/********************************************************************* +**********************************************************************/ void osm_infr_rcv_process( IN osm_infr_rcv_t* const p_rcv, @@ -543,3 +908,37 @@ osm_infr_rcv_process( Exit: OSM_LOG_EXIT( p_rcv->p_log ); } + +/********************************************************************* 
+**********************************************************************/ +void +osm_infir_rcv_process( + IN osm_infr_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) +{ + ib_sa_mad_t *p_sa_mad; + + OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process ); + + CL_ASSERT( p_madw ); + + p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw ); + + CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_INFORM_INFO_RECORD ); + + if ( (p_sa_mad->method != IB_MAD_METHOD_GET) && + (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_infir_rcv_process: " + "Unsupported Method (%s)\n", + ib_get_sa_method_str( p_sa_mad->method ) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR ); + goto Exit; + } + + osm_infr_rcv_process_get_method( p_rcv, p_madw ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} diff --git a/osm/opensm/osm_sa_informinfo_ctrl.c b/osm/opensm/osm_sa_informinfo_ctrl.c index 76fc402..1637155 100644 --- a/osm/opensm/osm_sa_informinfo_ctrl.c +++ b/osm/opensm/osm_sa_informinfo_ctrl.c @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Implementation of osm_infr_rcv_ctrl_t. 
@@ -68,12 +67,25 @@ __osm_infr_rcv_ctrl_disp_callback( /********************************************************************** **********************************************************************/ +static void +__osm_infir_rcv_ctrl_disp_callback( + IN void *context, + IN void *p_data ) +{ + /* ignore return status when invoked via the dispatcher */ + osm_infir_rcv_process( ((osm_infr_rcv_ctrl_t*)context)->p_rcv, + (osm_madw_t*)p_data ); +} + +/********************************************************************** + **********************************************************************/ void osm_infr_rcv_ctrl_construct( IN osm_infr_rcv_ctrl_t* const p_ctrl ) { memset( p_ctrl, 0, sizeof(*p_ctrl) ); p_ctrl->h_disp = CL_DISP_INVALID_HANDLE; + p_ctrl->h_disp2 = CL_DISP_INVALID_HANDLE; } /********************************************************************** @@ -83,6 +95,7 @@ osm_infr_rcv_ctrl_destroy( IN osm_infr_rcv_ctrl_t* const p_ctrl ) { CL_ASSERT( p_ctrl ); + cl_disp_unregister( p_ctrl->h_disp2 ); cl_disp_unregister( p_ctrl->h_disp ); } @@ -119,6 +132,22 @@ osm_infr_rcv_ctrl_init( goto Exit; } + p_ctrl->h_disp2 = cl_disp_register( + p_disp, + OSM_MSG_MAD_INFORM_INFO_RECORD, + __osm_infir_rcv_ctrl_disp_callback, + p_ctrl ); + + if( p_ctrl->h_disp2 == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_infr_rcv_ctrl_init: ERR 1702: " + "Dispatcher registration failed\n" ); + cl_disp_unregister( p_ctrl->h_disp ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + Exit: OSM_LOG_EXIT( p_log ); return( status ); diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c index 56386b1..2605fbf 100644 --- a/osm/opensm/osm_sa_mad_ctrl.c +++ b/osm/opensm/osm_sa_mad_ctrl.c @@ -208,6 +208,10 @@ __osm_sa_mad_ctrl_process( msg_id = OSM_MSG_MAD_GUIDINFO_RECORD; break; + case IB_MAD_ATTR_INFORM_INFO_RECORD: + msg_id = OSM_MSG_MAD_INFORM_INFO_RECORD; + break; + #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) case 
IB_MAD_ATTR_MULTIPATH_RECORD: msg_id = OSM_MSG_MAD_MULTIPATH_RECORD; From steve.apo at googlemail.com Tue Dec 5 07:11:50 2006 From: steve.apo at googlemail.com (Steven Wooding) Date: Tue, 5 Dec 2006 15:11:50 +0000 Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could be wrong? Message-ID: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> Hi, In my application I keep getting -1 returned by a call to the ib_cm_send_req() function. The cmpost example application works fine, so I can rule out system set-up issues. I could do with a clue as to what the -1 means so that I can hopefully correct my application. Thanks, Steve. From tziporet at dev.mellanox.co.il Tue Dec 5 07:13:09 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 05 Dec 2006 17:13:09 +0200 Subject: [openib-general] OFED release and Sonoma OFA developers workshop Message-ID: <45758C85.4040903@dev.mellanox.co.il> Hi Bill, Since there is no Intel IDF in March 07, and in March we are going to be in the middle of the OFED 1.2 release, I suggest delaying the developers' conference to May. It would also be very good to have the workshop *after* the release, since that will enable us to understand what went well and what needs to be improved in the process. Any thoughts? Tziporet -------------- next part -------------- An HTML attachment was scrubbed...
URL: From swise at opengridcomputing.com Tue Dec 5 07:14:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 09:14:36 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <20061205051657.GB26845@2ka.mipt.ru> Message-ID: <1165331676.16087.29.camel@stevo-desktop> On Mon, 2006-12-04 at 21:27 -0800, Roland Dreier wrote: > > So will each new NIC implement some parts of TCP stack in theirs drivers? > > I hope not. The driver we merged (amso1100) did it completely in FW, > with a separate MAC and IP interface for the RDMA connections. I > think we better understand the Chelsio driver pretty well and think it > over carefully before we merge it. > Chelsio doesn't implement a TCP stack in the driver. Just like Ammasso, it sends messages to the HW to set up connections. It differs from Ammasso in at least 2 ways: 1) Ammasso does the MPA negotiations in FW/HW. Chelsio does it in the RDMA driver. So there is code in the Chelsio driver to handle MPA startup negotiation (the exchange of 2 packets over the TCP connection while it's still in streaming mode). BTW: This code _could_ be moved into the core IWCM if we find it could be used by other rnic devices (don't know yet). 2) Ammasso implements a 100% deep adapter. It does ARP, routing, IP, TCP, and IWARP protocols all in firmware/hw. It had 2 mac addresses simulating 2 ethernet ports: one exclusively for RDMA connections, and one for host stack traffic. Chelsio implements a shallower adapter that only does TCP in HW. ARP, for instance, is handled by the native stack, and the rdma driver uses netevents to maintain arp tables in the HW for use by the offloaded TCP connections. Steve.
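The MPA startup exchange described above (two frames sent as ordinary TCP payload before the connection is switched into RDMA mode) can be sketched roughly as follows. This is a minimal illustration only, not the Chelsio driver's code: the function names mpa_build_request() and mpa_check_reply() are hypothetical, and the key strings, flag bits and header layout are assumptions based on the MPA specification as it was later published (RFC 5044); the -03 draft discussed in this thread may differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed MPA startup frame layout (per RFC 5044, not taken from the driver):
 *   bytes 0..15  key: "MPA ID Req Frame" or "MPA ID Rep Frame"
 *   byte  16     flags: M (markers), C (CRC), R (reject), 5 reserved bits
 *   byte  17     revision
 *   bytes 18..19 private data length, big-endian
 */
#define MPA_KEY_LEN      16
#define MPA_HDR_LEN      (MPA_KEY_LEN + 4)
#define MPA_FLAG_MARKERS 0x80
#define MPA_FLAG_CRC     0x40
#define MPA_FLAG_REJECT  0x20

static const char mpa_req_key[MPA_KEY_LEN + 1] = "MPA ID Req Frame";
static const char mpa_rep_key[MPA_KEY_LEN + 1] = "MPA ID Rep Frame";

enum mpa_verdict {
    MPA_ACCEPTED,   /* peer agreed: connection may move to RDMA mode */
    MPA_REJECTED,   /* valid reply frame with the reject bit set */
    MPA_NOT_MPA     /* not an MPA reply at all: abort the connection */
};

/* Hypothetical helper: write an MPA request header with no private data.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t mpa_build_request(uint8_t *buf, size_t len,
                                uint8_t flags, uint8_t rev)
{
    assert(buf != NULL);
    if (len < MPA_HDR_LEN)
        return 0;
    memcpy(buf, mpa_req_key, MPA_KEY_LEN);
    buf[16] = flags;
    buf[17] = rev;
    buf[18] = 0;    /* private data length, high byte */
    buf[19] = 0;    /* private data length, low byte */
    return MPA_HDR_LEN;
}

/* Hypothetical helper: classify the first frame received on the
 * active side after the request has been sent. */
static enum mpa_verdict mpa_check_reply(const uint8_t *buf, size_t len,
                                        uint8_t rev)
{
    assert(buf != NULL);
    if (len < MPA_HDR_LEN || memcmp(buf, mpa_rep_key, MPA_KEY_LEN) != 0)
        return MPA_NOT_MPA;
    if (buf[17] != rev)
        return MPA_NOT_MPA;
    if (buf[16] & MPA_FLAG_REJECT)
        return MPA_REJECTED;
    return MPA_ACCEPTED;
}
```

In this sketch the active side would send the 20-byte request, read the peer's first frame, and only ask the hardware to move the connection into RDMA mode when mpa_check_reply() returns MPA_ACCEPTED; anything else would lead to the connection being torn down.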
From johnpol at 2ka.mipt.ru Tue Dec 5 07:19:06 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 18:19:06 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165330925.16087.13.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> Message-ID: <20061205151905.GA18275@2ka.mipt.ru> On Tue, Dec 05, 2006 at 09:02:05AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > > This and a lot of other changes in this driver definitely says you > > > > implement your own stack of protocols on top of infiniband hardware. > > > > > > ...but I do know this driver is for 10-gig ethernet HW. > > > > It is for iwarp/rdma from description. > > If it is 10ge, then why does it parse incomping packet headers and > > implements initial tcp state machine? > > > > Its not implementing the TCP state machine at all. Its implementing the > MPA state machine (see the iWARP internet drafts). These packets are > TCP payload. MPA is used to negotiate RDMA mode on a TCP connection. > This entails an exchange of 2 messages on the TCP connection. Once this > is exchanged and both side agree, the connection is bound to an RDMA QP > and the connection moved into RDMA mode. From that point on, all IO is > done via the post_send() and post_recv(). And why does rdma require window scaling, keep alive, nagle and other interesting options from TCP spec? This really looks like initial implementation of TCP in hardware - you setup flags like doing the same using setsockopt() and then hardware manages the flow like network stack manages TCP state machine changes. According to draft-culley-iwarp-mpa-03.txt this layer can do a lot of things with valid TCP flow like 5. The TCP sender puts the FPDUs into the TCP stream. 
If the TCP Sender is MPA-aware, it segments the TCP stream in such a way that a TCP Segment boundary is also the boundary of an FPDU. TCP then passes each segment to the IP layer for transmission. Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying that hardware (even if it is called ethernet driver) can create and work with own TCP flows potentially modified in the way it likes which is seen in driver. Likely such flows will not be seen by upper layers like OS network stack according to hardware descriptions. Is it correct? > Steve. -- Evgeniy Polyakov From johnpol at 2ka.mipt.ru Tue Dec 5 07:27:36 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 18:27:36 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165331676.16087.29.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <20061205051657.GB26845@2ka.mipt.ru> <1165331676.16087.29.camel@stevo-desktop> Message-ID: <20061205152736.GA2274@2ka.mipt.ru> On Tue, Dec 05, 2006 at 09:14:36AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > Chelsio doesn't implement TCP stack in the driver. Just like Ammasso, > it sends messages to the HW to setup connections. It differs from > Ammasso in at least 2 ways: > > 1) Ammasso does the MPA negotiations in FW/HW. Chelsio does it in the > RDMA driver. So there is code in the Chelsio driver to handle MPA > startup negotiation (the exchange of 2 packets over the TCP connection > while its still in streaming more). BTW: This code _could_ be moved > into the core IWCM if we find it could be used by other rnic devices > (don't know yet). > > 2) Ammasso implments a 100% deep adapter. It does ARP, routing, IP, > TCP, and IWARP protocols all in firmware/hw. It had 2 mac addresses > simulating 2 ethernet ports. 
One exclusively for RDMA connections, and > one for host stack traffic. Chelsio implements a shallower adapter that > only does TCP in HW. ARP, for instance, is handled by the native stack > and the rdma driver uses netevents to maintain arp tables in the HW for > use by the offloaded TCP connections. So breifly saying - there is TCP stack implementation (including ARP and routing and other parts) in hardware/firmware/driver which is guaranteed to not be visible to host other than in form of high-level dataflow. Am I right here? > Steve. > > > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Evgeniy Polyakov From swise at opengridcomputing.com Tue Dec 5 07:39:58 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 09:39:58 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205151905.GA18275@2ka.mipt.ru> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> Message-ID: <1165333198.16087.53.camel@stevo-desktop> On Tue, 2006-12-05 at 18:19 +0300, Evgeniy Polyakov wrote: > On Tue, Dec 05, 2006 at 09:02:05AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > > > This and a lot of other changes in this driver definitely says you > > > > > implement your own stack of protocols on top of infiniband hardware. > > > > > > > > ...but I do know this driver is for 10-gig ethernet HW. > > > > > > It is for iwarp/rdma from description. > > > If it is 10ge, then why does it parse incomping packet headers and > > > implements initial tcp state machine? > > > > > > > Its not implementing the TCP state machine at all. 
Its implementing the > > MPA state machine (see the iWARP internet drafts). These packets are > > TCP payload. MPA is used to negotiate RDMA mode on a TCP connection. > > This entails an exchange of 2 messages on the TCP connection. Once this > > is exchanged and both side agree, the connection is bound to an RDMA QP > > and the connection moved into RDMA mode. From that point on, all IO is > > done via the post_send() and post_recv(). > > And why does rdma require window scaling, keep alive, nagle and other > interesting options from TCP spec? > The connection setup messages sent to the hardware need to have these parameters so the TCP engine on the HW knows how to do connection options, windows, etc. > This really looks like initial implementation of TCP in hardware - you > setup flags like doing the same using setsockopt() and then hardware > manages the flow like network stack manages TCP state machine changes. > > According to draft-culley-iwarp-mpa-03.txt this layer can do a lot of > things with valid TCP flow like > > 5. The TCP sender puts the FPDUs into the TCP stream. If the TCP > Sender is MPA-aware, it segments the TCP stream in such a way > that a TCP Segment boundary is also the boundary of an FPDU. > TCP then passes each segment to the IP layer for transmission. > > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying > that hardware (even if it is called ethernet driver) can create and work > with own TCP flows potentially modified in the way it likes which is seen > in driver. Likely such flows will not be seen by upper layers like OS > network stack according to hardware descriptions. > > Is it correct? > I don't quite get your point about the driver aspect of this? The HW manages the iWARP connection including data flow. It adheres to the MPA, RDDP, and RDMAP protocol specification IDs from the IETF. The HW manages how data gets pushed out in the RDMA stream. 
The RDMA Driver just requests a TCP connection and does the MPA exchange. Then tells the hardware to move the connection into RDMA mode. From that point on, the driver simply suffles IO work requests from the consumer application to the hardware and handles asynchronous events while the connection is up and running. Steve. From johann.george at qlogic.com Tue Dec 5 07:43:50 2006 From: johann.george at qlogic.com (Johann George) Date: Tue, 5 Dec 2006 07:43:50 -0800 Subject: [openib-general] .openfabrics.org names In-Reply-To: <45755DF1.5080208@dev.mellanox.co.il> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> <45755DF1.5080208@dev.mellanox.co.il> Message-ID: <20061205154350.GA11109@cuprite.pathscale.com> > Who controls the DNS for openfabrics.org? At the moment, I believe that Intel does. > Could we get these names created? Could you send me a list of the names you would like created and I will try to initiate the process. Johann From swise at opengridcomputing.com Tue Dec 5 07:46:18 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 09:46:18 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205152736.GA2274@2ka.mipt.ru> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <20061205051657.GB26845@2ka.mipt.ru> <1165331676.16087.29.camel@stevo-desktop> <20061205152736.GA2274@2ka.mipt.ru> Message-ID: <1165333578.16087.60.camel@stevo-desktop> On Tue, 2006-12-05 at 18:27 +0300, Evgeniy Polyakov wrote: > On Tue, Dec 05, 2006 at 09:14:36AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > Chelsio doesn't implement TCP stack in the driver. Just like Ammasso, > > it sends messages to the HW to setup connections. 
It differs from > > Ammasso in at least 2 ways: > > > > 1) Ammasso does the MPA negotiations in FW/HW. Chelsio does it in the > > RDMA driver. So there is code in the Chelsio driver to handle MPA > > startup negotiation (the exchange of 2 packets over the TCP connection > > while its still in streaming more). BTW: This code _could_ be moved > > into the core IWCM if we find it could be used by other rnic devices > > (don't know yet). > > > > 2) Ammasso implments a 100% deep adapter. It does ARP, routing, IP, > > TCP, and IWARP protocols all in firmware/hw. It had 2 mac addresses > > simulating 2 ethernet ports. One exclusively for RDMA connections, and > > one for host stack traffic. Chelsio implements a shallower adapter that > > only does TCP in HW. ARP, for instance, is handled by the native stack > > and the rdma driver uses netevents to maintain arp tables in the HW for > > use by the offloaded TCP connections. > > So breifly saying - there is TCP stack implementation (including ARP and > routing and other parts) in hardware/firmware/driver which is guaranteed > to not be visible to host other than in form of high-level dataflow. > Am I right here? For Ammasso, yes. From bugzilla-daemon at openib.org Tue Dec 5 07:47:38 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 5 Dec 2006 07:47:38 -0800 (PST) Subject: [openib-general] [Bug 308] New: IPOIB HA Failed - ping does not reach to destination Message-ID: <20061205154738.DB4A82283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=308 Summary: IPOIB HA Failed - ping does not reach to destination Product: OpenFabrics Linux Version: gen2 Platform: Other OS/Version: RHEL 4 Status: NEW Severity: blocker Priority: P2 Component: IPoIB AssignedTo: bugzilla at openib.org ReportedBy: yohadd at mellanox.co.il IPOIB HA Failed - ping does not reach to destination. Failure flow: 1) set HA up on host1. primary=ib0, secondary=ib1. 2) run opensm on host2. 
3) run ping to the ip that associated with host1 ib0. - ping succeed. 4) set the port that associated with ib0 on host1 down. - ping starts to fail. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From jsquyres at cisco.com Tue Dec 5 07:57:03 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 5 Dec 2006 10:57:03 -0500 Subject: [openib-general] .openfabrics.org names In-Reply-To: <20061205154350.GA11109@cuprite.pathscale.com> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> <45755DF1.5080208@dev.mellanox.co.il> <20061205154350.GA11109@cuprite.pathscale.com> Message-ID: <1C4EC796-9CD7-4962-BC4E-F76B5443E624@cisco.com> How about the following: git.openfabrics.org wiki.openfabrics.org trac.openfabrics.org ssh.openfabrics.org I'm assuming that these can all be CNAMEs to the main name. (Since Intel is maintaining this, should we be bugging someone else instead of you?) On Dec 5, 2006, at 10:43 AM, Johann George wrote: >> Who controls the DNS for openfabrics.org? > > At the moment, I believe that Intel does. > >> Could we get these names created? > > Could you send me a list of the names you would like created and I > will try to initiate the process. 
> > Johann -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From swise at opengridcomputing.com Tue Dec 5 08:02:09 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 10:02:09 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <45754DE3.1020505@ens-lyon.org> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <1165249251.32724.26.camel@stevo-desktop> <45754DE3.1020505@ens-lyon.org> Message-ID: <1165334529.16087.69.camel@stevo-desktop> On Tue, 2006-12-05 at 11:45 +0100, Brice Goglin wrote: > Steve Wise wrote: > > There is no SW TCP stack in this driver. The HW supports RDMA over > > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet > > (aka iWARP). The device is a 10 GbE device, not Infiniband. > > Then, I wonder why the driver goes in drivers/infiniband/ :) drivers/infiniband support both IB and IWARP transports. > Is there really no way to only keep the actual hw infiniband there, move > iwarp/rdma drivers in drivers/net/something/ and the core stuff in > net/something/ ? > Sure, this _could_ be done, but what I think you're missing is that applications use the interface exported by drivers/infiniband over both IB -and- IWARP transports. The application can be written to not care which transport is used. Examples of apps that can run over both transports using the same common interface: user mode: MVAPICH2, OMPI, IMPI, HPMPI, kernel mode: NFS-RDMA, iSER. Note that the include directory used by drivers/infiniband is now include/rdma. Perhaps drivers/infiniband should be renamed to drivers/rdma as well at some point... Steve. 
From johnpol at 2ka.mipt.ru Tue Dec 5 07:59:32 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 18:59:32 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165333198.16087.53.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> Message-ID: <20061205155932.GA32380@2ka.mipt.ru> On Tue, Dec 05, 2006 at 09:39:58AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying > > that hardware (even if it is called ethernet driver) can create and work > > with own TCP flows potentially modified in the way it likes which is seen > > in driver. Likely such flows will not be seen by upper layers like OS > > network stack according to hardware descriptions. > > > > Is it correct? > > > > I don't quite get your point about the driver aspect of this? > > The HW manages the iWARP connection including data flow. It adheres to > the MPA, RDDP, and RDMAP protocol specification IDs from the IETF. The > HW manages how data gets pushed out in the RDMA stream. The RDMA > Driver just requests a TCP connection and does the MPA exchange. Then > tells the hardware to move the connection into RDMA mode. From that > point on, the driver simply suffles IO work requests from the consumer > application to the hardware and handles asynchronous events while the > connection is up and running. 
My main concern about this is the fact that protocol handling is split into SW and HW parts, and until negotiation is completed those parts are completely unrelated to each other, so the requested TCP connection can leak into the main stack, and the main stack can send some packets which could be mistaken for MPA negotiation. > Steve. -- Evgeniy Polyakov From johann.george at qlogic.com Tue Dec 5 08:07:52 2006 From: johann.george at qlogic.com (Johann George) Date: Tue, 5 Dec 2006 08:07:52 -0800 Subject: [openib-general] .openfabrics.org names In-Reply-To: <1C4EC796-9CD7-4962-BC4E-F76B5443E624@cisco.com> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> <45755DF1.5080208@dev.mellanox.co.il> <20061205154350.GA11109@cuprite.pathscale.com> <1C4EC796-9CD7-4962-BC4E-F76B5443E624@cisco.com> Message-ID: <20061205160752.GA11809@cuprite.pathscale.com> > git.openfabrics.org > wiki.openfabrics.org > trac.openfabrics.org > ssh.openfabrics.org Sounds good. > (Since Intel is maintaining this, should we be bugging someone else > instead of you?) Ideally, yes; but I will be happy to initiate it. Also, we should probably move control of the domain name to OpenFabrics.
Johann From swise at opengridcomputing.com Tue Dec 5 08:12:42 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 10:12:42 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205155932.GA32380@2ka.mipt.ru> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> Message-ID: <1165335162.16087.79.camel@stevo-desktop> On Tue, 2006-12-05 at 18:59 +0300, Evgeniy Polyakov wrote: > On Tue, Dec 05, 2006 at 09:39:58AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying > > > that hardware (even if it is called ethernet driver) can create and work > > > with own TCP flows potentially modified in the way it likes which is seen > > > in driver. Likely such flows will not be seen by upper layers like OS > > > network stack according to hardware descriptions. > > > > > > Is it correct? > > > > > > > I don't quite get your point about the driver aspect of this? > > > > The HW manages the iWARP connection including data flow. It adheres to > > the MPA, RDDP, and RDMAP protocol specification IDs from the IETF. The > > HW manages how data gets pushed out in the RDMA stream. The RDMA > > Driver just requests a TCP connection and does the MPA exchange. Then > > tells the hardware to move the connection into RDMA mode. From that > > point on, the driver simply suffles IO work requests from the consumer > > application to the hardware and handles asynchronous events while the > > connection is up and running. 
> > My main concern about this is the fact, that protocol handling is > splitted into SF and HW parts, and actually until negotiation is > completed those parts are completely unrelated to each other, so > requested TCP connection can leak into main stack and main stack can > send some packets which can be considered as MPA negotiation. > Ah. Data from an offloaded connection cannot leak into the main stack, nor vice versa. We can take an active RDMA connection establishment as an example if you want: Once the message is sent to the HW to "setup a TCP connection from addr/port a.b to addr/port c.d", then packets on that connection (that 4-tuple) will always be delivered to the RDMA driver, not the native stack. If the packet received after the connection is set up is -not- an MPA reply (in this example), then the connection is aborted. Once the connection is aborted. So no leaking can happen. From tziporet at dev.mellanox.co.il Tue Dec 5 08:17:16 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 05 Dec 2006 18:17:16 +0200 Subject: [openib-general] OFED 1.2 features update Message-ID: <45759B8C.8010408@dev.mellanox.co.il> Hi, In the OFED meeting yesterday the following decisions were taken: 1. We agreed to have two types of features * Must have features - will delay the release if not ready * Desirable features - will be included only if they are ready on time according to OFED requirements. 2. The following features are added to OFED 1.2 as desirable features: 1. iWARP - someone from an iWARP company should be the owner 2. VNIC - Madhue The OFED 1.2 plan was updated on the Wiki: https://openib.org/tiki/tiki-index.php?page=OFED+1.2+release+plan+and+features 3. NFSoverRDMA: Will probably not be part of OFED 1.2 since it requires kernel patches. Tom Tucker will prepare a package that will be installed over OFED 1.2 4. Sean should prepare patches or a git tree for kernel code that is not upstream (e.g. SA cache) 5. Hal will take care of git-commit mails 6.
Tziporet should send explanation on OFED inclusion requirements (backport patches, install scripts, etc.) Tziporet From swise at opengridcomputing.com Tue Dec 5 08:17:43 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 10:17:43 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165335162.16087.79.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> Message-ID: <1165335463.16087.83.camel@stevo-desktop> On Tue, 2006-12-05 at 10:12 -0600, Steve Wise wrote: > On Tue, 2006-12-05 at 18:59 +0300, Evgeniy Polyakov wrote: > > On Tue, Dec 05, 2006 at 09:39:58AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying > > > > that hardware (even if it is called ethernet driver) can create and work > > > > with own TCP flows potentially modified in the way it likes which is seen > > > > in driver. Likely such flows will not be seen by upper layers like OS > > > > network stack according to hardware descriptions. > > > > > > > > Is it correct? > > > > > > > > > > I don't quite get your point about the driver aspect of this? > > > > > > The HW manages the iWARP connection including data flow. It adheres to > > > the MPA, RDDP, and RDMAP protocol specification IDs from the IETF. The > > > HW manages how data gets pushed out in the RDMA stream. The RDMA > > > Driver just requests a TCP connection and does the MPA exchange. Then > > > tells the hardware to move the connection into RDMA mode. 
From that > > > point on, the driver simply suffles IO work requests from the consumer > > > application to the hardware and handles asynchronous events while the > > > connection is up and running. > > > > My main concern about this is the fact, that protocol handling is > > splitted into SF and HW parts, and actually until negotiation is > > completed those parts are completely unrelated to each other, so > > requested TCP connection can leak into main stack and main stack can > > send some packets which can be considered as MPA negotiation. > > > > Ah. Data from an offloaded connection cannot leak into the main stack > nor vice-verse. We can take an active RDMA connection establishment as > an example if you want: Once the message is sent to the HW to "setup a > TCP connection from addr/port a.b to addr/port c.d", then packets on > that connection (that 4-tuple) will always be delivered to the RDMA > driver, not the native stack. If the the packet received after the > connection is setup is -not- an MPA reply (in this example), then the > connection is aborted. Once the connection is aborted. ^ the 4 tuple can then be reused for rdma or native stack tcp connections. From mst at mellanox.co.il Tue Dec 5 08:19:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 5 Dec 2006 18:19:44 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: <20061129140016.GO5061@mellanox.co.il> References: <20061129140016.GO5061@mellanox.co.il> Message-ID: <20061205161944.GD30209@mellanox.co.il> The following patch adds experimental support for IPoIB connected mode. The idea is to increase performance by increasing the MTU from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD. With this code, I'm able to get 800MByte/sec or more with netperf without options on a Mellanox 4x back-to-back DDR system. Please review. 
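The MTU arithmetic behind this, and the way the patch further down decides whether a neighbour should use connected mode, can be sketched in plain user-space C as follows. The values of IPOIB_CM_MTU, IPOIB_CM_BUF_SIZE, IPOIB_ENCAP_LEN and IPOIB_FLAGS_RC are copied from the ipoib.h hunk of the patch; IPOIB_PACKET_SIZE = 2048 is an assumption about the existing UD definition (the "2K" mentioned above), and the helper names ipoib_qpn() and cm_enabled() are illustrative stand-ins for the IPOIB_QPN() macro and ipoib_cm_enabled(), not the kernel code itself.

```c
#include <assert.h>
#include <stdint.h>

enum {
    IPOIB_PACKET_SIZE = 2048,                 /* assumption: existing UD packet size */
    IPOIB_ENCAP_LEN   = 4,                    /* IPoIB encapsulation header */
    IPOIB_CM_MTU      = 0x10000 - 0x10,       /* 64K minus 16 bytes of padding */
    IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN
};

#define IPOIB_FLAGS_RC 0x80  /* set in byte 0 of the hardware address */

/* The first four bytes of the hardware address hold one flags byte
 * followed by a 24-bit QP number in network byte order. */
static uint32_t ipoib_qpn(const uint8_t *ha)
{
    uint32_t word = ((uint32_t)ha[0] << 24) | ((uint32_t)ha[1] << 16) |
                    ((uint32_t)ha[2] << 8)  |  (uint32_t)ha[3];
    return word & 0xffffff;  /* drop the flags byte */
}

/* Same heuristic as the patch: the peer advertises RC support and the
 * local interface MTU exceeds what a UD packet can carry. */
static int cm_enabled(const uint8_t *ha, unsigned mtu)
{
    return (ha[0] & IPOIB_FLAGS_RC) &&
           mtu > (unsigned)(IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN);
}
```

Under these assumptions IPOIB_CM_MTU works out to 65520, which is the value passed to ifconfig in the activation steps, and an interface left at the UD default of 2044 never takes the connected-mode path.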
I labeled CM support as experimental, although its been very stable for me, mostly because there are still some things to be addressed before it's as usable as IPoIB UD. I am very interested in getting this code in shape for merging as early as possible, as opposed to maintaining it out of tree until it's fully mature, and I tried to split the CM code in a separate file to make this feasible. Let me know whether this was a good idea, or whether more needs to be done in this direction. Note that the connected mode support adds very little overhead when not activated at run time, and zero data-path overhead when not activated at compile time. Here's a short description of what the patch does: a. The code's here: git://staging.openfabrics.org/~mst/linux-2.6/.git ipoib_cm_branch This is based on 2.6.19, so ~>git diff v2.6.19..ipoib_cm_branch will show what I have done so far. b. How to activate: Server: #modprobe ib_ipoib #/sbin/ifconfig ib0 mtu 65520 #./netperf-2.4.2/src/netserver Client: #modprobe ib_ipoib #/sbin/ifconfig ib0 mtu 65520 #./netperf-2.4.2/src/netperf -H 11.4.3.68 -f M TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. MBytes/sec 87380 16384 16384 10.01 891.21 c. TODO list 1. Clean up stale connections 4. (Optional) S/G support 5. (Optional) Make CM use same CQ IPoIB uses for UD d. Limitations UDP multicast and UDP connections to IPoIB UD mode currently don't work since we get packets that are too large to send over a UD QP. As a work around, one can now create separate interfaces for use with CM and UD mode. e. Some notes on code 1. SRQ is used for scalability to large cluster sizes 2. Only RC connections are used (UC does not support SRQ now) 3. Retry count is set to 0 since spec draft warns against retries 4. 
Each connection is used for data transfers in only 1 direction, so each connection is either active(TX) or passive (RX). 2 sides that want to communicate create 2 connections. 5. Each active (TX) connection has a separate CQ for send completions - this keeps the code simple without CQ resize and other tricks I'm looking at ways to limit the path mtu for these connections, to make it work. Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig index c75322d..7aa3a25 100644 --- a/drivers/infiniband/ulp/ipoib/Kconfig +++ b/drivers/infiniband/ulp/ipoib/Kconfig @@ -8,6 +8,15 @@ config INFINIBAND_IPOIB See Documentation/infiniband/ipoib.txt for more information +config INFINIBAND_IPOIB_CM + bool "IP-over-InfiniBand Connected Mode support" + depends on INFINIBAND_IPOIB && EXPERIMENTAL + default n + ---help--- + This option enables experimental support for IPoIB connected mode. + After enabling this option, you need to increase the interface MTU + with e.g. ifconfig ib0 mtu 65520 to actually create connections. 
+ config INFINIBAND_IPOIB_DEBUG bool "IP-over-InfiniBand debugging" if EMBEDDED depends on INFINIBAND_IPOIB diff --git a/drivers/infiniband/ulp/ipoib/Makefile b/drivers/infiniband/ulp/ipoib/Makefile index 8935e74..f01a24b 100644 --- a/drivers/infiniband/ulp/ipoib/Makefile +++ b/drivers/infiniband/ulp/ipoib/Makefile @@ -6,4 +6,5 @@ ib_ipoib-y := ipoib_main.o \ ipoib_verbs.o \ ipoib_vlan.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM) += ipoib_cm.o diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 0b8a79d..545cdae 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -62,6 +62,9 @@ enum { IPOIB_ENCAP_LEN = 4, + IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */ + IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, + IPOIB_RX_RING_SIZE = 128, IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, @@ -81,6 +84,7 @@ enum { IPOIB_MCAST_RUN = 6, IPOIB_STOP_REAPER = 7, IPOIB_MCAST_STARTED = 8, + IPOIB_FLAG_NETIF_STOPPED = 9, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -113,6 +117,49 @@ struct ipoib_tx_buf { DECLARE_PCI_UNMAP_ADDR(mapping) }; +struct ib_cm_id; + +struct ipoib_cm_data { + __be32 qpn; /* High byte MUST be ignored on receive */ + __be32 mtu; +}; + +struct ipoib_cm_rx { + struct ib_cm_id *id; + struct ib_qp *qp; + struct list_head list; + struct net_device *dev; +}; + +struct ipoib_cm_tx { + struct ib_cm_id *id; + struct ib_cq *cq; + struct ib_qp *qp; + struct list_head list; + struct net_device *dev; + struct ipoib_neigh *neigh; + struct ipoib_path *path; + struct ipoib_tx_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + unsigned long flags; + u32 mtu; + struct ib_wc ibwc[IPOIB_NUM_WC]; +}; + +struct ipoib_cm_dev_priv { + struct ib_cq *cq; + struct ib_srq *srq; + struct ipoib_rx_buf *srq_ring; + struct ib_cm_id *id; + struct list_head passive_ids; + struct work_struct start_task; + struct work_struct reap_task; + struct
list_head start_list; + struct list_head reap_list; + struct ib_wc ibwc[IPOIB_NUM_WC]; +}; + /* * Device private locking: tx_lock protects members used in TX fast * path (and we use LLTX so upper layers don't do extra locking). @@ -179,6 +226,8 @@ struct ipoib_dev_priv { struct list_head child_intfs; struct list_head list; + struct ipoib_cm_dev_priv cm; + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct list_head fs_list; struct dentry *mcg_dentry; @@ -212,6 +261,7 @@ struct ipoib_path { struct ipoib_neigh { struct ipoib_ah *ah; + struct ipoib_cm_tx *cm; union ib_gid dgid; struct sk_buff_head queue; @@ -315,6 +365,93 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); void ipoib_pkey_poll(void *dev); int ipoib_pkey_dev_delay_open(struct net_device *dev); +#ifdef CONFIG_INFINIBAND_IPOIB_CM + +#define IPOIB_FLAGS_RC 0x80 +#define IPOIB_FLAGS_UC 0x40 + +#define IPOIB_CM_ENABLED(ha) (ha[0] & IPOIB_FLAGS_RC) + +static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n) +{ + /* Simple heuristic: dev->mtu > 2K ==> connected mode */ + return (IPOIB_CM_ENABLED(n->ha) && + dev->mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN); +} + +static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh) +{ + return neigh->cm; +} + +void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx); +int ipoib_cm_dev_open(struct net_device *dev); +void ipoib_cm_dev_stop(struct net_device *dev); +int ipoib_cm_dev_init(struct net_device *dev); +void ipoib_cm_dev_cleanup(struct net_device *dev); +struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path, + struct ipoib_neigh *neigh); +void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx); +#else + +#define IPOIB_CM_ENABLED(ha) (0) + +static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n) + +{ + return 0; +} + +static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh) +{ + return NULL; +} + +static inline +void 
ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) +{ + return; +} + +static inline +int ipoib_cm_dev_open(struct net_device *dev) +{ + return 0; +} + +static inline +void ipoib_cm_dev_stop(struct net_device *dev) +{ + return; +} + +static inline +int ipoib_cm_dev_init(struct net_device *dev) +{ + return 0; +} + +static inline +void ipoib_cm_dev_cleanup(struct net_device *dev) +{ + return; +} + +static inline +struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path, + struct ipoib_neigh *neigh) +{ + return NULL; +} + +static inline +void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) +{ + return; +} + +#endif + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG void ipoib_create_debug_files(struct net_device *dev); void ipoib_delete_debug_files(struct net_device *dev); @@ -392,4 +529,7 @@ extern int ipoib_debug_level; #define IPOIB_GID_ARG(gid) IPOIB_GID_RAW_ARG((gid).raw) +#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) + + #endif /* _IPOIB_H */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c new file mode 100644 index 0000000..a40eb4c --- /dev/null +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -0,0 +1,1043 @@ +/* + * Copyright (c) 2006 Mellanox Technologies. All rights reserved + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include +#include + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +static int data_debug_level; + +module_param_named(cm_data_debug_level, data_debug_level, int, 0644); +MODULE_PARM_DESC(cm_data_debug_level, + "Enable data path debug tracing for connected mode if > 0"); +#endif + +#include "ipoib.h" + +#define IPOIB_CM_IETF_ID 0x1000000000000000ULL + +#define IPOIB_OP_SRQ (1ul << 30) + +struct ipoib_cm_id { + struct ib_cm_id *id; + int flags; + u32 remote_qpn; + u32 remote_mtu; +}; + +int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +static int ipoib_cm_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sge list; + struct ib_recv_wr param; + struct ib_recv_wr *bad_wr; + int ret; + + list.addr = priv->cm.srq_ring[id].mapping; + list.length = IPOIB_CM_BUF_SIZE; + list.lkey = priv->mr->lkey; + + param.next = NULL; + param.wr_id = id | IPOIB_OP_SRQ; + param.sg_list = &list; + param.num_sge = 1; + + ret = ib_post_srq_recv(priv->cm.srq, &param, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); + dma_unmap_single(priv->ca->dma_device, + priv->cm.srq_ring[id].mapping, + IPOIB_CM_BUF_SIZE, DMA_FROM_DEVICE); +
dev_kfree_skb_any(priv->cm.srq_ring[id].skb); + priv->cm.srq_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + dma_addr_t addr; + + skb = dev_alloc_skb(IPOIB_CM_BUF_SIZE + 12); + if (!skb) + return -ENOMEM; + + /* + * IPoIB adds a 4 byte header. So we need 12 more bytes to align the + * IP header to a multiple of 16. + */ + skb_reserve(skb, 12); + + addr = dma_map_single(priv->ca->dma_device, + skb->data, IPOIB_CM_BUF_SIZE, + DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(addr))) { + dev_kfree_skb_any(skb); + return -EIO; + } + + priv->cm.srq_ring[id].skb = skb; + priv->cm.srq_ring[id].mapping = addr; + + return 0; +} + +static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr attr = { + .send_cq = priv->cm.cq, /* does not matter, we never send anything */ + .recv_cq = priv->cm.cq, + .srq = priv->cm.srq, + .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .sq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_RC, + }; + return ib_create_qp(priv->pd, &attr); +} + +static int ipoib_cm_modify_rx_rts(struct net_device *dev, + struct ib_cm_id *cm_id, struct ib_qp *qp) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IB_QPS_INIT; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for INIT: %d\n", ret); + return ret; + } + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to INIT: %d\n", ret); + return ret; + } + qp_attr.qp_state = IB_QPS_RTR; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret); + return 
ret; + } + qp_attr.rq_psn = 0 /* FIXME */; + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); + return ret; + } + return 0; +} + +static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, + struct ib_qp *qp, struct ib_cm_req_event_param *req) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_data data = {}; + struct ib_cm_rep_param rep = {}; + + data.qpn = cpu_to_be32(priv->qp->qp_num); + data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); + + rep.private_data = &data; + rep.private_data_len = sizeof data; + rep.flow_control = 0; + rep.rnr_retry_count = req->rnr_retry_count; + rep.target_ack_delay = 20; /* FIXME */ + rep.srq = 1; + rep.qp_num = qp->qp_num; + rep.starting_psn = 0 /* FIXME */; + return ib_send_cm_rep(cm_id, &rep); +} + +static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_rx *p; + unsigned long flags; + int ret; + + ipoib_dbg(priv, "REQ arrived\n"); + p = kzalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + p->dev = dev; + p->id = cm_id; + p->qp = ipoib_cm_create_rx_qp(dev); + if (IS_ERR(p->qp)) { + ret = PTR_ERR(p->qp); + goto err_qp; + } + + ret = ipoib_cm_modify_rx_rts(dev, cm_id, p->qp); + if (ret) + goto err_modify; + + ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd); + if (ret) { + ipoib_warn(priv, "failed to send REP: %d\n", ret); + goto err_rep; + } + + cm_id->context = p; + spin_lock_irqsave(&priv->lock, flags); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + return 0; + +err_rep: +err_modify: + ib_destroy_qp(p->qp); +err_qp: + kfree(p); + return ret; +} + +int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct ipoib_cm_rx *p; + struct ipoib_dev_priv *priv; + unsigned long flags; + int ret; + + 
switch (event->event) { + case IB_CM_REQ_RECEIVED: + return ipoib_cm_req_handler(cm_id, event); + case IB_CM_DREQ_RECEIVED: + p = cm_id->context; + ib_send_cm_drep(cm_id, NULL, 0); + /* Fall through */ + case IB_CM_REJ_RECEIVED: + p = cm_id->context; + priv = netdev_priv(p->dev); + spin_lock_irqsave(&priv->lock, flags); + if (list_empty(&p->list)) + ret = 0; /* Connection is going away already. */ + else { + list_del(&p->list); + ret = -ECONNRESET; + } + spin_unlock_irqrestore(&priv->lock, flags); + if (ret) { + ib_destroy_qp(p->qp); + kfree(p); + return ret; + } + return 0; + default: + return 0; + } +} + +static void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id & ~IPOIB_OP_SRQ; + struct sk_buff *skb; + dma_addr_t addr; + + ipoib_dbg_data(priv, "cm recv completion: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", + wr_id, ipoib_recvq_size); + return; + } + + skb = priv->cm.srq_ring[wr_id].skb; + addr = priv->cm.srq_ring[wr_id].mapping; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ++priv->stats.rx_dropped; + goto repost; + } + + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. 
+ */ + if (unlikely(ipoib_cm_alloc_rx_skb(dev, wr_id))) { + ++priv->stats.rx_dropped; + goto repost; + } + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + dma_unmap_single(priv->ca->dma_device, addr, + IPOIB_CM_BUF_SIZE, DMA_FROM_DEVICE); + + skb_put(skb, wc->byte_len); + + if (wc->slid != priv->local_lid || + wc->src_qp != priv->qp->qp_num) { + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + } else { + ipoib_dbg_data(priv, "dropping loopback packet\n"); + dev_kfree_skb_any(skb); + } + +repost: + if (unlikely(ipoib_cm_post_receive(dev, wr_id))) + ipoib_warn(priv, "ipoib_cm_post_receive failed " + "for buf %d\n", wr_id); +} + +void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->cm.ibwc); + for (i = 0; i < n; ++i) + ipoib_cm_handle_rx_wc(dev, priv->cm.ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + struct ipoib_cm_tx *tx, + unsigned int wr_id, + dma_addr_t addr, int len) +{ + struct ib_send_wr *bad_wr; + + priv->tx_sge.addr = addr; + priv->tx_sge.length = len; + + priv->tx_wr.wr_id = wr_id; + + return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); +} + +void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_tx_buf *tx_req; + dma_addr_t addr; + + if (unlikely(skb->len > tx->mtu)) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, tx->mtu); + 
++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet %p, head %d length=%d connection=%p\n", + skb, tx->tx_head, skb->len, tx); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + if (unlikely(dma_mapping_error(addr))) { + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), + addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + ++tx->tx_head; + + if (tx->tx_head - tx->tx_tail == ipoib_sendq_size) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags); + } + } +} + +static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + struct ipoib_tx_buf *tx_req; + unsigned long flags; + + ipoib_dbg_data(priv, "cm send completion: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (unlikely(wr_id >= ipoib_sendq_size)) { + ipoib_warn(priv, "cm send completion event with wrid %d (> %d)\n", + wr_id, ipoib_sendq_size); + return; + } + + tx_req = &tx->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + 
pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + /* FIXME: is this right? Shouldn't we only increment on success? */ + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++tx->tx_tail; + if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags) && + tx->tx_head - tx->tx_tail <= ipoib_sendq_size >> 1) { + netif_wake_queue(dev); + } + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) { + struct ipoib_neigh *neigh; + + ipoib_dbg(priv, "failed cm send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + + spin_lock(&priv->lock); + neigh = tx->neigh; + + if (neigh) { + neigh->cm = NULL; + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + + tx->neigh = NULL; + } + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { + list_move(&tx->list, &priv->cm.reap_list); + queue_work(ipoib_workqueue, &priv->cm.reap_task); + } + + clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags); + + spin_unlock(&priv->lock); + } + + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) +{ + struct ipoib_cm_tx *tx = tx_ptr; + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc); + for (i = 0; i < n; ++i) + ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +int ipoib_cm_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int ret; + + if (!IPOIB_CM_ENABLED(dev->dev_addr)) + return 0; + + priv->cm.cq = ib_create_cq(priv->ca, ipoib_cm_rx_completion, NULL, dev, + ipoib_recvq_size + 1); + if (IS_ERR(priv->cm.cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", priv->ca->name); + return PTR_ERR(priv->cm.cq); + } + + ib_req_notify_cq(priv->cm.cq, IB_CQ_NEXT_COMP); + + priv->cm.id = 
ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); + if (IS_ERR(priv->cm.id)) { + printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); + ib_destroy_cq(priv->cm.cq); + return PTR_ERR(priv->cm.id); + } + + ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), + 0, NULL); + if (ret) { + printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, + IPOIB_CM_IETF_ID | priv->qp->qp_num); + ib_destroy_cm_id(priv->cm.id); + ib_destroy_cq(priv->cm.cq); + return ret; + } + return 0; +} + +void ipoib_cm_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_rx *p; + unsigned long flags; + + if (!IPOIB_CM_ENABLED(dev->dev_addr)) + return; + + ib_destroy_cm_id(priv->cm.id); + spin_lock_irqsave(&priv->lock, flags); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + list_del_init(&p->list); + spin_unlock_irqrestore(&priv->lock, flags); + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + kfree(p); + spin_lock_irqsave(&priv->lock, flags); + } + spin_unlock_irqrestore(&priv->lock, flags); + ib_destroy_cq(priv->cm.cq); +} + +static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct ipoib_cm_tx *p = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + struct ipoib_cm_data *data = event->private_data; + struct sk_buff_head skqueue; + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + struct sk_buff *skb; + unsigned long flags; + + p->mtu = be32_to_cpu(data->mtu); + + if (p->mtu < priv->dev->mtu + IPOIB_ENCAP_LEN) { + ipoib_warn(priv, "Rejecting connection: mtu %d < device mtu %d + 4\n", + p->mtu, priv->dev->mtu); + return -EINVAL; + } + + qp_attr.qp_state = IB_QPS_RTR; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret); + return ret; + } + + qp_attr.rq_psn = 0 /* FIXME */; +
ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); + return ret; + } + + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret); + return ret; + } + ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret); + return ret; + } + + skb_queue_head_init(&skqueue); + + spin_lock_irqsave(&priv->lock, flags); + set_bit(IPOIB_FLAG_OPER_UP, &p->flags); + if (p->neigh) + while ((skb = __skb_dequeue(&p->neigh->queue))) + __skb_queue_tail(&skqueue, skb); + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = p->dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } + + ret = ib_send_cm_rtu(cm_id, NULL, 0); + if (ret) { + ipoib_warn(priv, "failed to send RTU: %d\n", ret); + return ret; + } + return 0; +} + +static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr attr = {}; + attr.recv_cq = priv->cm.cq; + attr.srq = priv->cm.srq; + attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_send_sge = 1; + attr.sq_sig_type = IB_SIGNAL_ALL_WR; + attr.qp_type = IB_QPT_RC; + attr.send_cq = cq; + return ib_create_qp(priv->pd, &attr); +} + +static int ipoib_cm_send_req(struct net_device *dev, + struct ib_cm_id *id, struct ib_qp *qp, + u32 qpn, + struct ib_sa_path_rec *pathrec) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_data data = {}; + struct ib_cm_req_param req = {}; + + data.qpn = cpu_to_be32(priv->qp->qp_num); + data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); + + req.primary_path = pathrec; + req.alternate_path = NULL; + req.service_id = cpu_to_be64(IPOIB_CM_IETF_ID | qpn); + req.qp_num = qp->qp_num; + 
req.qp_type = qp->qp_type; + req.private_data = &data; + req.private_data_len = sizeof data; + req.flow_control = 0; + + req.starting_psn = 0; /* FIXME */ + + /* + * Pick some arbitrary defaults here; we could make these + * module parameters if anyone cared about setting them. + */ + req.responder_resources = 4; + req.remote_cm_response_timeout = 20; + req.local_cm_response_timeout = 20; + req.retry_count = 0; /* RFC draft warns against retries */ + req.rnr_retry_count = 0; /* RFC draft warns against retries */ + req.max_cm_retries = 15; + req.srq = 1; + return ib_send_cm_req(id, &req); +} + +static int ipoib_cm_modify_tx_init(struct net_device *dev, + struct ib_cm_id *cm_id, struct ib_qp *qp) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); + if (ret) { + ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret); + return ret; + } + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + qp_attr.port_num = priv->port; + qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT; + + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify tx QP to INIT: %d\n", ret); + return ret; + } + return 0; +} + +int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, struct ib_sa_path_rec *pathrec) +{ + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + int ret; + + ipoib_dbg(priv, "Request connection %p for gid " IPOIB_GID_FMT " qpn 0x%x\n", + p, IPOIB_GID_ARG(pathrec->dgid), qpn); + + p->tx_ring = kzalloc(ipoib_sendq_size * sizeof *p->tx_ring, + GFP_KERNEL); + if (!p->tx_ring) { + ipoib_warn(priv, "failed to allocate tx ring\n"); + ret = -ENOMEM; + goto err_tx; + } + + p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p, + ipoib_sendq_size + 1); + if (IS_ERR(p->cq)) { + ret = PTR_ERR(p->cq); + ipoib_warn(priv, "failed to
allocate tx cq: %d\n", ret); + goto err_cq; + } + + ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP); + if (ret) { + ipoib_warn(priv, "failed to request completion notification: %d\n", ret); + goto err_req_notify; + } + + p->qp = ipoib_cm_create_tx_qp(p->dev, p->cq); + if (IS_ERR(p->qp)) { + ret = PTR_ERR(p->qp); + ipoib_warn(priv, "failed to allocate tx qp: %d\n", ret); + goto err_qp; + } + + p->id = ib_create_cm_id(priv->ca, ipoib_cm_tx_handler, p); + if (IS_ERR(p->id)) { + ret = PTR_ERR(p->id); + ipoib_warn(priv, "failed to create tx cm id: %d\n", ret); + goto err_id; + } + + ret = ipoib_cm_modify_tx_init(p->dev, p->id, p->qp); + if (ret) { + ipoib_warn(priv, "failed to modify tx qp to rtr: %d\n", ret); + goto err_modify; + } + + ret = ipoib_cm_send_req(p->dev, p->id, p->qp, qpn, pathrec); + if (ret) { + ipoib_warn(priv, "failed to send cm req: %d\n", ret); + goto err_send_cm; + } + return 0; + +err_send_cm: +err_modify: + ib_destroy_cm_id(p->id); +err_id: + p->id = NULL; + ib_destroy_qp(p->qp); +err_req_notify: +err_qp: + p->qp = NULL; + ib_destroy_cq(p->cq); +err_cq: + p->cq = NULL; +err_tx: + return ret; +} + +void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) +{ + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + struct ipoib_tx_buf *tx_req; + + ipoib_dbg(priv, "Destroy active connection %p. 
head 0x%x tail 0x%x\n", + p, p->tx_head, p->tx_tail); + + if (p->id) + ib_destroy_cm_id(p->id); + + if (p->qp) + ib_destroy_qp(p->qp); + + if (p->cq) + ib_destroy_cq(p->cq); + + if (p->tx_ring) { + while ((int) p->tx_tail - (int) p->tx_head < 0) { + tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(tx_req->skb); + ++p->tx_tail; + } + + kfree(p->tx_ring); + } + + kfree(p); +} + +int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct ipoib_cm_tx *tx = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(tx->dev); + struct ipoib_neigh *neigh; + unsigned long flags; + int ret; + + switch (event->event) { + case IB_CM_DREQ_RECEIVED: + ipoib_dbg(priv, "DREQ received.\n"); + ib_send_cm_drep(cm_id, NULL, 0); + break; + case IB_CM_REP_RECEIVED: + ipoib_dbg(priv, "REP received.\n"); + ret = ipoib_cm_rep_handler(cm_id, event); + if (ret) + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + break; + case IB_CM_REQ_ERROR: + case IB_CM_REJ_RECEIVED: + case IB_CM_TIMEWAIT_EXIT: + ipoib_dbg(priv, "CM error %d.\n", event->event); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + neigh = tx->neigh; + + if (neigh) { + neigh->cm = NULL; + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + + tx->neigh = NULL; + } + + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { + list_move(&tx->list, &priv->cm.reap_list); + queue_work(ipoib_workqueue, &priv->cm.reap_task); + } + + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); + break; + default: + break; + } + + return 0; +} + +struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path, + struct ipoib_neigh *neigh) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_tx *tx; + + tx = 
kzalloc(sizeof *tx, GFP_ATOMIC); + if (!tx) + return NULL; + + neigh->cm = tx; + tx->neigh = neigh; + tx->path = path; + tx->dev = dev; + list_add(&tx->list, &priv->cm.start_list); + set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags); + queue_work(ipoib_workqueue, &priv->cm.start_task); + return tx; +} + +void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) +{ + struct ipoib_dev_priv *priv = netdev_priv(tx->dev); + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { + list_move(&tx->list, &priv->cm.reap_list); + queue_work(ipoib_workqueue, &priv->cm.reap_task); + ipoib_dbg(priv, "Reap connection for gid " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(tx->neigh->dgid)); + tx->neigh = NULL; + } +} + +void ipoib_cm_tx_start(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + struct ipoib_cm_tx *p; + unsigned long flags; + int ret; + + struct ib_sa_path_rec pathrec; + u32 qpn; + + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + while (!list_empty(&priv->cm.start_list)) { + p = list_entry(priv->cm.start_list.next, typeof(*p), list); + list_del_init(&p->list); + neigh = p->neigh; + qpn = IPOIB_QPN(neigh->neighbour->ha); + memcpy(&pathrec, &p->path->pathrec, sizeof pathrec); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); + ret = ipoib_cm_tx_init(p, qpn, &pathrec); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + if (ret) { + neigh = p->neigh; + if (neigh) { + neigh->cm = NULL; + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + } + list_del(&p->list); + kfree(p); + } + } + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +void ipoib_cm_tx_reap(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_tx *p; + unsigned long flags; + + spin_lock_irqsave(&priv->tx_lock, flags); + 
spin_lock(&priv->lock); + while (!list_empty(&priv->cm.reap_list)) { + p = list_entry(priv->cm.reap_list.next, typeof(*p), list); + list_del(&p->list); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); + ipoib_cm_tx_destroy(p); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + } + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +int ipoib_cm_dev_init(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_srq_init_attr srq_init_attr = { + .attr = { + .max_wr = ipoib_recvq_size, + .max_sge = 1 + } + }; + int ret, i; + + INIT_LIST_HEAD(&priv->cm.passive_ids); + INIT_LIST_HEAD(&priv->cm.reap_list); + INIT_LIST_HEAD(&priv->cm.start_list); + INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start, dev); + INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap, dev); + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (ipoib_cm_alloc_rx_skb(dev, i)) { + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (ipoib_cm_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } + } + + priv->dev->dev_addr[0] = IPOIB_FLAGS_RC; + + return 0; +} + +void ipoib_cm_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i, ret; + + ipoib_dbg(priv, "Cleanup ipoib connected mode data.\n"); + + if (!priv->cm.srq) + return; + ret = ib_destroy_srq(priv->cm.srq); + if (ret) 
+ ipoib_warn(priv, "ib_destroy_srq failed: %d\n", ret); + + priv->cm.srq = NULL; + if (!priv->cm.srq_ring) + return; + for (i = 0; i < ipoib_recvq_size; ++i) + if (priv->cm.srq_ring[i].skb) { + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(&priv->cm.srq_ring[i], + mapping), + IPOIB_CM_BUF_SIZE, + DMA_FROM_DEVICE); + dev_kfree_skb_any(priv->cm.srq_ring[i].skb); + priv->cm.srq_ring[i].skb = NULL; + } + kfree(priv->cm.srq_ring); + priv->cm.srq_ring = NULL; +} diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 8bf5e9e..a4b2d21 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -119,6 +119,7 @@ static int ipoib_ib_post_receive(struct net_device *dev, int id) return ret; } + static int ipoib_alloc_rx_skb(struct net_device *dev, int id) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -273,10 +274,10 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; - if (netif_queue_stopped(dev) && - test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && - priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) + if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags) && + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) { netif_wake_queue(dev); + } spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && @@ -378,6 +379,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); } } } @@ -429,6 +431,13 @@ int ipoib_ib_dev_open(struct net_device *dev) return -1; } + ret = ipoib_cm_dev_open(dev); + if (ret) { + ipoib_warn(priv, "ipoib_ib_post_receives returned %d\n", ret); + ipoib_ib_dev_stop(dev); + return -1; + } + clear_bit(IPOIB_STOP_REAPER, &priv->flags); 
queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); @@ -514,6 +523,8 @@ int ipoib_ib_dev_stop(struct net_device *dev) clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); + ipoib_cm_dev_stop(dev); + /* * Move our QP to the error state and then reinitialize in * when all work requests have completed or have been flushed. @@ -603,6 +614,8 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return -ENODEV; } + ipoib_cm_dev_init(dev); + if (dev->flags & IFF_UP) { if (ipoib_ib_dev_open(dev)) { ipoib_transport_dev_cleanup(dev); @@ -659,6 +672,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) ipoib_mcast_stop_thread(dev, 1); ipoib_mcast_dev_flush(dev); + ipoib_cm_dev_cleanup(dev); ipoib_transport_dev_cleanup(dev); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 85522da..282c5ea 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -49,8 +49,6 @@ #include -#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) - MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); @@ -145,6 +143,8 @@ static int ipoib_stop(struct net_device *dev) netif_stop_queue(dev); + clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + /* * Now flush workqueue to make sure a scheduled task doesn't * bring our internal state back up. 
@@ -177,14 +177,27 @@ static int ipoib_stop(struct net_device *dev) static int ipoib_change_mtu(struct net_device *dev, int new_mtu) { struct ipoib_dev_priv *priv = netdev_priv(dev); + int old_mtu = dev->mtu; + + /* Simple heuristic: dev->mtu > 2K ==> connected mode */ + /* flush paths if we switch modes so that connections are restarted */ + if (IPOIB_CM_ENABLED(dev->dev_addr) && + new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN && + new_mtu <= IPOIB_CM_MTU) { + dev->mtu = new_mtu; + if (old_mtu <= IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + ipoib_flush_paths(dev); + return 0; + } if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) - return -EINVAL; + return -EINVAL; priv->admin_mtu = new_mtu; - dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + if (old_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + ipoib_flush_paths(dev); return 0; } @@ -414,6 +427,18 @@ static void path_rec_completion(int status, memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, sizeof(union ib_gid)); + if (ipoib_cm_enabled(dev, neigh->neighbour)) { + if (!neigh->cm) + neigh->cm = ipoib_cm_create_tx(dev, path, neigh); + if (!neigh->cm) { + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + continue; + } + } + while ((skb = __skb_dequeue(&neigh->queue))) __skb_queue_tail(&skqueue, skb); } @@ -522,7 +547,22 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw, sizeof(union ib_gid)); - ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); + if (ipoib_cm_enabled(dev, neigh->neighbour)) { + if (!neigh->cm) + neigh->cm = ipoib_cm_create_tx(dev, path, neigh); + if (!neigh->cm) { + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + goto err_drop; + } + if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) + __skb_queue_tail(&neigh->queue, skb); + else + goto err_drop; + } else + ipoib_send(dev, skb, path->ah, 
IPOIB_QPN(skb->dst->neighbour->ha)); } else { neigh->ah = NULL; __skb_queue_tail(&neigh->queue, skb); @@ -539,6 +579,7 @@ err_list: err_path: ipoib_neigh_free(neigh); +err_drop: ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -641,7 +682,12 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (likely(neigh->ah)) { + if (ipoib_cm_get(neigh)) { + if (test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags)) { + ipoib_cm_send(dev, skb, neigh->cm); + goto out; + } + } else if (neigh->ah) { if (unlikely(memcmp(&neigh->dgid.raw, skb->dst->neighbour->ha + 4, sizeof(union ib_gid)))) { @@ -805,6 +851,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) neigh->neighbour = neighbour; *to_ipoib_neigh(neighbour) = neigh; + neigh->cm = NULL; return neigh; } @@ -812,6 +859,8 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour) void ipoib_neigh_free(struct ipoib_neigh *neigh) { *to_ipoib_neigh(neigh->neighbour) = NULL; + if (neigh->cm) + ipoib_cm_destroy_tx(neigh->cm); kfree(neigh); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 3faa182..14337e9 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -594,7 +594,11 @@ void ipoib_mcast_join_task(void *dev_ptr) priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) - IPOIB_ENCAP_LEN; - dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); + + /* Simple heuristic: dev->mtu > 2K ==> connected mode. + * In this case do not touch dev->mtu. 
*/ + if (dev->mtu <= IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) + dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n"); -- MST From swise at opengridcomputing.com Tue Dec 5 08:27:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 10:27:12 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165334529.16087.69.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <1165249251.32724.26.camel@stevo-desktop> <45754DE3.1020505@ens-lyon.org> <1165334529.16087.69.camel@stevo-desktop> Message-ID: <1165336032.16087.89.camel@stevo-desktop> On Tue, 2006-12-05 at 10:02 -0600, Steve Wise wrote: > On Tue, 2006-12-05 at 11:45 +0100, Brice Goglin wrote: > > Steve Wise wrote: > > > There is no SW TCP stack in this driver. The HW supports RDMA over > > > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet > > > (aka iWARP). The device is a 10 GbE device, not Infiniband. > > > > Then, I wonder why the driver goes in drivers/infiniband/ :) > > drivers/infiniband support both IB and IWARP transports. > > > Is there really no way to only keep the actual hw infiniband there, move > > iwarp/rdma drivers in drivers/net/something/ and the core stuff in > > net/something/ ? > > > > Sure, this _could_ be done, but what I think you're missing is that > applications use the interface exported by drivers/infiniband over both > IB -and- IWARP transports. The application can be written to not care > which transport is used. Examples of apps that can run over both > transports using the same common interface: > > user mode: MVAPICH2, OMPI, IMPI, HPMPI, > kernel mode: NFS-RDMA, iSER. > > Note that the include directory used by drivers/infiniband is now > include/rdma. 
Perhaps drivers/infiniband should be renamed to > drivers/rdma as well at some point... By the way, FYI: The Chelsio T3 device support is split into 2 driver modules: the Ethernet driver and the RDMA driver. The Ethernet driver lives in drivers/net/cxgb3 while the RDMA driver lives in drivers/infiniband/hw/cxgb3. The Ethernet driver can be used stand-alone as a 10GbE high-performance NIC driver. The RDMA driver has a config-time dependency on the Ethernet driver. The 2nd version of the Ethernet driver was posted yesterday. See: http://www.spinics.net/lists/netdev/msg20464.html Steve. From johnpol at 2ka.mipt.ru Tue Dec 5 08:31:33 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 19:31:33 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165335162.16087.79.camel@stevo-desktop> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> Message-ID: <20061205163008.GA30211@2ka.mipt.ru> On Tue, Dec 05, 2006 at 10:12:42AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > Ah. Data from an offloaded connection cannot leak into the main stack > nor vice-verse. We can take an active RDMA connection establishment as > an example if you want: Once the message is sent to the HW to "setup a > TCP connection from addr/port a.b to addr/port c.d", then packets on > that connection (that 4-tuple) will always be delivered to the RDMA > driver, not the native stack. If the the packet received after the > connection is setup is -not- an MPA reply (in this example), then the > connection is aborted. Once the connection is aborted. So no leaking > can happen. 
And if there were a dataflow between addr/port a.b to addr/port c.d already, it will either terminated? Considering the following sequence: handlers->t3c_handlers->sched()->work_queue->work_handlers()->for example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming skb and sets port/addr/route and other fields to be used as a base for rdma connection. What if it just a usual network packet from kernelspace or userspace with the same payload as should be sent by remote rdma system? -- Evgeniy Polyakov From swise at opengridcomputing.com Tue Dec 5 08:47:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 10:47:25 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205163008.GA30211@2ka.mipt.ru> References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> <20061205163008.GA30211@2ka.mipt.ru> Message-ID: <1165337245.16087.95.camel@stevo-desktop> On Tue, 2006-12-05 at 19:31 +0300, Evgeniy Polyakov wrote: > On Tue, Dec 05, 2006 at 10:12:42AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > Ah. Data from an offloaded connection cannot leak into the main stack > > nor vice-verse. We can take an active RDMA connection establishment as > > an example if you want: Once the message is sent to the HW to "setup a > > TCP connection from addr/port a.b to addr/port c.d", then packets on > > that connection (that 4-tuple) will always be delivered to the RDMA > > driver, not the native stack. If the the packet received after the > > connection is setup is -not- an MPA reply (in this example), then the > > connection is aborted. Once the connection is aborted. 
So no leaking > > can happen. > > And if there were a dataflow between addr/port a.b to addr/port c.d > already, it will either terminated? > > Considering the following sequence: > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming > skb and sets port/addr/route and other fields to be used as a base for rdma > connection. What if it just a usual network packet from kernelspace or > userspace with the same payload as should be sent by remote rdma system? > That skb isn't a network packet. Its a CPL_PASS_ACCEPT_REQ message (see struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h). If the RDMA driver hadn't registered to listen on that addr/port, it would never get this skb. Once a connection is established, the MPA messages (and any TCP payload data) is delivered to the RDMA driver in the form of skb's containing struct cpl_rx_data. So these skbs aren't just TCP packets at all. They either control messages or TCP payload. Either way they are encapsulated in CPL message structures. Does this make sense? From rdreier at cisco.com Tue Dec 5 09:14:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 09:14:06 -0800 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <45754DE3.1020505@ens-lyon.org> (Brice Goglin's message of "Tue, 05 Dec 2006 11:45:55 +0100") References: <20061202224917.27014.15424.stgit@dell3.ogc.int> <20061202224958.27014.65970.stgit@dell3.ogc.int> <20061204110825.GA26251@2ka.mipt.ru> <1165249251.32724.26.camel@stevo-desktop> <45754DE3.1020505@ens-lyon.org> Message-ID: > Is there really no way to only keep the actual hw infiniband there, move > iwarp/rdma drivers in drivers/net/something/ and the core stuff in > net/something/ ? It's definitely possible, but rearranging the source tree hasn't been a high priority (for me at least). - R. 
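Steve Wise's point above, that the RDMA driver is handed skbs carrying CPL messages (CPL_PASS_ACCEPT_REQ for connection setup, struct cpl_rx_data for payload) rather than raw TCP packets, amounts to dispatching on a message opcode. The following is only a minimal user-space sketch of that idea in C. Every demo_* name, the opcode values, and the message layout are hypothetical illustrations; none of them are taken from t3_cpl.h or from the cxgb3 drivers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; the real values are defined by the hardware
 * interface and are not reproduced here. */
enum demo_cpl_opcode {
	DEMO_CPL_PASS_ACCEPT_REQ = 0,	/* connection-setup control message */
	DEMO_CPL_RX_DATA = 1,		/* TCP payload of an offloaded connection */
	DEMO_CPL_NUM_OPCODES
};

/* Hypothetical message layout: an opcode followed by an opaque body. */
struct demo_cpl_msg {
	uint8_t opcode;
	const void *body;
	size_t len;
};

enum demo_result {
	DEMO_UNHANDLED = -1,
	DEMO_CONN_SETUP = 1,
	DEMO_PAYLOAD = 2
};

typedef enum demo_result (*demo_handler)(const struct demo_cpl_msg *msg);

static enum demo_result demo_pass_accept_req(const struct demo_cpl_msg *msg)
{
	(void)msg;	/* a real handler would parse addr/port/route here */
	return DEMO_CONN_SETUP;
}

static enum demo_result demo_rx_data(const struct demo_cpl_msg *msg)
{
	(void)msg;	/* a real handler would pass the payload to MPA processing */
	return DEMO_PAYLOAD;
}

/* One handler per known opcode; unknown opcodes never reach a handler. */
static const demo_handler demo_handlers[DEMO_CPL_NUM_OPCODES] = {
	[DEMO_CPL_PASS_ACCEPT_REQ] = demo_pass_accept_req,
	[DEMO_CPL_RX_DATA] = demo_rx_data,
};

enum demo_result demo_dispatch(const struct demo_cpl_msg *msg)
{
	if (msg == NULL || msg->opcode >= DEMO_CPL_NUM_OPCODES)
		return DEMO_UNHANDLED;
	return demo_handlers[msg->opcode](msg);
}
```

The table lookup is just one way to express the dispatch. As described in the message above, a message only reaches this stage if the RDMA driver had registered to listen on the corresponding addr/port in the first place.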
From johnpol at 2ka.mipt.ru Tue Dec 5 09:32:22 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 20:32:22 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205172649.GA20229@2ka.mipt.ru> References: <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> <20061205163008.GA30211@2ka.mipt.ru> <1165337245.16087.95.camel@stevo-desktop> <20061205172649.GA20229@2ka.mipt.ru> Message-ID: <20061205173221.GB24149@2ka.mipt.ru> On Tue, Dec 05, 2006 at 08:26:49PM +0300, Evgeniy Polyakov (johnpol at 2ka.mipt.ru) wrote: > On Tue, Dec 05, 2006 at 10:47:25AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > And if there were a dataflow between addr/port a.b to addr/port c.d > > > already, it will either terminated? > > > > > > Considering the following sequence: > > > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for > > > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming > > > skb and sets port/addr/route and other fields to be used as a base for rdma > > > connection. What if it just a usual network packet from kernelspace or > > > userspace with the same payload as should be sent by remote rdma system? > > > > > > > That skb isn't a network packet. Its a CPL_PASS_ACCEPT_REQ message (see > > struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h). If the > > RDMA driver hadn't registered to listen on that addr/port, it would > > never get this skb. Once a connection is established, the MPA messages > > (and any TCP payload data) is delivered to the RDMA driver in the form > > of skb's containing struct cpl_rx_data. So these skbs aren't just TCP > > packets at all. They either control messages or TCP payload. Either way > > they are encapsulated in CPL message structures. 
> > > > Does this make sense? > > Almost - except the case about where those skbs are coming from? > It looks like they are obtained from network, since it is ethernet > driver, and if they match some set of rules, they are considered as valid > MPA negotiation protocol. > > If it is correct, it means that any packet in the network can be > potentially 'stolen' by rdma hardware, although it was part of the usual > dataflow. > If that packets are not from ethernet network, but from different > low-level, then there is a question (besides why this driver is called > ethernet if it manages different hardware) about how connection over > that different media is being setup and since packets contain perfectly > valid IP addresses and ports. It looks like I've answered myself - it is _not_ ethernet driver, but rdma one, and although it gets all data through skbs from ethernet driver, the latter gets them not from ethernet network. And thus addresses and ports and all other information can not be mixed between the two. -- Evgeniy Polyakov From halr at voltaire.com Tue Dec 5 09:26:56 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 12:26:56 -0500 Subject: [openib-general] [PATCH 2/5] opensm: trivial indentation fixes in osm_switch.h In-Reply-To: <11645802143335-git-send-email-sashak@voltaire.com> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802143335-git-send-email-sashak@voltaire.com> Message-ID: <1165339569.25587.73006.camel@hal.voltaire.com> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > Couple of trivial indentation fixes in osm_switch.h. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From johnpol at 2ka.mipt.ru Tue Dec 5 09:26:50 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 20:26:50 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165337245.16087.95.camel@stevo-desktop> References: <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> <20061205163008.GA30211@2ka.mipt.ru> <1165337245.16087.95.camel@stevo-desktop> Message-ID: <20061205172649.GA20229@2ka.mipt.ru> On Tue, Dec 05, 2006 at 10:47:25AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > And if there were a dataflow between addr/port a.b to addr/port c.d > > already, it will either terminated? > > > > Considering the following sequence: > > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for > > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming > > skb and sets port/addr/route and other fields to be used as a base for rdma > > connection. What if it just a usual network packet from kernelspace or > > userspace with the same payload as should be sent by remote rdma system? > > > > That skb isn't a network packet. Its a CPL_PASS_ACCEPT_REQ message (see > struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h). If the > RDMA driver hadn't registered to listen on that addr/port, it would > never get this skb. Once a connection is established, the MPA messages > (and any TCP payload data) is delivered to the RDMA driver in the form > of skb's containing struct cpl_rx_data. So these skbs aren't just TCP > packets at all. They either control messages or TCP payload. Either way > they are encapsulated in CPL message structures. > > Does this make sense? Almost - except the case about where those skbs are coming from? 
It looks like they are obtained from network, since it is ethernet driver, and if they match some set of rules, they are considered as valid MPA negotiation protocol. If it is correct, it means that any packet in the network can be potentially 'stolen' by rdma hardware, although it was part of the usual dataflow. If that packets are not from ethernet network, but from different low-level, then there is a question (besides why this driver is called ethernet if it manages different hardware) about how connection over that different media is being setup and since packets contain perfectly valid IP addresses and ports. And, btw, not related question - does postponing the whole skb multiplexing to work queue result in lower latency and/or higher speed? Since there are a lot of tricks introduced to minimize gap between interrupt/napi polling and protocol processing, so such huge postponing with the whole context switch looks strange. -- Evgeniy Polyakov From halr at voltaire.com Tue Dec 5 09:12:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 12:12:19 -0500 Subject: [openib-general] [PATCH 1/5] opensm: eliminate global variable osm in updn In-Reply-To: <11645802093253-git-send-email-sashak@voltaire.com> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802093253-git-send-email-sashak@voltaire.com> Message-ID: <1165338719.25587.72460.camel@hal.voltaire.com> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > Routing engine setup function for up/down already gets reference to osm > object as parameter - we can keep this reference as part of updn_t > structure rather than to use global variable for referencing osm object. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From swise at opengridcomputing.com Tue Dec 5 09:51:40 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 11:51:40 -0600 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205172649.GA20229@2ka.mipt.ru> References: <20061204110825.GA26251@2ka.mipt.ru> <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> <20061205163008.GA30211@2ka.mipt.ru> <1165337245.16087.95.camel@stevo-desktop> <20061205172649.GA20229@2ka.mipt.ru> Message-ID: <1165341100.16087.109.camel@stevo-desktop> On Tue, 2006-12-05 at 20:26 +0300, Evgeniy Polyakov wrote: > On Tue, Dec 05, 2006 at 10:47:25AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > > And if there were a dataflow between addr/port a.b to addr/port c.d > > > already, it will either terminated? > > > > > > Considering the following sequence: > > > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for > > > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming > > > skb and sets port/addr/route and other fields to be used as a base for rdma > > > connection. What if it just a usual network packet from kernelspace or > > > userspace with the same payload as should be sent by remote rdma system? > > > > > > > That skb isn't a network packet. Its a CPL_PASS_ACCEPT_REQ message (see > > struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h). If the > > RDMA driver hadn't registered to listen on that addr/port, it would > > never get this skb. Once a connection is established, the MPA messages > > (and any TCP payload data) is delivered to the RDMA driver in the form > > of skb's containing struct cpl_rx_data. So these skbs aren't just TCP > > packets at all. They either control messages or TCP payload. 
Either way > > they are encapsulated in CPL message structures. > > > > Does this make sense? > > Almost - except the case about where those skbs are coming from? > It looks like they are obtained from network, since it is ethernet > driver, and if they match some set of rules, they are considered as valid > MPA negotiation protocol. They come from the Ethernet driver, but that driver manages multiple HW queues and these packets come from an offload queue, not the NIC queue. So the HW demultiplexes. Perhaps Divy or Felix from Chelsio can expand on how the Ethernet driver manages this? > > If it is correct, it means that any packet in the network can be > potentially 'stolen' by rdma hardware, although it was part of the usual > dataflow. > If that packets are not from ethernet network, but from different > low-level, then there is a question (besides why this driver is called > ethernet if it manages different hardware) about how connection over > that different media is being setup and since packets contain perfectly > valid IP addresses and ports. The HW has different queues for offload vs native Ethernet frames. I'm not an expert on the Ethernet driver, so you'll have to consult that code and ask questions of Divy and/or Felix. > And, btw, not related question - does postponing the whole skb multiplexing > to work queue result in lower latency and/or higher speed? > Since there are a lot of tricks introduced to minimize gap between > interrupt/napi polling and protocol processing, so such huge postponing > with the whole context switch looks strange. > Neither. The work queue makes the RDMA driver's life easier because it has context to allocate skbs, for instance. Note all the work queue stuff is done _only_ for RDMA connection setup and teardown. Once the connection is in RDMA mode, there's no work queues at all for IO, and CQ notifications happen in interrupt context. RDMA operations are submitted to the hardware via iwch_post_send(). 
Completion notification is done in the interrupt context via iwch_ev_dispatch(). And completion entries are reaped by the consumer application via iwch_poll_cq(). Steve. From halr at voltaire.com Tue Dec 5 10:07:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 13:07:40 -0500 Subject: [openib-general] [PATCH 3/5] opensm: routing engine improvements In-Reply-To: <1164580219695-git-send-email-sashak@voltaire.com> References: <11645802043173-git-send-email-sashak@voltaire.com> <1164580219695-git-send-email-sashak@voltaire.com> Message-ID: <1165342043.25587.74435.camel@hal.voltaire.com> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > This prevents lid matrix rebuilding with up/down algorithm when it is > not required (a.e. when root nodes are specified by user), consolidates > routing engine methods and simplifies default LFT creation flow. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From johnpol at 2ka.mipt.ru Tue Dec 5 10:09:39 2006 From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov) Date: Tue, 5 Dec 2006 21:09:39 +0300 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <1165341100.16087.109.camel@stevo-desktop> References: <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> <20061205163008.GA30211@2ka.mipt.ru> <1165337245.16087.95.camel@stevo-desktop> <20061205172649.GA20229@2ka.mipt.ru> <1165341100.16087.109.camel@stevo-desktop> Message-ID: <20061205180939.GA26384@2ka.mipt.ru> On Tue, Dec 05, 2006 at 11:51:40AM -0600, Steve Wise (swise at opengridcomputing.com) wrote: > > Almost - except the case about where those skbs are coming from? > > It looks like they are obtained from network, since it is ethernet > > driver, and if they match some set of rules, they are considered as valid > > MPA negotiation protocol. 
> > They come from the Ethernet driver, but that driver manages multiple HW > queues and these packets come from an offload queue, not the NIC queue. > So the HW demultiplexes. Ok, thanks for the explanation. -- Evgeniy Polyakov From xma at us.ibm.com Tue Dec 5 10:11:19 2006 From: xma at us.ibm.com (Shirley Ma) Date: Tue, 5 Dec 2006 10:11:19 -0800 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: <20061205161944.GD30209@mellanox.co.il> Message-ID: Michael, >The idea is to increase performance by increasing the MTU >from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD. >With this code, I'm able to get 800MByte/sec or more with netperf >without options on a Mellanox 4x back-to-back DDR system. What about CPU utilization? Thanks Shirley Ma From mshefty at ichips.intel.com Tue Dec 5 09:57:39 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 09:57:39 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: <457582E8.8030705@dev.mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> Message-ID: <4575B313.1010604@ichips.intel.com> Dotan Barak wrote: > This test does the following scenario: > restart the driver > start a user level application that allocate N multicast groups (it > is being executed in the background) How does the application allocate the multicast groups? Does this involve the kernel multicast module? The multicast module expects that all join/leave operations go through it. Can you produce this crash using only the multicast code and ipoib? 
- Sean From rdreier at cisco.com Tue Dec 5 10:26:16 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 10:26:16 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: <4575B313.1010604@ichips.intel.com> (Sean Hefty's message of "Tue, 05 Dec 2006 09:57:39 -0800") References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> Message-ID: > How does the application allocate the multicast groups? Does this involve the > kernel multicast module? The multicast module expects that all join/leave > operations go through it. Can you produce this crash using only the multicast > code and ipoib? Dotan attached the full source of the test. It just seems to attach local QPs to MCGs without talking to the SA at all. So it's not using the multicast module, but on the other hand I can't see why what it does would have any relevance to the crash. - R. From steve.apo at googlemail.com Tue Dec 5 10:27:27 2006 From: steve.apo at googlemail.com (Steven Wooding) Date: Tue, 5 Dec 2006 18:27:27 +0000 Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could be wrong? In-Reply-To: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> Message-ID: <2cfcf21e0612051027s2c1d45cbk134d0a6ac94f480@mail.gmail.com> Hi again, OK, so I've narrowed it down to the write() function returning -1, indicating an error. The value of errno I get is EINVAL, but indicates the file descriptor is not valid. However, I've checked the file descriptor value and its listing in the lsof output and all looks fine. Just looking at the code in cm.c, how does the CM_CREATE_MSG_CMD macro work? 
I can't seem to see where the "msg" parameter gets to point to the "cmd" parameter. Just curious, as I know that the cmpost example application works fine. Any ideas? By the way, I'm using OFED 1.1. Thanks, Steve. On 05/12/06, Steven Wooding wrote: > > Hi, > > In my application I keep getting -1 returned by a call to ib_cm_send_req() > function. The cmpost example application works fine, so I can rule out > system set-up issues. > > I could do with a clue as to what the -1 means and then hopefully correct > my application. > > Thanks, > > Steve. > From dotanb at dev.mellanox.co.il Tue Dec 5 10:55:29 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Tue, 5 Dec 2006 20:55:29 +0200 (IST) Subject: [openib-general] oops with multicast patches In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> Message-ID: <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il> > > Dotan attached the full source of the test. > > It just seems to attach local QPs to MCGs without talking to the SA at > all. So it's not using the multicast module, but on the other hand I > can't see why what it does would have any relevance to the crash. > > - R. Roland is right: this test only attaches QPs to the mcg using verbs call (ibv_attach_mcast). I hope to improve this test during the next week(s) and add support to the multicast module (or library, if available). 
Dotan From rdreier at cisco.com Tue Dec 5 10:57:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 10:57:38 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il> (dotanb@dev.mellanox.co.il's message of "Tue, 5 Dec 2006 20:55:29 +0200 (IST)") References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il> Message-ID: > Roland is right: this test only attaches QPs to the mcg using verbs call > (ibv_attach_mcast). What I don't understand is why the test has any effect on the kernel at all. How could creating QPs and attaching them to MCGs with verbs calls cause the crash? - R. From halr at voltaire.com Tue Dec 5 10:58:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 13:58:24 -0500 Subject: [openib-general] [PATCH 4/5] opensm: clean non used LFT entries, update only changed blocks In-Reply-To: <11645802241342-git-send-email-sashak@voltaire.com> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802241342-git-send-email-sashak@voltaire.com> Message-ID: <1165345060.25587.76545.camel@hal.voltaire.com> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > This uses temporary buffer (one per OpenSM) for LFT entries generation. > In this way old (actually "invalid") LFT entries are not preserved > anymore and we can send update requests for only changed LFT blocks > rather than whole table rewriting. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From rdreier at cisco.com Tue Dec 5 11:00:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 11:00:36 -0800 Subject: [openib-general] [PATCH] IB/ipath: Remove unused "write-only" variables Message-ID: Remove variables that are set but then never looked at in the ipath driver. These cleanups came from David Binderman's list of "set but never used" warnings from icc. Signed-off-by: Roland Dreier --- Bryan, does this look OK to merge? drivers/infiniband/hw/ipath/ipath_driver.c | 4 +--- drivers/infiniband/hw/ipath/ipath_file_ops.c | 5 ++--- drivers/infiniband/hw/ipath/ipath_iba6110.c | 3 +-- drivers/infiniband/hw/ipath/ipath_iba6120.c | 6 +++--- drivers/infiniband/hw/ipath/ipath_init_chip.c | 3 +-- drivers/infiniband/hw/ipath/ipath_intr.c | 3 +-- drivers/infiniband/hw/ipath/ipath_sysfs.c | 3 --- 7 files changed, 9 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c index 1aeddb4..ae7f21a 100644 --- a/drivers/infiniband/hw/ipath/ipath_driver.c +++ b/drivers/infiniband/hw/ipath/ipath_driver.c @@ -1825,8 +1825,6 @@ void ipath_write_kreg_port(const struct */ void ipath_shutdown_device(struct ipath_devdata *dd) { - u64 val; - ipath_dbg("Shutting down the device\n"); dd->ipath_flags |= IPATH_LINKUNK; @@ -1849,7 +1847,7 @@ void ipath_shutdown_device(struct ipath_ */ ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, 0ULL); /* flush it */ - val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); /* * enough for anything that's going to trickle out to have actually * done so. 
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c index a9ddc69..ddbcabd 100644 --- a/drivers/infiniband/hw/ipath/ipath_file_ops.c +++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c @@ -699,7 +699,6 @@ static int ipath_manage_rcvq(struct ipat int start_stop) { struct ipath_devdata *dd = pd->port_dd; - u64 tval; ipath_cdbg(PROC, "%sabling rcv for unit %u port %u:%u\n", start_stop ? "en" : "dis", dd->ipath_unit, @@ -729,7 +728,7 @@ static int ipath_manage_rcvq(struct ipat ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, dd->ipath_rcvctrl); /* now be sure chip saw it before we return */ - tval = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); if (start_stop) { /* * And try to be sure that tail reg update has happened too. @@ -738,7 +737,7 @@ static int ipath_manage_rcvq(struct ipat * in memory copy, since we could overwrite an update by the * chip if we did. */ - tval = ipath_read_ureg32(dd, ur_rcvhdrtail, pd->port_port); + ipath_read_ureg32(dd, ur_rcvhdrtail, pd->port_port); } /* always; new head should be equal to new tail; see above */ bail: diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c index e57c7a3..7468477 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -1447,7 +1447,7 @@ static void ipath_ht_tidtemplate(struct static int ipath_ht_early_init(struct ipath_devdata *dd) { u32 __iomem *piobuf; - u32 pioincr, val32, egrsize; + u32 pioincr, val32; int i; /* @@ -1467,7 +1467,6 @@ static int ipath_ht_early_init(struct ip * errors interrupts if we ever see one). 
*/ dd->ipath_rcvegrbufsize = dd->ipath_piosize2k; - egrsize = dd->ipath_rcvegrbufsize; /* * the min() check here is currently a nop, but it may not diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c index 6af8968..397da34 100644 --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c @@ -602,7 +602,7 @@ static void ipath_pe_init_hwerrors(struc */ static int ipath_pe_bringup_serdes(struct ipath_devdata *dd) { - u64 val, tmp, config1, prev_val; + u64 val, config1, prev_val; int ret = 0; ipath_dbg("Trying to bringup serdes\n"); @@ -633,7 +633,7 @@ static int ipath_pe_bringup_serdes(struc | INFINIPATH_SERDC0_L1PWR_DN; ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val); /* be sure chip saw it */ - tmp = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); udelay(5); /* need pll reset set at least for a bit */ /* * after PLL is reset, set the per-lane Resets and TxIdle and @@ -647,7 +647,7 @@ static int ipath_pe_bringup_serdes(struc "and txidle (%llx)\n", (unsigned long long) val); ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val); /* be sure chip saw it */ - tmp = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); + ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch); /* need PLL reset clear for at least 11 usec before lane * resets cleared; give it a few more to be sure */ udelay(15); diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c index d819cca..d4f6b52 100644 --- a/drivers/infiniband/hw/ipath/ipath_init_chip.c +++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c @@ -347,10 +347,9 @@ done: static int init_chip_reset(struct ipath_devdata *dd, struct ipath_portdata **pdp) { - struct ipath_portdata *pd; u32 rtmp; - *pdp = pd = dd->ipath_pd[0]; + *pdp = dd->ipath_pd[0]; /* ensure chip does no sends or receives while we re-initialize */ 
dd->ipath_control = dd->ipath_sendctrl = dd->ipath_rcvctrl = 0U; ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, 0); diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c index 5652a55..72b9e27 100644 --- a/drivers/infiniband/hw/ipath/ipath_intr.c +++ b/drivers/infiniband/hw/ipath/ipath_intr.c @@ -598,10 +598,9 @@ static int handle_errors(struct ipath_de * on close */ if (errs & INFINIPATH_E_RRCVHDRFULL) { - int any; u32 hd, tl; ipath_stats.sps_hdrqfull++; - for (any = i = 0; i < dd->ipath_cfgports; i++) { + for (i = 0; i < dd->ipath_cfgports; i++) { struct ipath_portdata *pd = dd->ipath_pd[i]; if (i == 0) { hd = dd->ipath_port0head; diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c index 182de34..ffa6318 100644 --- a/drivers/infiniband/hw/ipath/ipath_sysfs.c +++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c @@ -215,7 +215,6 @@ static ssize_t store_mlid(struct device size_t count) { struct ipath_devdata *dd = dev_get_drvdata(dev); - int unit; u16 mlid; int ret; @@ -223,8 +222,6 @@ static ssize_t store_mlid(struct device if (ret < 0 || mlid < IPATH_MULTICAST_LID_BASE) goto invalid; - unit = dd->ipath_unit; - dd->ipath_mlid = mlid; goto bail; -- 1.4.3.2 From rdreier at cisco.com Tue Dec 5 11:10:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 11:10:07 -0800 Subject: [openib-general] [PATCH] IB/iser: Remove unused "write-only" variables Message-ID: Remove variables that are set but then never looked at in the iSER initiator. These cleanups came from David Binderman's list of "set but never used" warnings from icc. Signed-off-by: Roland Dreier --- Erez, does this look OK to merge? 
drivers/infiniband/ulp/iser/iser_initiator.c | 4 ---- drivers/infiniband/ulp/iser/iser_memory.c | 3 +-- 2 files changed, 1 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index 9b3d79c..e73c87b 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -487,10 +487,8 @@ int iser_send_control(struct iscsi_conn struct iscsi_iser_conn *iser_conn = conn->dd_data; struct iser_desc *mdesc = mtask->dd_data; struct iser_dto *send_dto = NULL; - unsigned int itt; unsigned long data_seg_len; int err = 0; - unsigned char opcode; struct iser_regd_buf *regd_buf; struct iser_device *device; @@ -512,8 +510,6 @@ int iser_send_control(struct iscsi_conn iser_reg_single(device, send_dto->regd[0], DMA_TO_DEVICE); - itt = ntohl(mtask->hdr->itt); - opcode = mtask->hdr->opcode & ISCSI_OPCODE_MASK; data_seg_len = ntoh24(mtask->hdr->dlength); if (data_seg_len > 0) { diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index 0606744..e5a1091 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -234,7 +234,7 @@ static int iser_sg_to_page_vec(struct is { struct scatterlist *sg = (struct scatterlist *)data->buf; dma_addr_t first_addr, last_addr, page; - int start_aligned, end_aligned; + int end_aligned; unsigned int cur_page = 0; unsigned long total_sz = 0; int i; @@ -248,7 +248,6 @@ static int iser_sg_to_page_vec(struct is first_addr = sg_dma_address(&sg[i]); last_addr = first_addr + sg_dma_len(&sg[i]); - start_aligned = !(first_addr & ~MASK_4K); end_aligned = !(last_addr & ~MASK_4K); /* continue to collect page fragments till aligned or SG ends */ -- 1.4.3.2 From mshefty at ichips.intel.com Tue Dec 5 10:26:19 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 10:26:19 -0800 Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. 
What could be wrong? In-Reply-To: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> Message-ID: <4575B9CB.5070507@ichips.intel.com> > In my application I keep getting -1 returned by a call to > ib_cm_send_req() function. The cmpost example application works fine, so > I can rule out system set-up issues. This is probably an error being returned from the kernel. Does errno give any more insight? - Sean From mshefty at ichips.intel.com Tue Dec 5 10:31:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 10:31:33 -0800 Subject: [openib-general] OFED 1.2 features update In-Reply-To: <45759B8C.8010408@dev.mellanox.co.il> References: <45759B8C.8010408@dev.mellanox.co.il> Message-ID: <4575BB05.7040106@ichips.intel.com> > 4. Sean should prepare patches or git tree for kernel code that is not > upstream (e.g. SA cache) I created a kernel git tree with branches for most of the code that was in svn, but not upstream. (The SA cache is the last missing piece that needs to be added.) Branches were made for the rdma_ucm, multicast support, utilities such as madeye, and kernel test apps. Branches were also added to the librdmacm to match with the rdma_ucm and multicast branches. - Sean From rdreier at cisco.com Tue Dec 5 11:12:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 11:12:48 -0800 Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in c2_qp_modify. In-Reply-To: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com> (Krishna Kumar's message of "Mon, 04 Dec 2006 09:14:57 +0530") References: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: Looks right to me. Tom/Steve, should I merge this? 
> --- org/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-15 12:40:04.000000000 +0530 > +++ new/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-16 18:10:03.000000000 +0530 > @@ -161,8 +161,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s > > if (attr_mask & IB_QP_STATE) { > /* Ensure the state is valid */ > - if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) > - return -EINVAL; > + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) { > + err = -EINVAL; > + goto bail0; > + } > > wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); > > @@ -184,9 +186,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s > if (attr->cur_qp_state != IB_QPS_RTR && > attr->cur_qp_state != IB_QPS_RTS && > attr->cur_qp_state != IB_QPS_SQD && > - attr->cur_qp_state != IB_QPS_SQE) > - return -EINVAL; > - else > + attr->cur_qp_state != IB_QPS_SQE) { > + err = -EINVAL; > + goto bail0; > + } else > wr.next_qp_state = > cpu_to_be32(to_c2_state(attr->cur_qp_state)); > > From dotanb at dev.mellanox.co.il Tue Dec 5 11:15:03 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Tue, 5 Dec 2006 21:15:03 +0200 (IST) Subject: [openib-general] oops with multicast patches In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il> Message-ID: <41181.194.90.237.34.1165346103.squirrel@dev.mellanox.co.il> > > Roland is right: this test only attaches QPs to the mcg using verbs > call > > (ibv_attach_mcast). > > What I don't understand is why the test has any affect on the kernel > at all. How could creating QPs and attaching them to MCGs with verbs > calls cause the crash?? 
Every time IPoIB finds out that there is a new SM (via the client reregister or LID change event), it tries to join several (3-4) mcgs. The user level application uses almost all of the multicast groups in the machine. Only 1 to 4 mcgs (depending on the test case) are left for IPoIB to use: it starts to attach (and maybe join) all the mcgs that it needs and fails when it reaches the HCA's mcg limit. Maybe there is a problem with the multicast module when the attach to multicast group verb fails? Dotan From swise at opengridcomputing.com Tue Dec 5 11:17:17 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 05 Dec 2006 13:17:17 -0600 Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in c2_qp_modify. In-Reply-To: References: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: <1165346238.16087.128.camel@stevo-desktop> Yes, this looks correct. Sorry I missed this, or I would have acked it earlier... Steve. On Tue, 2006-12-05 at 11:12 -0800, Roland Dreier wrote: > Looks right to me. Tom/Steve, should I merge this?
> > > --- org/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-15 12:40:04.000000000 +0530 > > +++ new/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-16 18:10:03.000000000 +0530 > > @@ -161,8 +161,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s > > > > if (attr_mask & IB_QP_STATE) { > > /* Ensure the state is valid */ > > - if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) > > - return -EINVAL; > > + if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) { > > + err = -EINVAL; > > + goto bail0; > > + } > > > > wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state)); > > > > @@ -184,9 +186,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s > > if (attr->cur_qp_state != IB_QPS_RTR && > > attr->cur_qp_state != IB_QPS_RTS && > > attr->cur_qp_state != IB_QPS_SQD && > > - attr->cur_qp_state != IB_QPS_SQE) > > - return -EINVAL; > > - else > > + attr->cur_qp_state != IB_QPS_SQE) { > > + err = -EINVAL; > > + goto bail0; > > + } else > > wr.next_qp_state = > > cpu_to_be32(to_c2_state(attr->cur_qp_state)); > > > > From rdreier at cisco.com Tue Dec 5 11:18:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 11:18:08 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: <41181.194.90.237.34.1165346103.squirrel@dev.mellanox.co.il> (dotanb@dev.mellanox.co.il's message of "Tue, 5 Dec 2006 21:15:03 +0200 (IST)") References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il> <41181.194.90.237.34.1165346103.squirrel@dev.mellanox.co.il> Message-ID: > The user level application uses almost all of the multicast groups in the > machine. 
Only 1,2,3,4 mcgs (depends on the test case) are available for > the IPoIB to use: it start to attach (and maybe join) all the mcgs that it > needs and fails when it reaches the HCA mcgs limit. Ohh... I see. > Maybe there is a problem with the multicast module when the attach to > multicast group verb fails? I guess it would be in the ipoib driver error handling or how it interacts with the multicast module, since the attach to multicast happens in the ipoib driver, not the multicast module. Sean, does this give you any ideas of what to look at? - R. From bos at pathscale.com Tue Dec 5 11:29:15 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 05 Dec 2006 11:29:15 -0800 Subject: [openib-general] [PATCH/RFC] busted request IRQ for PCIe ipath HCAs In-Reply-To: References: Message-ID: <4575C88B.9000207@pathscale.com> Roland Dreier wrote: > Bryan/anyone at Qlogic, does this look right? It worked for me, so if > this is what was intended, I will queue the patch for 2.6.20 and > submit to stable at kernel.org for 2.6.19.x. > Yes, this looks correct to me. References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> <4575B9CB.5070507@ichips.intel.com> Message-ID: <2cfcf21e0612051128k59f32e99u42cd7e761063786f@mail.gmail.com> Hi Sean, Yeah, in my second post I said that errno was EINVAL just after the ib_cm_send_req() call, which I assume was from the write() call. Or did you mean something else? Steve. On 05/12/06, Sean Hefty wrote: > > > In my application I keep getting -1 returned by a call to > > ib_cm_send_req() function. The cmpost example application works fine, so > > I can rule out system set-up issues. > > This is probably an error being returned from the kernel. Does errno give > any > more insight? > > - Sean
From bos at pathscale.com Tue Dec 5 11:31:15 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 05 Dec 2006 11:31:15 -0800 Subject: [openib-general] [PATCH] IB/ipath: Remove unused "write-only" variables In-Reply-To: References: Message-ID: <4575C903.5010605@pathscale.com> Roland Dreier wrote: > Remove variables that are set but then never looked at in the ipath > driver. These cleanups came from David Binderman's list of "set but > never used" warnings from icc. > > Signed-off-by: Roland Dreier Acked-by: Bryan O'Sullivan After sending a CM DREQ with ib_send_cm_dreq(), is it OK to destroy the cm_id without waiting for a DREP ? This is of course assuming that we are not really concerned if the DREQ reached the other end or not. Regards, Ram From sean.hefty at intel.com Tue Dec 5 11:33:25 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 5 Dec 2006 11:33:25 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: Message-ID: <000001c718a4$3c313730$8698070a@amr.corp.intel.com> > > Maybe there is a problem with the multicast module when the attach to > > multicast group verb fails? > >I guess it would be in the ipoib driver error handling or how it >interacts with the multicast module, since the attach to multicast >happens in the ipoib driver, not the multicast module. > >Sean, does this give you any ideas of what to look at? I think so. Thanks for the feedback. Hopefully I can reproduce this now.
From halr at voltaire.com Tue Dec 5 11:38:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 14:38:04 -0500 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down routing engines In-Reply-To: <11645802302048-git-send-email-sashak@voltaire.com> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802302048-git-send-email-sashak@voltaire.com> Message-ID: <1165347459.25587.78224.camel@hal.voltaire.com> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > This updates "file" and "updn" (up/down) routing engines which should > work properly now with changed LFT setup mechanism. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From tziporet at dev.mellanox.co.il Tue Dec 5 11:50:44 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 05 Dec 2006 21:50:44 +0200 Subject: [openib-general] OFED 1.2 features update In-Reply-To: <4575BB05.7040106@ichips.intel.com> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> Message-ID: <4575CD94.8070608@dev.mellanox.co.il> Sean Hefty wrote: > > I created a kernel git tree with branches for most of the code that > was in svn, but not upstream. (The SA cache is the last missing piece > that needs to be added.) Branches were made for the rdma_ucm, > multicast support, utilities such as madeye, and kernel test apps. > Branches were also added to the librdmacm to match with the rdma_ucm > and multicast branches. > > - Sean great - we will work to integrate them. BTW - where are those trees located? 
Tziporet From mshefty at ichips.intel.com Tue Dec 5 10:48:51 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 10:48:51 -0800 Subject: [openib-general] oops with multicast patches In-Reply-To: References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> Message-ID: <4575BF13.5010607@ichips.intel.com> Roland Dreier wrote: > It just seems to attach local QPs to MCGs without talking to the SA at > all. So it's not using the multicast module, but on the other hand I > can't see why what it does would have any relevance to the crash. From his scenario, I thought that there were two applications running, but I could be off. From looking at the attached file, I didn't see a relation between that code and any crash. Dotan, can you clarify what you meant by "allocate N multicast groups" and "later application get the mcgs"? From your description, does this crash (always?) occur on the node that's running the SM? I have not run this code in the SM node, so I will try this. - Sean From mshefty at ichips.intel.com Tue Dec 5 11:22:14 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 11:22:14 -0800 Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could be wrong? In-Reply-To: <2cfcf21e0612051027s2c1d45cbk134d0a6ac94f480@mail.gmail.com> References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> <2cfcf21e0612051027s2c1d45cbk134d0a6ac94f480@mail.gmail.com> Message-ID: <4575C6E6.8060000@ichips.intel.com> > OK, so I've narrowed it down to the write() function returning the -1, > indicating an error. The value of errno I get is EINVAL, but indicates > the file descriptor is not valid. 
However, I've check the file > descriptor value and it's listing in the lsof output and all looks fine. My guess is that one of the values set in ib_cm_req_param is off. It could be a byte-ordering issue, or maybe the path record has invalid fields. Posting your cm_req_param values might help identify the problem. - Sean From dotanb at dev.mellanox.co.il Tue Dec 5 12:23:40 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Tue, 5 Dec 2006 22:23:40 +0200 (IST) Subject: [openib-general] oops with multicast patches In-Reply-To: <4575BF13.5010607@ichips.intel.com> References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com> <20061204142214.GA5426@mellanox.co.il> <20061204152624.GA8269@mellanox.co.il> <45746375.5010107@ichips.intel.com> <45756002.3030806@dev.mellanox.co.il> <457582E8.8030705@dev.mellanox.co.il> <4575B313.1010604@ichips.intel.com> <4575BF13.5010607@ichips.intel.com> Message-ID: <46814.194.90.237.34.1165350220.squirrel@dev.mellanox.co.il> > Roland Dreier wrote: >> It just seems to attach local QPs to MCGs without talking to the SA at >> all. So it's not using the multicast module, but on the other hand I >> can't see why what it does would have any relevance to the crash. > > From his scenario, I thought that there were two applications running, > but I > could be off. From looking at the attached file, I didn't see a relation > between that code and any crash. Dotan, can you clarify what you meant by > "allocate N multicast groups" and "later application get the mcgs"? No, there is only one application and a script that executes the SM and the application (but it takes some time until this application allocates all the mcgs it will use). > > From your description, does this crash (always?) occur on the node that's > running the SM? I have not run this code in the SM node, so I will try > this. No. I executed this test in back2back configuration and both of the machines hanged. 
However, most of the time (4/5) the machine that executed the SM got the crash ... Dotan From mshefty at ichips.intel.com Tue Dec 5 12:02:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 12:02:41 -0800 Subject: [openib-general] ib_send_cm_dreq() and cm_id doubt In-Reply-To: References: Message-ID: <4575D061.3010808@ichips.intel.com> > After sending a CM DREQ with ib_send_cm_dreq(), is it OK to destroy the cm_id > without waiting for a DREP ? This is of course assuming that we are not > really concerned if the DREQ reached the other end or not. Yes - you can even destroy the cm_id before calling ib_send_cm_dreq(), which will result in sending a DREQ if the cm_id is still connected. - Sean From mshefty at ichips.intel.com Tue Dec 5 12:03:52 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 05 Dec 2006 12:03:52 -0800 Subject: [openib-general] OFED 1.2 features update In-Reply-To: <4575CD94.8070608@dev.mellanox.co.il> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> Message-ID: <4575D0A8.7080501@ichips.intel.com> > BTW - where are those trees located? My trees are available from the staging.openfabrics.org/git site. I called the kernel tree rdma-dev. - Sean From rdreier at cisco.com Tue Dec 5 12:27:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 12:27:56 -0800 Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in c2_qp_modify. 
In-Reply-To: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com> (Krishna Kumar's message of "Mon, 04 Dec 2006 09:14:57 +0530") References: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com> Message-ID: OK, applied for 2.6.20 From halr at voltaire.com Tue Dec 5 12:31:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 05 Dec 2006 15:31:18 -0500 Subject: [openib-general] IPoIB and MC Group leaving In-Reply-To: <1165248082.25587.8839.camel@hal.voltaire.com> References: <1165243803.25587.5906.camel@hal.voltaire.com> <1165248082.25587.8839.camel@hal.voltaire.com> Message-ID: <1165350656.25587.80533.camel@hal.voltaire.com> On Mon, 2006-12-04 at 11:01, Hal Rosenstock wrote: > On Mon, 2006-12-04 at 10:49, Roland Dreier wrote: > > > This is to make sure node is not registered in any groups. This leave > > > may not be successful. Failure is "normal" when the subnet is starting > > > up "fresh". There are other cases where the failure is indeed a failure. > > > > As far as I know, IPoIB will not leave a group unless it thinks it has > > joined the group. What is the code path for a "preemptive" leave? > > OK maybe I have that part wrong but what about the other part: Roland, > The fact that a leave doesn't wait for the response and then a join is > issued. I think there is a race condition here perhaps triggered by > client reregistration. Are the leave and join for the same port and group done by the same or different threads within IPoIB ? Is there any way they can be reordered so that the join occurs before the leave rather than the other way around ? It does appear that the leave is sent once and only once (it is not retried as far as I can tell). -- Hal > -- Hal > > > - R. 
From rdreier at cisco.com Tue Dec 5 12:37:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 12:37:12 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> (Ralph Campbell's message of "Fri, 1 Dec 2006 18:08:42 -0800 (PST)") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> Message-ID: I think this seems reasonable. And I think it also provides a way to address some hypothetical future situation where lowmem pages don't have a kernel virtual address -- you would just have to use this type of cookie implementation everywhere. (Although I don't think using kmap()/kunmap() is really the right approach -- you should probably just do kmap_atomic()/kunmap_atomic() only while you are actually using the page. But the basic approach of using the dma address as a cookie into a mapping table seems sound to me -- you are basically doing a real sw iotlb) Or -- does this seem reasonable to you? - R.
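As a rough illustration of the approach discussed above (the DMA address acting as a cookie into a mapping table), here is a minimal userspace sketch in C. It is not the ipath code: every name in it (cookie_map, cookie_resolve, cookie_unmap, cookie_demo, the 20-bit offset field, the 64-slot table) is hypothetical. It assumes the cookie keeps a slot index in its high bits and a byte offset in its low bits, so that a consumer adding a small offset to the returned value still resolves into the same mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cookie layout: slot index in the high bits, byte offset below. */
#define COOKIE_OFFSET_BITS 20u
#define COOKIE_OFFSET_MASK ((UINT64_C(1) << COOKIE_OFFSET_BITS) - 1)
#define COOKIE_TABLE_SLOTS 64u
#define COOKIE_INVALID     UINT64_MAX

struct cookie_slot {
	void  *kvaddr; /* stands in for a kernel virtual address */
	size_t len;    /* 0 marks a free slot */
};

static struct cookie_slot cookie_table[COOKIE_TABLE_SLOTS];

/* Record a buffer and hand back a cookie in place of a bus address. */
static uint64_t cookie_map(void *ptr, size_t len)
{
	unsigned int i;

	/* Keep len below the offset field so cookie + len never carries. */
	if (ptr == NULL || len == 0 || len > COOKIE_OFFSET_MASK)
		return COOKIE_INVALID;

	for (i = 0; i < COOKIE_TABLE_SLOTS; i++) {
		if (cookie_table[i].len == 0) {
			cookie_table[i].kvaddr = ptr;
			cookie_table[i].len = len;
			return (uint64_t)i << COOKIE_OFFSET_BITS;
		}
	}
	return COOKIE_INVALID; /* table full */
}

/* Translate a cookie (plus any consumer-added offset) to a pointer, or NULL. */
static void *cookie_resolve(uint64_t cookie, size_t access_len)
{
	uint64_t slot = cookie >> COOKIE_OFFSET_BITS;
	uint64_t off = cookie & COOKIE_OFFSET_MASK;
	const struct cookie_slot *s;

	if (slot >= COOKIE_TABLE_SLOTS)
		return NULL;
	s = &cookie_table[slot];
	if (s->len == 0 || off > s->len || access_len > s->len - off)
		return NULL;
	return (char *)s->kvaddr + off;
}

/* Release the slot that a cookie refers to. */
static void cookie_unmap(uint64_t cookie)
{
	uint64_t slot = cookie >> COOKIE_OFFSET_BITS;

	if (slot < COOKIE_TABLE_SLOTS) {
		cookie_table[slot].kvaddr = NULL;
		cookie_table[slot].len = 0;
	}
}

/* Small self-check: the base cookie and cookie + offset hit the same buffer. */
static void cookie_demo(void)
{
	static char buf[4096];
	uint64_t y = cookie_map(buf, sizeof(buf));

	assert(y != COOKIE_INVALID);
	assert(cookie_resolve(y, 1) == (void *)buf);
	assert(cookie_resolve(y + 512, 64) == (void *)(buf + 512));
	assert(cookie_resolve(y + 4090, 64) == NULL); /* runs past the end */
	cookie_unmap(y);
	assert(cookie_resolve(y, 1) == NULL);
}
```

One limitation of this layout is worth keeping in mind: an offset large enough to carry out of the low 20 bits would select a different slot, so a real design would have to bound mapping sizes or validate cookies more strictly than this sketch does.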
From rdreier at cisco.com Tue Dec 5 12:40:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 12:40:03 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> (Ralph Campbell's message of "Fri, 1 Dec 2006 18:08:42 -0800 (PST)") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> Message-ID: Something weird happened with your mail setup -- your email seemed to come from ralph.campbel at qlogic.com (only one "L" in your last name). Anyway I assume you saw my response on the email list. Also I forgot to mention one more thing: if you repost your patch set with the sync_single fix that Or found then I am inclined to merge this for 2.6.20 unless Or or someone else objects. - R. From rdreier at cisco.com Tue Dec 5 12:47:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 12:47:14 -0800 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: <20061205161944.GD30209@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 5 Dec 2006 18:19:44 +0200") References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> Message-ID: OK, just a very quick scan through: > +ib_ipoib-$(INFINIBAND_IPOIB_CM) += ipoib_cm.o Does this actually work in the Makefile without the CONFIG_ prefix? I don't think it's intended anyway... > +#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) > + > + trim one of these blank lines... 
> + IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */ > + IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, > + skb = dev_alloc_skb(IPOIB_CM_BUF_SIZE + 12); This means every RX buffer is an order-4 allocation (with 4K pages). I think that has to be fixed for us to consider this, or else connected mode is basically useless on a loaded system. > + IPOIB_FLAG_NETIF_STOPPED = 9, I can't follow what this is used for. Can you explain in small words? Why is this: > +struct ipoib_cm_dev_priv { > + struct ib_cq *cq; > + struct ib_srq *srq; > + struct ipoib_rx_buf *srq_ring; > + struct ib_cm_id *id; > + struct list_head passive_ids; > + struct work_struct start_task; > + struct work_struct reap_task; > + struct list_head start_list; > + struct list_head reap_list; > + struct ib_wc ibwc[IPOIB_NUM_WC]; > +}; > + > /* > * Device private locking: tx_lock protects members used in TX fast > * path (and we use LLTX so upper layers don't do extra locking). > @@ -179,6 +226,8 @@ struct ipoib_dev_priv { > struct list_head child_intfs; > struct list_head list; > > + struct ipoib_cm_dev_priv cm; > + outside of CONFIG_INFINIBAND_IPOIB_CM (so struct ipoib_dev_priv is significantly larger even with CM off), but this: > +#ifdef CONFIG_INFINIBAND_IPOIB_CM > + > +#define IPOIB_FLAGS_RC 0x80 > +#define IPOIB_FLAGS_UC 0x40 is inside? > +#define IPOIB_CM_ENABLED(ha) (ha[0] & IPOIB_FLAGS_RC) Should that be +#define IPOIB_CM_ENABLED(ha) (ha[0] & (IPOIB_FLAGS_RC | IPOIB_FLAGS_UC)) (I know you don't implement UC at all but if you're going to define the flag, there's no point in setting a trap for the future...) - R. 
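To make the allocation-size remark above concrete, the arithmetic can be sketched in a few lines of C. This is only an illustration: alloc_order and the EX_ constants are made-up names, IPOIB_ENCAP_LEN is assumed to be 4, pages are assumed to be 4K, and any headroom or bookkeeping the skb allocator adds on top is ignored (so the real order can only be the same or larger).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest order such that (page_size << order) >= bytes. */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size > 0);
	assert(bytes <= SIZE_MAX / 2); /* keeps the doubling below from wrapping */

	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}

enum {
	EX_IPOIB_ENCAP_LEN   = 4,                /* assumed value */
	EX_IPOIB_CM_MTU      = 0x10000 - 0x10,   /* as in the quoted patch */
	EX_IPOIB_CM_BUF_SIZE = EX_IPOIB_CM_MTU + EX_IPOIB_ENCAP_LEN,
	EX_RX_ALLOC_SIZE     = EX_IPOIB_CM_BUF_SIZE + 12, /* dev_alloc_skb() argument */
};
```

With these assumptions, alloc_order(EX_RX_ALLOC_SIZE, 4096) works out to 4: the requested size is exactly 64K, which needs 16 contiguous 4K pages per receive buffer.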
From or.gerlitz at gmail.com Tue Dec 5 13:09:15 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Tue, 5 Dec 2006 23:09:15 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1164910957.14800.71.camel@brick.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> Message-ID: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> On 11/30/06, Ralph Campbell wrote: > diff -r c76ed2f1387b include/rdma/ib_verbs.h > --- a/include/rdma/ib_verbs.h Wed Nov 29 13:28:14 2006 +0800 > +++ b/include/rdma/ib_verbs.h Wed Nov 29 13:54:37 2006 -0800 > +struct ib_dma_mapping_ops { > + int (*mapping_error)(struct ib_device *dev, > + u64 dma_addr); > + u64 (*map_single)(struct ib_device *dev, > + void *ptr, size_t size, > + enum dma_data_direction direction); > + void (*unmap_single)(struct ib_device *dev, > + u64 addr, size_t size, > + enum dma_data_direction direction); > + u64 (*map_page)(struct ib_device *dev, > + struct page *page, unsigned long offset, > + size_t size, > + enum dma_data_direction direction); > + void (*unmap_page)(struct ib_device *dev, > + u64 addr, size_t size, > + enum dma_data_direction direction); > + int (*map_sg)(struct ib_device *dev, > + struct scatterlist *sg, int nents, > + enum dma_data_direction direction); > + void (*unmap_sg)(struct ib_device *dev, > + struct scatterlist *sg, int nents, > + enum dma_data_direction direction); > + u64 (*dma_address)(struct ib_device *dev, > + struct scatterlist *sg); > + unsigned int (*dma_len)(struct ib_device *dev, > + struct scatterlist *sg); > + void (*sync_single_for_cpu)(struct ib_device *dev, > + u64 dma_handle, > + size_t size, > + enum dma_data_direction dir); > + void (*sync_single_for_device)(struct ib_device *dev, > + u64 dma_handle, > + size_t size, > + enum dma_data_direction dir); > }; This structure misses some functions which are members of struct dma_mapping_ops. 
The most notable omission to me is dma_alloc/free_coherent. Please note that an IB consumer can call dma_alloc_coherent and place the resulting dma_addr_t within an SGE provided to ibv_post_send/recv; see the RDS code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds I also see in struct dma_mapping_ops something called dma_map_simple; I am not sure what it does or who can use it. Or. From or.gerlitz at gmail.com Tue Dec 5 13:21:29 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Tue, 5 Dec 2006 23:21:29 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> Message-ID: <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> On 12/5/06, Roland Dreier wrote: > I think this seems reasonable. And I think it also provides a way to > address some hypothetical future situation where lowmem pages don't > have a kernel virtual address -- you would just have to use this > type of cookie implementation everywhere. Such an approach would be much cleaner and result in far fewer changes (~zero) at the ULP level: just replace dma_map_xxx calls with ib_dma_map_xxx calls. A problem I see with the dma_addr_t being a cookie into a table of kv addresses is that it's legal for a consumer to use a dma_addr_t with an **offset**. So she gets addr y from ib_dma_map_xxx and then uses y + offset in the SGE provided to ibv_post_send/recv or to the fmr map function.
So this table is actually a search tree which allows you to match an offset-ed dma_addr_t, derived from the one returned by the dma_map_xxx that ipath's ib_dma_map_xxx calls, with its associated kvaddr. I see now that I have managed to confuse myself, because, as Roland wrote below and I have agreed, we don't actually have the kv addr for an unmapped page before the ipath driver maps it, i.e. when it attempts to use the page... It is getting late here... am I inventing a nonexistent problem with the offset? > (Although I don't think using kmap()/kunmap() is really the right > approach -- you should probably just do kmap_atomic()/kunmap_atomic() > only while you are actually using the page. But the basic approach of > using the dma address as a cookie into a mapping table seems sound to > me -- you are basically doing a real sw iotlb) > > Or -- does this seem reasonable to you? I agree that care should be taken to do kmap_atomic/kunmap_atomic only when the ipath driver actually needs to access the page. Or. From or.gerlitz at gmail.com Tue Dec 5 13:24:55 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Tue, 5 Dec 2006 23:24:55 +0200 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> Message-ID: <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com> On 12/1/06, Sean Hefty wrote: > The following set of patches expand the rdma_cm support to include > UDP port space, and expose the rdma_cm to userspace. Multicast > support has been removed from the patches until the ib_multicast > module can be further debugged. > > Adding in multicast support later will result in new APIs and an > ABI bump, but I do not anticipate multicast changing any of the > existing interfaces. (I'm also less confident that the multicast > ABIs are correct.) > > Without the multicast interfaces, I believe what's left is ready to > merge upstream.
Roland, what's the status of this patchset? It would be very useful to have rdma cm userspace support enabled in 2.6.20, and without the multicast code I don't see why it shouldn't be merged. Or. From rdreier at cisco.com Tue Dec 5 13:28:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 13:28:20 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> (Or Gerlitz's message of "Tue, 5 Dec 2006 23:09:15 +0200") References: <1164910957.14800.71.camel@brick.pathscale.com> <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> Message-ID: > This structure misses some functions which are members of struct > dma_mapping_ops. I don't think we have to wrap every possible function if no IB consumer uses it. > The most notable miss to me is dma_alloc/free_coherent, please note > that an IB consumer can call dma_alloc_coherent and place the resulted > dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS > code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct > usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under > under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds Given that some use of the dma_alloc_coherent interface exists though, I do think it makes sense to wrap it. So Ralph can you please add that to your resubmission too (in addition to fixing the sync_single issue). Any other issues Or?
(BTW thanks for helping review this and pointing out some good issues) From rdreier at cisco.com Tue Dec 5 13:31:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 13:31:20 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> (Or Gerlitz's message of "Tue, 5 Dec 2006 23:21:29 +0200") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> Message-ID: > A problem see with the dma_addr_t being a cookie into a table of kv > addresses is that its legal for a consumer to use dma_addr_t with an > **offset** . So she gets addr y from ib_dma_map_xxx and then uses y + > offset in the SGE provided to ibv_post_send/recv or to the fmr map > function. Yes, that is a little bit of an issue. But I think it just means the ipath driver needs to keep page tables exactly the way an IOTLB would -- ugly but not impossible to handle. > I see now that i have managed to confuse myself b/c as Roland wrote > below and i have agreed we don't actually have the kv addr for and > unmapped page before the ipath driver maps it ie when it attempt to > use the page... It becomes late here... am i inventing a non existant > problem with the offset? The dma address doesn't have to be a kvaddr -- it is purely an address space defined by the low-level driver. - R. 
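To make the "cookie plus offset" point above concrete, here is a minimal userspace sketch of a software IOTLB in which the DMA address is a driver-defined cookie rather than a kernel virtual address. It is not the ipath implementation; every sk_* name, the slot count, and the bump allocator are hypothetical choices for illustration only. The cookie keeps the page offset in its low bits and a table slot in its high bits, so an address of the form cookie + offset still resolves, including across page boundaries within one mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((uint64_t)1 << SK_PAGE_SHIFT)
#define SK_SLOTS      64

/* Hypothetical software IOTLB: one slot per mapped page. */
struct sk_iotlb {
    void    *page[SK_SLOTS]; /* address of each mapped page */
    unsigned next;           /* bump allocator: next free slot */
};

/* Map npages pages starting at base into consecutive slots.
 * Returns the cookie for the first byte, or 0 if the table is full. */
static uint64_t sk_map(struct sk_iotlb *t, void *base, unsigned npages)
{
    unsigned first, i;

    assert(t != NULL);
    first = t->next;
    if (npages == 0 || npages > SK_SLOTS - t->next)
        return 0;
    for (i = 0; i < npages; i++)
        t->page[first + i] = (char *)base + (size_t)i * SK_PAGE_SIZE;
    t->next += npages;
    /* Slot numbers start at 1 so that 0 is never a valid cookie. */
    return (uint64_t)(first + 1) << SK_PAGE_SHIFT;
}

/* Resolve a cookie, possibly with an offset added by the consumer. */
static void *sk_lookup(const struct sk_iotlb *t, uint64_t cookie)
{
    uint64_t slot = cookie >> SK_PAGE_SHIFT;
    uint64_t off  = cookie & (SK_PAGE_SIZE - 1);

    if (slot == 0 || slot > t->next)
        return NULL;
    return (char *)t->page[slot - 1] + off;
}
```

In this sketch the pages behind consecutive slots happen to be contiguous, but nothing requires that: each slot could point at an unrelated page, which is the property a scatterlist mapping would need.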
From rdreier at cisco.com Tue Dec 5 13:32:11 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 13:32:11 -0800 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com> (Or Gerlitz's message of "Tue, 5 Dec 2006 23:24:55 +0200") References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com> Message-ID: > What's the status of this patchset? it would be somehow very usefull > to have rdma cm user space support enablement in 2.6.20 and without > the multicast code i don't see why not merging it. I would like to merge it, but I need to find time to read it over carefully. Have you read this patch set over? Do you have any comments about anything? - R. From rdreier at cisco.com Tue Dec 5 13:52:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 13:52:14 -0800 Subject: [openib-general] [PATCH] IPoIB CM Experimental support References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> Message-ID: Reading a little more: > + /* Simple heuristic: dev->mtu > 2K ==> connected mode */ I'm not sure this is such a good idea. I think it's setting a trap for people if we have magic behavior -- eg just imagine the questions if changing the MTU makes multicast stop working. - R. From ralph.campbell at qlogic.com Tue Dec 5 13:58:16 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 05 Dec 2006 13:58:16 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> Message-ID: <1165355896.14800.185.camel@brick.pathscale.com> On Fri, 2006-12-01 at 15:15 -0800, Roland Dreier wrote: > Oh yeah, one other thing... 
> > could you respin this so that all the new dma_xxx wrappers go into a > new file like (and include that from > )? ib_verbs.h is already too big I think. I can move the definition for struct ib_dma_mapping_ops to a separate header file but if I move the inline functions and include the header file at the top of ib_verbs.h, then the struct ib_device is not defined and the compiler complains. I could put the #include after the definition of struct ib_device but I'm not sure how acceptable that is for coding style. Do you still want me to make this change? From ralph.campbell at qlogic.com Tue Dec 5 14:20:52 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 05 Dec 2006 14:20:52 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> Message-ID: <1165357252.14800.192.camel@brick.pathscale.com> On Tue, 2006-12-05 at 23:09 +0200, Or Gerlitz wrote: > On 11/30/06, Ralph Campbell wrote: > > diff -r c76ed2f1387b include/rdma/ib_verbs.h > > --- a/include/rdma/ib_verbs.h Wed Nov 29 13:28:14 2006 +0800 > > +++ b/include/rdma/ib_verbs.h Wed Nov 29 13:54:37 2006 -0800 > > +struct ib_dma_mapping_ops { > > + int (*mapping_error)(struct ib_device *dev, > > + u64 dma_addr); > > + u64 (*map_single)(struct ib_device *dev, > > + void *ptr, size_t size, > > + enum dma_data_direction direction); > > + void (*unmap_single)(struct ib_device *dev, > > + u64 addr, size_t size, > > + enum dma_data_direction direction); > > + u64 (*map_page)(struct ib_device *dev, > > + struct page *page, unsigned long offset, > > + size_t size, > > + enum dma_data_direction direction); > > + void (*unmap_page)(struct ib_device *dev, > > + u64 addr, size_t size, > > + enum dma_data_direction direction); > > + int (*map_sg)(struct 
ib_device *dev, > > + struct scatterlist *sg, int nents, > > + enum dma_data_direction direction); > > + void (*unmap_sg)(struct ib_device *dev, > > + struct scatterlist *sg, int nents, > > + enum dma_data_direction direction); > > + u64 (*dma_address)(struct ib_device *dev, > > + struct scatterlist *sg); > > + unsigned int (*dma_len)(struct ib_device *dev, > > + struct scatterlist *sg); > > + void (*sync_single_for_cpu)(struct ib_device *dev, > > + u64 dma_handle, > > + size_t size, > > + enum dma_data_direction dir); > > + void (*sync_single_for_device)(struct ib_device *dev, > > + u64 dma_handle, > > + size_t size, > > + enum dma_data_direction dir); > > }; > > This structure misses some functions which are members of struct > dma_mapping_ops. > > The most notable miss to me is dma_alloc/free_coherent, please note > that an IB consumer can call dma_alloc_coherent and place the resulted > dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS > code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct > usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under > under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds This looks like a very different version of RDS from what was in SVN a month ago. The SVN version didn't call alloc_dma_coherent(). > Also I see in struct dma_mapping_ops also something called > dma_map_simple not sure what it does and who can use it. > > Or. I don't see anything with "simple" in the name. There is one call to dma_map_single() in the inline function for ib_dma_map_single() if the ib_device.dma_ops is NULL. 
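The fallback described above (use the device's ops table when one is installed, otherwise call the ordinary DMA mapping function) can be sketched as follows. This is a userspace mock, not the code from the patch: the sk_* types and functions, the fake bus offset, and the identity ops are all hypothetical stand-ins for struct ib_device, dma_map_single() and a driver-supplied ops table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_device;

/* Hypothetical stand-in for struct ib_dma_mapping_ops (one member only). */
struct sk_dma_ops {
    uint64_t (*map_single)(struct sk_device *dev, void *ptr, size_t size);
};

/* Hypothetical stand-in for struct ib_device. */
struct sk_device {
    const struct sk_dma_ops *dma_ops; /* NULL means "use the default path" */
};

#define SK_FAKE_BUS_OFFSET 0x1000u

/* Stand-in for the platform dma_map_single(): pretend bus address. */
static uint64_t sk_default_map_single(void *ptr, size_t size)
{
    (void)size;
    return (uint64_t)(uintptr_t)ptr + SK_FAKE_BUS_OFFSET;
}

/* A driver that interposes and hands back the virtual address unchanged. */
static uint64_t sk_identity_map_single(struct sk_device *dev, void *ptr,
                                       size_t size)
{
    (void)dev;
    (void)size;
    return (uint64_t)(uintptr_t)ptr;
}

static const struct sk_dma_ops sk_identity_ops = {
    .map_single = sk_identity_map_single,
};

/* Wrapper in the style of ib_dma_map_single(): dispatch or fall back. */
static uint64_t sk_dma_map_single(struct sk_device *dev, void *ptr, size_t size)
{
    assert(dev != NULL);
    if (dev->dma_ops)
        return dev->dma_ops->map_single(dev, ptr, size);
    return sk_default_map_single(ptr, size);
}
```

A device that leaves dma_ops NULL behaves exactly as before, which is why consumers only need to switch to the wrapper and do not need to know which kind of device they are using.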
From ralph.campbell at qlogic.com Tue Dec 5 14:21:52 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 05 Dec 2006 14:21:52 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1164910957.14800.71.camel@brick.pathscale.com> <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> Message-ID: <1165357312.14800.193.camel@brick.pathscale.com> On Tue, 2006-12-05 at 13:28 -0800, Roland Dreier wrote: > > This structure misses some functions which are members of struct > > dma_mapping_ops. > > I don't think we have to wrap every possible function if no IB > consumer uses it. > > > The most notable miss to me is dma_alloc/free_coherent, please note > > that an IB consumer can call dma_alloc_coherent and place the resulted > > dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS > > code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct > > usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under > > under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds > > Given that some use of the dma_alloc_coherent interface exists though, > I do think it makes sense to wrap it. So Ralph can you please add > that to your resubmission too (in addition to fixing the sync_single > issue). OK. > Any other issues Or? 
(BTW thanks for helping review this and pointing > out some good issues) From bugzilla-daemon at openib.org Tue Dec 5 14:58:18 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 5 Dec 2006 14:58:18 -0800 (PST) Subject: [openib-general] [Bug 286] "ifconfig ib# down" hangs telnet connection-- NETDEV WATCHDOG: ib0: transmit timed out Message-ID: <20061205225818.BC7B02283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=286 ------- Comment #2 from amir.vetry at sun.com 2006-12-05 14:58 ------- This issue was also reproducible with other Sun's platform (Andromeda's blade) and OFED 1.1 driver. The following are the system used to reproduced this problem: - Linux 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux - SUSE LINUX Enterprise Server 9 (x86_64) VERSION = 9, PATCHLEVEL = 3 Error message in /var/log/message* ================================= Nov 29 04:23:27 kernel: NETDEV WATCHDOG: ib0: transmit timed out Nov 29 04:23:27 kernel: ib0: transmit timeout: latency 6290010 msecs Nov 29 04:23:27 kernel: ib0: queue stopped 1, tx_head 1433713921, tx_tail 1433713886 IB-HCA detail information: ========================== hca_id: mthca0 fw_ver: 4.7.600 node_guid: 0002:c902:0021:83bc sys_image_guid: 0003:ba00:0100:d050 vendor_id: 0x03ba vendor_part_id: 25208 hw_ver: 0xA0 board_id: SUN0050000001 phys_port_cnt: 2 port: 1 state: PORT_ARMED (3) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 7 port_lid: 7 port_lmc: 0x00 port: 2 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 7 port_lid: 9 port_lmc: 0x00 hca_id: mthca1 fw_ver: 4.7.600 node_guid: 0002:c902:0040:0458 sys_image_guid: 0003:ba00:0100:d050 vendor_id: 0x03ba vendor_part_id: 25208 hw_ver: 0xA0 board_id: SUN0050000001 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 7 port_lid: 2 port_lmc: 0x00 port: 2 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 7 
port_lid: 3 port_lmc: 0x00 ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From ralph.campbell at qlogic.com Tue Dec 5 14:59:20 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 05 Dec 2006 14:59:20 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> Message-ID: <1165359560.14800.210.camel@brick.pathscale.com> On Tue, 2006-12-05 at 23:21 +0200, Or Gerlitz wrote: > On 12/5/06, Roland Dreier wrote: > > I think this seems reasonable. And I think it also provides a way to > > address some hypothetical future situation where lowmem pages don't > > have a kernel virtual address -- you would just have to use this > > type of cookie implementation everywhere. > > Such an approach would be much more cleaner and result in much less > (~zero changes) in the ulp level, just replace dma_map_xxx calls with > ib_dma_map_xxx calls. > > A problem see with the dma_addr_t being a cookie into a table of kv > addresses is that its legal for a consumer to use dma_addr_t with an > **offset** . So she gets addr y from ib_dma_map_xxx and then uses y + > offset in the SGE provided to ibv_post_send/recv or to the fmr map > function. > > So this table is actually a search tree which allows you to match an > offset-ed dma_addr_t returned by dma_map_xxx called by ipath > ib_dma_map_xxx with its associated kvaddr. 
> > I see now that i have managed to confuse myself b/c as Roland wrote > below and i have agreed we don't actually have the kv addr for and > unmapped page before the ipath driver maps it ie when it attempt to > use the page... It becomes late here... am i inventing a non existant > problem with the offset? > > > (Although I don't think using kmap()/kunmap() is really the right > > approach -- you should probably just do kmap_atomic()/kunmap_atomic() > > only while you are actually using the page. But the basic approach of > > using the dma address as a cookie into a mapping table seems sound to > > me -- you are basically doing a real sw iotlb) > > > > Or -- does this seem reasonable to you? > > I agree that care should be made to do kmap_atomic/kunmap_atomic only > when there is actual need to access the page by the ipath driver. > > Or. I am not following what you two are saying. The ib_dma_mapping_ops functions, as implemented by ib_ipath, redefine dma_addr_t as a kernel virtual address. When ib_dma_map_single() is called, this is a NOP. When ib_dma_map_sg() is called, the dma_map_sg() replacement needs to convert a struct page pointer into a kernel virtual address. When CONFIG_HIGHMEM is defined, some pages may not be mapped into the kernel virtual address space so the driver needs to call kmap(). Since the driver can't use the struct scatterlist to store the kmap() result, a separate table needs to be used so the value can be returned by ib_sg_dma_address(). Doing kmap_atomic() at the point where the kernel virtual address is used is not practical since the driver is not mapping dma_addr_t to struct page *, although it is possible to write it that way. It would mean that ib_dma_map_single() would then be more complex in that a kernel virtual address would need to be converted to a struct page *.
From ralphc at pathscale.com Tue Dec 5 15:10:56 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Tue, 05 Dec 2006 15:10:56 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1165359560.14800.210.camel@brick.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> <1165359560.14800.210.camel@brick.pathscale.com> Message-ID: <1165360256.14800.213.camel@brick.pathscale.com> On Tue, 2006-12-05 at 14:59 -0800, Ralph Campbell wrote: > The ib_dma_mapping_ops functions as implemented by ib_ipath, > are redefining dma_addr_t as a kernel virtual address. > When ib_dma_map_single() is called, this is a NOP. > When ib_dma_map_sg() is called, the dma_map_sg() replacement needs > to convert a struct page pointer into a kernel virtual address. > When CONFIG_HIGHMEM is defined, some pages may not be mapped > into the kernel virtual address space so the driver needs to > call kmap(). Since the driver can't use the struct scattergather > to store the kmap() result, a separate table needs to be used > so the value can be returned by ib_sg_dma_address(). > > Doing kmap_atomic() at the point where the kernel virtual > address is used is not practical since the driver is not > mapping dma_addr_t to struct page * although it is > possible to write it that way. It would mean that > ib_map_single() would then be more complex in that a > kernel virtual address would need to be converted to a > struct page *. I forgot this last part. 
Making dma_addr_t a kernel virtual address does allow the result to be offset (at least within a page) but making dma_addr_t a struct page pointer doesn't. From rdreier at cisco.com Tue Dec 5 15:45:08 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 05 Dec 2006 15:45:08 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1165355896.14800.185.camel@brick.pathscale.com> (Ralph Campbell's message of "Tue, 05 Dec 2006 13:58:16 -0800") References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <1165355896.14800.185.camel@brick.pathscale.com> Message-ID: > I can move the definition for struct ib_dma_mapping_ops to a > separate header file but if I move the inline functions > and include the header file at the top of ib_verbs.h, > then the struct ib_device is not defined and the compiler > complains. I could put the #include > after the definition of struct ib_device but I'm not sure > how acceptable that is for coding style. > > Do you still want me to make this change? No, that's OK. I have a few ideas, but let's merge this basically as is and then we can play around with it further. - R. 
From krause at cup.hp.com Tue Dec 5 17:27:14 2006 From: krause at cup.hp.com (Michael Krause) Date: Tue, 05 Dec 2006 17:27:14 -0800 Subject: [openib-general] [PATCH v2 04/13] Connection Manager In-Reply-To: <20061205180939.GA26384@2ka.mipt.ru> References: <20061205050725.GA26033@2ka.mipt.ru> <1165330925.16087.13.camel@stevo-desktop> <20061205151905.GA18275@2ka.mipt.ru> <1165333198.16087.53.camel@stevo-desktop> <20061205155932.GA32380@2ka.mipt.ru> <1165335162.16087.79.camel@stevo-desktop> <20061205163008.GA30211@2ka.mipt.ru> <1165337245.16087.95.camel@stevo-desktop> <20061205172649.GA20229@2ka.mipt.ru> <1165341100.16087.109.camel@stevo-desktop> <20061205180939.GA26384@2ka.mipt.ru> Message-ID: <6.2.0.14.2.20061205172536.086fa438@esmail.cup.hp.com> If you require more details on how this all works - it was fully explored in the IETF RDDP workgroup - may I suggest a reading of the RDMA Security Considerations draft which goes through many of the issues on how one relates to a host stack. This complements the MPA spec and supports much of what Steve has already responded to during this string of e-mails. We took a great deal of time and debate to insure this can work efficiently and without confusion in terms of who owns what and when. Mike At 10:09 AM 12/5/2006, Evgeniy Polyakov wrote: >On Tue, Dec 05, 2006 at 11:51:40AM -0600, Steve Wise >(swise at opengridcomputing.com) wrote: > > > Almost - except the case about where those skbs are coming from? > > > It looks like they are obtained from network, since it is ethernet > > > driver, and if they match some set of rules, they are considered as > valid > > > MPA negotiation protocol. > > > > They come from the Ethernet driver, but that driver manages multiple HW > > queues and these packets come from an offload queue, not the NIC queue. > > So the HW demultiplexes. > >Ok, thanks for explaination. 
> >-- > Evgeniy Polyakov > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit >http://openib.org/mailman/listinfo/openib-general From mst at mellanox.co.il Tue Dec 5 23:20:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 09:20:02 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061206072002.GB26787@mellanox.co.il> Roland, thanks for the comments, I'll work on addressing them. Regarding your question: > > + IPOIB_FLAG_NETIF_STOPPED = 9, > > I can't follow what this is used for. Can you explain in small words? Send queue overrun prevention. The current code stops the interface if the send queue gets full, and starts it again after a sufficient number of send completions arrive. I generalized it to: stop the interface if *some* send queue becomes full, and start it again after send completions *for that send queue* arrive. So when I get a send completion, I need to know that the interface was stopped because *this* queue was full, and start the interface only in this case. -- MST From mst at mellanox.co.il Tue Dec 5 23:26:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 09:26:04 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061206072604.GC26787@mellanox.co.il> > Reading a little more: > > > + /* Simple heuristic: dev->mtu > 2K ==> connected mode */ > > I'm not sure this is such a good idea. I think it's setting a trap > for people if we have magic behavior -- eg just imagine the questions > if changing the MTU makes multicast stop working. I know. Still, this only happens if you enable CM. Maybe it will help to mention this in the comment in Kconfig? Log a message as well? What do you think?
I have a notion that once this code is upstream we can work on ways to teach kernel about net devices where MTU changes dynamically. Or possibly, some tricks with icmp can make it work. -- MST From mst at mellanox.co.il Tue Dec 5 23:29:34 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 09:29:34 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> Message-ID: <20061206072934.GD26787@mellanox.co.il> > Quoting r. Roland Dreier : > Subject: Re: [PATCH] IPoIB CM Experimental support > > OK, just a very quick scan through: > > > +ib_ipoib-$(INFINIBAND_IPOIB_CM) += ipoib_cm.o > > Does this actually work in the Makefile without the CONFIG_ prefix? I > don't think it's intended anyway... It does seem to work (try it :) ), but I agree this should be fixed. -- MST From ogerlitz at voltaire.com Wed Dec 6 00:03:40 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 06 Dec 2006 10:03:40 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1165357252.14800.192.camel@brick.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> <1165357252.14800.192.camel@brick.pathscale.com> Message-ID: <4576795C.2050903@voltaire.com> Ralph Campbell wrote: > On Tue, 2006-12-05 at 23:09 +0200, Or Gerlitz wrote: >> The most notable miss to me is dma_alloc/free_coherent, please note >> that an IB consumer can call dma_alloc_coherent and place the resulted >> dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS >> code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct >> usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under >> under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds > > This looks like a very different version of RDS 
from what was > in SVN a month ago. The SVN version didn't call alloc_dma_coherent(). Two comments: a) the SVN kernel IB code is not something you should look at; it's unmaintained. b) the SVN RDS code is the 2nd generation RDS code and is dead; RDS is now developed within the Oracle Linux group, and the code is what is found under the oss.oracle.com pointer above. >> Also I see in struct dma_mapping_ops also something called >> dma_map_simple not sure what it does and who can use it. > I don't see anything with "simple" in the name. > There is one call to dma_map_single() in the inline function > for ib_dma_map_single() if the ib_device.dma_ops is NULL. I was looking in include/asm-x86_64/dma-mapping.h and there was this map_simple prototype... anyway forget about it. Or. From mst at mellanox.co.il Wed Dec 6 00:03:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 10:03:17 +0200 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <000601c714e0$d955fef0$92cc180a@amr.corp.intel.com> References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <000601c714e0$d955fef0$92cc180a@amr.corp.intel.com> Message-ID: <20061206080317.GG26787@mellanox.co.il> > To handle the case > where the connection messages are lost, a new API is added that users > may invoke to force a connection into the established state. Just to clarify this point - what connection messages can be lost? E.g. if the passive side does not get an RTU for a while, it will retry the REP, won't it? Diagram 12.9.6 seems to indicate so: from REP Sent we should go to RTU timeout, Send REP and back to REP Sent. Is this implemented? -- MST From mst at mellanox.co.il Wed Dec 6 00:17:27 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Wed, 6 Dec 2006 10:17:27 +0200 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> Message-ID: <20061206081727.GH26787@mellanox.co.il> > The following set of patches expand the rdma_cm support to include > UDP port space, and expose the rdma_cm to userspace. Multicast > support has been removed from the patches until the ib_multicast > module can be further debugged. > > Adding in multicast support later will result in new APIs and an > ABI bump, but I do not anticipate multicast changing any of the > existing interfaces. (I'm also less confident that the multicast > ABIs are correct.) > > Without the multicast interfaces, I believe what's left is ready to > merge upstream. I agree. Further, limited UCMA testing done on a very similiar codebase in OFED 1.1 did not turn up any issues, and CMA updates address API issues we have seen with SDP. Acked-by: Michael S. Tsirkin -- MST From ogerlitz at voltaire.com Wed Dec 6 00:21:14 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 06 Dec 2006 10:21:14 +0200 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <20061206080317.GG26787@mellanox.co.il> References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <000601c714e0$d955fef0$92cc180a@amr.corp.intel.com> <20061206080317.GG26787@mellanox.co.il> Message-ID: <45767D7A.7040502@voltaire.com> Michael S. Tsirkin wrote: >> To handle the case >> where the connection messages are lost, a new API is added that users >> may invoke to force a connection into the established state. > > Just to clarify this point - what connecton messages can be lost? > E.g. if the passive side does not get an RTU for a while, it will > retry the REP, won't it? 
Diagram 12.9.6 seems to indicate so: > from REP Sent we should go to RTU timeout, Send REP and back to REP Sent. > Is this implemented? It handles the case where the first RX crosses the RTU which can happen when the RTU is lost but also without it being lost. Indeed the passive side would resend the REP when a timeout expires but the patch allows the app to force the connection establishment **now** (ie have the CMA move the RC QP to RTS) and not go into queuing of RX-es etc till the RTU is lost, it also handles the case where all the RTUs are lost. Or. From mst at mellanox.co.il Wed Dec 6 00:22:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 10:22:42 +0200 Subject: [openib-general] userspace git conversion status/cut over In-Reply-To: <20061130191717.GJ18978@sashak.voltaire.com> References: <1164897683.11808.129709.camel@hal.voltaire.com> <456F0AE3.4060209@ichips.intel.com> <20061130191717.GJ18978@sashak.voltaire.com> Message-ID: <20061206082242.GI26787@mellanox.co.il> > Other issue. There is /pub/scm/linux-2.6.18/.git tree, looks it was used > for git installation testing or so. > > Does somebody use it? Could this be (re)moved? No one seemed to care, and 2.6.19 is out anyway :) Let's kill it then. -- MST From mst at mellanox.co.il Wed Dec 6 00:28:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 10:28:57 +0200 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <45767D7A.7040502@voltaire.com> References: <45767D7A.7040502@voltaire.com> Message-ID: <20061206082857.GJ26787@mellanox.co.il> > Michael S. Tsirkin wrote: > >> To handle the case > >> where the connection messages are lost, a new API is added that users > >> may invoke to force a connection into the established state. > > > > Just to clarify this point - what connecton messages can be lost? > > E.g. 
if the passive side does not get an RTU for a while, it will > > retry the REP, won't it? Diagram 12.9.6 seems to indicate so: > > from REP Sent we should go to RTU timeout, Send REP and back to REP Sent. > > Is this implemented? > > It handles the case where the first RX crosses the RTU which can happen > when the RTU is lost but also without it being lost. > > Indeed the passive side would resend the REP when a timeout expires but > the patch allows the app to force the connection establishment **now** > (ie have the CMA move the RC QP to RTS) and not go into queuing of RX-es > etc till the RTU is lost, it also handles the case where all the RTUs > are lost. I think we all already agreed we need the rdma_established call, for reasons that you outline. So I am not arguing at all - I was just checking that REP re-sends are implemented. So, a slightly more exact description for the patch would be "to handle the case where a data packet bypasses an RTU". Is that right? -- MST From mst at mellanox.co.il Wed Dec 6 00:34:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 10:34:27 +0200 Subject: [openib-general] OFED 1.2 features update In-Reply-To: <4575D0A8.7080501@ichips.intel.com> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com> Message-ID: <20061206083427.GL26787@mellanox.co.il> > > BTW - where are those trees located? > > My trees are available from the staging.openfabrics.org/git site. I called the > kernel tree rdma-dev. Thanks, Sean! I gather the ucma bits are in rdma_ucm? -- MST From mst at mellanox.co.il Wed Dec 6 00:49:02 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 6 Dec 2006 10:49:02 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061206084902.GN26787@mellanox.co.il> > >The idea is to increase performance by increasing the MTU > >from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD. > >With this code, I'm able to get 800MByte/sec or more with netperf > >without options on a Mellanox 4x back-to-back DDR system. > > What about CPU utilization? Seems to be about the same (about 100% of a single CPU).

UD:
# /mswg/work/mst/netperf-2.4.2/src/netperf -H 11.4.3.69 -f M -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.69 (11.4.3.69) port 0 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       276.80   27.98    25.55    3.948   3.606

RC:
# /mswg/work/mst/netperf-2.4.2/src/netperf -H 11.4.3.69 -f M -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.69 (11.4.3.69) port 0 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       907.68   25.08    24.43    1.079   1.052

-- MST
From ogerlitz at voltaire.com Wed Dec 6 01:09:06 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 06 Dec 2006 11:09:06 +0200 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <20061206082857.GJ26787@mellanox.co.il> References: <45767D7A.7040502@voltaire.com> <20061206082857.GJ26787@mellanox.co.il> Message-ID: <457688B2.8040704@voltaire.com> Michael S. Tsirkin wrote: > I think we all already agreed we need the rdma_established call, > for reasons that you outline.
So I am not arguing at all - I was just > checking that REP re-sends are implemented. Yes, and its not "the rdma_established call" but "an rdma_established" call. Sean has changed the name to cm_notify and rdma_notify as it merges within the framework of other ULP to CMA/CM notifications eg those related to path migration. > So, a slightly more exact description for the patch would be > "to handle the case where a data packet bypasses an RTU". > Is that right? Yes. Or. From ogerlitz at voltaire.com Wed Dec 6 01:52:37 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 06 Dec 2006 11:52:37 +0200 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com> Message-ID: <457692E5.2050800@voltaire.com> Roland Dreier wrote: > > What's the status of this patchset? it would be somehow very usefull > > to have rdma cm user space support enablement in 2.6.20 and without > > the multicast code i don't see why not merging it. > > I would like to merge it, but I need to find time to read it over > carefully. Have you read this patch set over? Do you have any > comments about anything? + 1/5 is a small fix discussed over the list + 2/5 provides a functionality needed by CMA consumer and does not have any impact on anything below the CMA + 3/5 is a solution for the IB race of data crossing the RTU and is the outcome of a very long discussion over the list. The approach taken is very clean and easy to integrate for CM/CMA consumers. A similar patch was integrated into OFED 1.1 so it closes a hole where a passive side CM/CMA consumers wanting to handle this case easily were not able to do so with the kernel CM/CMA code, we must need it for 2.6.20 to close this gap. + 4/5 adds CMA "UD offload" support using SIDR REQ/REP to exchange the QP and Path information. 
I did not experience much with the patch other then running the librdmacm uddady test program but have reviewed it without having any special comments. + 5/5 is the CMA user space support. I only did a light review of it but my understanding is that Sean used the in kernel ib_ucm design/code as the base line for this driver so there should be no special issues here. This driver is long time missing in the kernel IB offer, as it enables using the user space rdma cm (librdmacm) which more and more becomes a must have in the IB package of today's distros - it better go in now. Actually i did most of my review and testing on the multicast code which is not in this patch set. I have provided feedback over the list which made its way into v2 of the patches and more feedback 1x1 to Sean during the sc06 devcon. Sean - as of the stability issues reported by Mellanox I understand you have decided not to push the multicast code for 2.6.20 and I see that the focus now is on finding the bug. Once this is solved I would like to provide more feedback before you publish v3 - does it makes sense? Or. From muli at il.ibm.com Wed Dec 6 01:52:55 2006 From: muli at il.ibm.com (Muli Ben-Yehuda) Date: Wed, 6 Dec 2006 11:52:55 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com> Message-ID: <20061206095255.GA6564@rhun.haifa.ibm.com> On Tue, Dec 05, 2006 at 11:09:15PM +0200, Or Gerlitz wrote: > Also I see in struct dma_mapping_ops also something called > dma_map_simple not sure what it does and who can use it. It's an internal implementation detail of how the GART IOMMU uses the DMA maping ops which doesn't belong in the dma_ops structure. 
Cheers, Muli From ogerlitz at voltaire.com Wed Dec 6 01:58:43 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 06 Dec 2006 11:58:43 +0200 Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update - RDMA CM etc In-Reply-To: <20061206083427.GL26787@mellanox.co.il> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com> <20061206083427.GL26787@mellanox.co.il> Message-ID: <45769453.3030509@voltaire.com> Michael S. Tsirkin wrote: >>> BTW - where are those trees located? >> My trees are available from the staging.openfabrics.org/git site. I called the >> kernel tree rdma-dev. > > Thanks, Sean! > I gather the ucma bits are in rdma_ucm? We must make some plan here. 1st decide on which kernel OFED 1.2 is based 2.6.19 or 2.6.20.rc1 2nd see if the rdma_ucm bits make their way into 2.6.20 3rd have Sean decide how he wants the multicast support to be integrated into OFED 1.2, my guess would be as a patch set over the ib_sa/ipoib/rdma_cm and rdma_ucm but its left for him to decide 4th have a release of librdmacm that supports RC/UD/UD mcast "offload" - Or. 
From eitan at mellanox.co.il Wed Dec 6 02:17:34 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 06 Dec 2006 12:17:34 +0200 Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch] Message-ID: <457698BE.10907@mellanox.co.il> Hi Hal The simulated regression caught this: The osm.mcfdbs have now the format: Switch 0x0002c90000000006 LID : Out Port(s) 0xC000 : 0x003 0x004 0x005 0x006 0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 :0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 :0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B :0xC01C :0xC01D :0xC01E :0xC01F : Which should probably just be: Switch 0x0002c90000000006 LID : Out Port(s) 0xC000 : 0x003 0x004 0x005 0x006 Actually switches that do not have any MCG entry will not be included in the dump file. The following patch fixes that. Eitan Signed-off-by: Eitan Zahavi Index: opensm/osm_mcast_mgr.c =================================================================== --- opensm/osm_mcast_mgr.c (revision 10188) +++ opensm/osm_mcast_mgr.c (working copy) @@ -1389,10 +1389,13 @@ mcast_mgr_dump_sw_routes( int16_t mlid_start_ho; uint8_t position = 0; int16_t block_num = 0; - boolean_t print_lid; + boolean_t first_mlid; + boolean_t first_port; const osm_node_t* p_node; uint16_t i, j; uint16_t mask_entry; + char sw_hdr[256]; + char mlid_hdr[32]; OSM_LOG_ENTER( p_mgr->p_log, mcast_mgr_dump_sw_routes ); @@ -1403,9 +1406,10 @@ mcast_mgr_dump_sw_routes( p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw ); - fprintf( file, "\nSwitch 0x%016" PRIx64 "\n" + sprintf( sw_hdr, "\nSwitch 0x%016" PRIx64 "\n" "LID : Out Port(s)\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); + first_mlid = TRUE; while ( block_num <= p_tbl->max_block_in_use ) { mlid_start_ho = (uint16_t)(block_num * IB_MCAST_BLOCK_SIZE); @@ -1413,8 +1417,8 @@ mcast_mgr_dump_sw_routes( { mlid_ho = mlid_start_ho + i; position = 0; 
- print_lid = FALSE; - fprintf( file, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); + first_port = TRUE; + sprintf( mlid_hdr, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); while ( position <= p_tbl->max_position ) { mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]); @@ -1423,17 +1427,27 @@ mcast_mgr_dump_sw_routes( position++; continue; } - print_lid = TRUE; for (j = 0 ; j < 16 ; j++) { - if ( (1 << j) & mask_entry ) - fprintf( file, " 0x%03X ", j+(position*16) ); + if ( (1 << j) & mask_entry ) { + if (first_mlid) + { + fprintf( file,"%s", sw_hdr ); + first_mlid = FALSE; + } + if (first_port) + { + fprintf( file,"%s", mlid_hdr ); + first_port = FALSE; + } + fprintf( file, " 0x%03X ", j+(position*16) ); + } } position++; } - if (print_lid) + if (first_port == FALSE) { - fprintf( file, "\n" ); + fprintf( file, "\n" ); } } block_num++; From mst at mellanox.co.il Wed Dec 6 02:17:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 6 Dec 2006 12:17:05 +0200 Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update - RDMA CM etc In-Reply-To: <45769453.3030509@voltaire.com> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com> <20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com> Message-ID: <20061206101705.GP26787@mellanox.co.il> > >>> BTW - where are those trees located? > >> My trees are available from the staging.openfabrics.org/git site. I called the > >> kernel tree rdma-dev. > > > > Thanks, Sean! > > I gather the ucma bits are in rdma_ucm? > > We must make some plan here. > > 1st decide on which kernel OFED 1.2 is based 2.6.19 or 2.6.20.rc1 1st is probably to fix the mcast bits so that they don't crash the machine. OFED will be based on whatever is merged by Linus by that time + any number of patches and out of kernel modules. 
> 2nd see if the rdma_ucm bits make their way into 2.6.20 Until that's closed we can keep stuff in patches, assuming its reasonably stable (as in - does not interfere with other work). > 3rd have Sean decide how he wants the multicast support to be integrated > into OFED 1.2, my guess would be as a patch set over the > ib_sa/ipoib/rdma_cm and rdma_ucm but its left for him to decide Yes. The idea is to have in OFED linus' tree + any number of additional files + any number of patches. The point of this is that merges from upstream must be seamless, and if they break something I know which patch to blame. Makefile conflicts I can handle so Makefile additions even in core can go in. > 4th have a release of librdmacm that supports RC/UD/UD mcast "offload" - Need to also think how whatever library OFED ships will work on current and future upstream kernels. I would like to see some plan that will ensure backward compatibility for tools that do not use multicast. Maybe the right thing is to split the multicast stuff in a separate library, or have a separate ABI version for multicast, I don't really know. -- MST From eitan at mellanox.co.il Wed Dec 6 02:21:22 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 06 Dec 2006 12:21:22 +0200 Subject: [openib-general] [PATCH] osm: OpenSM exits on PathRecord query with zero LID Message-ID: <457699A2.9070206@mellanox.co.il> Hi Hal, This is another catch from the nightly simulator based regression. Simple: if OpenSM gets a PathRecord that eventually maps into a port with zero LID (either SRC or DST) if just asserts (in debug mode) on getting the LFT. The following patch catches this error. 
EZ Signed-off-by: Eitan Zahavi Index: opensm/osm_sa_path_record.c =================================================================== --- opensm/osm_sa_path_record.c (revision 10188) +++ opensm/osm_sa_path_record.c (working copy) @@ -976,6 +976,22 @@ __osm_pr_rcv_get_port_pair_paths( &src_lid_max_ho ); } + if ( src_lid_min_ho == 0 ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_port_pair_paths: ERR 1F20:" + "Obtained zero source LID. No such LID possible.\n"); + goto Exit; + } + + if ( dest_lid_min_ho == 0 ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_pr_rcv_get_port_pair_paths: ERR 1F21:" + "Obtained zero destination LID. No such LID possible.\n"); + goto Exit; + } + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, From eitan at mellanox.co.il Wed Dec 6 02:23:24 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 06 Dec 2006 12:23:24 +0200 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down routing engines In-Reply-To: <1165347459.25587.78224.camel@hal.voltaire.com> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802302048-git-send-email-sashak@voltaire.com> <1165347459.25587.78224.camel@hal.voltaire.com> Message-ID: <45769A1C.2090406@mellanox.co.il> Hal Rosenstock wrote: > On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > >> This updates "file" and "updn" (up/down) routing engines which should >> work properly now with changed LFT setup mechanism. >> >> Signed-off-by: Sasha Khapyorsky >> > > Thanks. Applied. > Are these patches inserted into SVN or GIT ? 
Eitan > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Wed Dec 6 02:39:07 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 06 Dec 2006 12:39:07 +0200 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down routing engines In-Reply-To: <45769A1C.2090406@mellanox.co.il> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802302048-git-send-email-sashak@voltaire.com> <1165347459.25587.78224.camel@hal.voltaire.com> <45769A1C.2090406@mellanox.co.il> Message-ID: <45769DCB.7010105@mellanox.co.il> Eitan Zahavi wrote: > Hal Rosenstock wrote: > >> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: >> >> >>> This updates "file" and "updn" (up/down) routing engines which should >>> work properly now with changed LFT setup mechanism. >>> >>> Signed-off-by: Sasha Khapyorsky >>> >>> >> Thanks. Applied. 
>> >> > Are these patches inserted into SVN or GIT > Ignore this - just cloned GIT and it's there > Eitan > >> -- Hal From chevchenkovic at gmail.com Wed Dec 6 02:52:23 2006 From: chevchenkovic at gmail.com (Chevchenkovic Chevchenkovic) Date: Wed, 6 Dec 2006 16:22:23 +0530 Subject: [openib-general] Forwarding tables Message-ID: <1c16cdf90612060252t38f5ab5cn995c2c5140498005@mail.gmail.com> Hi, I would like to write my own forwarding table to be used by openSM. I hope some expert here can help me out with this. 1. How do I write the new table to a file? What format is used, and what commands are used when loading openSM? 2. Which part of the code should I modify to incorporate this change of linear forwarding tables in the code itself? Help would be very much appreciated.
Best Wishes, -Chev From halr at voltaire.com Wed Dec 6 03:11:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 06:11:47 -0500 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down routing engines In-Reply-To: <45769DCB.7010105@mellanox.co.il> References: <11645802043173-git-send-email-sashak@voltaire.com> <11645802302048-git-send-email-sashak@voltaire.com> <1165347459.25587.78224.camel@hal.voltaire.com> <45769A1C.2090406@mellanox.co.il> <45769DCB.7010105@mellanox.co.il> Message-ID: <1165403496.25587.119503.camel@hal.voltaire.com> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote: > Eitan Zahavi wrote: > > Hal Rosenstock wrote: > > > >> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > >> > >> > >>> This updates "file" and "updn" (up/down) routing engines which should > >>> work properly now with changed LFT setup mechanism. > >>> > >>> Signed-off-by: Sasha Khapyorsky > >>> > >>> > >> Thanks. Applied. > >> > >> > > Are these patches inserted into SVN or GIT > > > Ignore this - just cloned GIT and its there Were your latest regressions run against svn or git clone ? 
-- Hal > > Eitan > > > >> -- Hal From erezz at voltaire.com Wed Dec 6 03:29:27 2006 From: erezz at voltaire.com (Erez Zilber) Date: Wed, 06 Dec 2006 13:29:27 +0200 Subject: [openib-general] [PATCH] IB/iser: Remove unused "write-only" variables In-Reply-To: References: Message-ID: <4576A997.9030602@voltaire.com> Roland Dreier wrote: > Remove variables that are set but then never looked at in the iSER > initiator. These cleanups came from David Binderman's list of "set > but never used" warnings from icc. > > Signed-off-by: Roland Dreier > --- > Erez, does this look OK to merge?
> > drivers/infiniband/ulp/iser/iser_initiator.c | 4 ---- > drivers/infiniband/ulp/iser/iser_memory.c | 3 +-- > 2 files changed, 1 insertions(+), 6 deletions(-) > > diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c > index 9b3d79c..e73c87b 100644 > --- a/drivers/infiniband/ulp/iser/iser_initiator.c > +++ b/drivers/infiniband/ulp/iser/iser_initiator.c > @@ -487,10 +487,8 @@ int iser_send_control(struct iscsi_conn > struct iscsi_iser_conn *iser_conn = conn->dd_data; > struct iser_desc *mdesc = mtask->dd_data; > struct iser_dto *send_dto = NULL; > - unsigned int itt; > unsigned long data_seg_len; > int err = 0; > - unsigned char opcode; > struct iser_regd_buf *regd_buf; > struct iser_device *device; > > @@ -512,8 +510,6 @@ int iser_send_control(struct iscsi_conn > > iser_reg_single(device, send_dto->regd[0], DMA_TO_DEVICE); > > - itt = ntohl(mtask->hdr->itt); > - opcode = mtask->hdr->opcode & ISCSI_OPCODE_MASK; > data_seg_len = ntoh24(mtask->hdr->dlength); > > if (data_seg_len > 0) { > diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c > index 0606744..e5a1091 100644 > --- a/drivers/infiniband/ulp/iser/iser_memory.c > +++ b/drivers/infiniband/ulp/iser/iser_memory.c > @@ -234,7 +234,7 @@ static int iser_sg_to_page_vec(struct is > { > struct scatterlist *sg = (struct scatterlist *)data->buf; > dma_addr_t first_addr, last_addr, page; > - int start_aligned, end_aligned; > + int end_aligned; > unsigned int cur_page = 0; > unsigned long total_sz = 0; > int i; > @@ -248,7 +248,6 @@ static int iser_sg_to_page_vec(struct is > first_addr = sg_dma_address(&sg[i]); > last_addr = first_addr + sg_dma_len(&sg[i]); > > - start_aligned = !(first_addr & ~MASK_4K); > end_aligned = !(last_addr & ~MASK_4K); > > /* continue to collect page fragments till aligned or SG ends */ > I'm ok with that. Thanks. 
-- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Team Voltaire – _The Grid Backbone_ __ www.voltaire.com From ogerlitz at voltaire.com Wed Dec 6 03:33:07 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 06 Dec 2006 13:33:07 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1165359560.14800.210.camel@brick.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> <1165359560.14800.210.camel@brick.pathscale.com> Message-ID: <4576AA73.105@voltaire.com> Ralph Campbell wrote: > On Tue, 2006-12-05 at 23:21 +0200, Or Gerlitz wrote: >> On 12/5/06, Roland Dreier wrote: > I am not following what you two are saying. > The ib_dma_mapping_ops functions as implemented by ib_ipath, > are redefining dma_addr_t as a kernel virtual address. > When ib_dma_map_single() is called, this is a NOP. > When ib_dma_map_sg() is called, the dma_map_sg() replacement needs > to convert a struct page pointer into a kernel virtual address. > When CONFIG_HIGHMEM is defined, some pages may not be mapped > into the kernel virtual address space so the driver needs to > call kmap(). Since the driver can't use the struct scattergather > to store the kmap() result, a separate table needs to be used > so the value can be returned by ib_sg_dma_address(). Indeed. > Doing kmap_atomic() at the point where the kernel virtual > address is used is not practical since the driver is not > mapping dma_addr_t to struct page * although it is > possible to write it that way. 
It would mean that > ib_map_single() would then be more complex in that a > kernel virtual address would need to be converted to a > struct page *. Basically what Roland suggest is that you need to implement SW IOTLB mapping from dma_addr_t (possibly offset-ed) to kv addr. And do the actual kmap/unmap calls before/after you must touch the data. Is this impossible? Or. From halr at voltaire.com Wed Dec 6 03:33:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 06:33:57 -0500 Subject: [openib-general] Forwarding tables In-Reply-To: <1c16cdf90612060252t38f5ab5cn995c2c5140498005@mail.gmail.com> References: <1c16cdf90612060252t38f5ab5cn995c2c5140498005@mail.gmail.com> Message-ID: <1165404763.25587.120284.camel@hal.voltaire.com> Hi Chev, On Wed, 2006-12-06 at 05:52, Chevchenkovic Chevchenkovic wrote: > Hi, > I would like to write my own forwarding table to be used by openSM. > I hope some expert here would help me out in this. > 1. How do I write the new table in a file. What is the format used > and wht are the commands to be used while loading openS? Run dump_lfts.sh on a subnet to see the file format used for loading. > 2; Which part of the code should I modify so as to incorporate this > changing of linear forwarding tables in the code itself. None if you just want to load it from a file. See opensm man page. The options are: -R, --routing_engine This option chooses routing engine instead of Min Hop algorithm (default). Supported engines: updn, file -M, --lid_matrix_file This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables will be loaded. Also, see osm/doc/modular-routing.doc in the svn or git repository for userspace management. If you do want to write an algorithm, then there is some "intrusive" work to do. Is file based sufficient for now ? Will you be adding an additional routing algorithm ? Or do you just want to experiment for now ? -- Hal > Help would b very much apreciared. 
> Best Wishes,
> -Chev
-------------- next part --------------
Modular Routing Engine

A modular routing engine structure has been added to make it easy to "plug in" new routing modules.

Currently, only unicast callbacks are supported. Multicast can be added later.

One existing routing module is up-down "updn", which may be activated with the '-R updn' option (instead of the old '-u').

General usage is:

  $ opensm -R 'module-name'

There is also a trivial routing module which is able to load LFT tables from a dump file.

Main features:

- support for unicast LFTs only; support for multicast can be added later
- this will run after min hop matrix calculation
- this will load switch LFTs according to the path entries introduced in the dump file
- no additional checks will be performed (such as "is port connected", etc.)
- in case when fabric LIDs were changed this will try to reconstruct LFTs correctly if endport GUIDs are represented in the dump file (in order to disable this, GUIDs may be removed from the dump file or zeroed)

The dump file format is compatible with the output of the 'ibroute' util and for the whole fabric may be generated with a script like this:

  for sw_lid in `ibswitches | awk '{print $NF}'` ; do
      ibroute $sw_lid
  done > /path/to/dump_file

, or using DR paths:

  for sw_dr in `ibnetdiscover -v \
          | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
          | sed -e 's/\]\[/,/g' \
          | sort -u` ; do
      ibroute -D ${sw_dr}
  done > /path/to/dump_file

This script is dump_lfts.sh

In order to activate the new module use:

  opensm -R file -U /path/to/dump_file

If the dump_file is not found or is in error, the default routing algorithm is utilized.
The ability to dump switch lid matrices (aka min hops tables) to file and later to load these is also supported. The usage is similar to unicast forwarding tables loading from dump file (introduced by 'file' routing engine), but new lid matrix file name should be specified by -M or --lid_matrix_file option. For example:   opensm -R file -M ./opensm-lid-matrix.dump The dump file is named 'opensm-lid-matrix.dump' and will be generated in standard opensm dump directory (/var/log by default) when OSM_LOG_ROUTING logging flag is set. When routing engine 'file' is activated, but dump file is not specified or not cannot be open default lid matrix algorithm will be used. There is also a switch forwarding tables dumper which generates a file compatible with dump_lfts.sh output. This file can be used as input for forwarding tables loading by 'file' routing engine. Both or one of options -U and -M can be specified together with '-R file'. NOTE: ibroute has been updated (for switch management ports) to support this. Also, lmc was added to switch management ports. ibroute needs to be r7855 or later from the trunk. 
From ogerlitz at voltaire.com Wed Dec 6 03:35:29 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 13:35:29 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose
In-Reply-To:
References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
Message-ID: <4576AB01.9070206@voltaire.com>

Roland Dreier wrote:
> > A problem I see with the dma_addr_t being a cookie into a table of kv
> > addresses is that it's legal for a consumer to use dma_addr_t with an
> > **offset**. So she gets addr y from ib_dma_map_xxx and then uses y +
> > offset in the SGE provided to ibv_post_send/recv or to the fmr map
> > function.
>
> Yes, that is a little bit of an issue. But I think it just means the
> ipath driver needs to keep page tables exactly the way an IOTLB would
> -- ugly but not impossible to handle.

OK

> > I see now that I have managed to confuse myself b/c as Roland wrote
> > below and I have agreed we don't actually have the kv addr for an
> > unmapped page before the ipath driver maps it, i.e. when it attempts to
> > use the page... It becomes late here... am I inventing a non-existent
> > problem with the offset?
>
> The dma address doesn't have to be a kvaddr -- it is purely an address
> space defined by the low-level driver.

OK, you are right: the dma_addr_t returned by the ipath ib_map_xxx calls
would live in a virtual space defined by the ipath implementation, but it
has to be presented in the form of a dma_addr_t.

Or.
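
To make the cookie-plus-offset point in the exchange above concrete, here
is a minimal user-space sketch in C of a software IOTLB. It is not the
ipath driver's code: every name in it (sw_iotlb, sw_iotlb_map,
sw_iotlb_lookup, sw_iotlb_unmap, the 4 KB page size, the 64-slot table)
is made up for illustration, and a plain uint64_t stands in for
dma_addr_t. The assumption is that a mapping hands out an address whose
upper bits select a table slot and whose low bits are the offset in the
page, so that "y + offset" from a consumer still resolves to the right
byte, including across a page boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is not the ipath driver's code. */
#define SW_IOTLB_PAGE_SHIFT 12
#define SW_IOTLB_PAGE_SIZE  ((uint64_t)1 << SW_IOTLB_PAGE_SHIFT)
#define SW_IOTLB_SLOTS      64

typedef uint64_t sw_dma_addr_t;   /* stand-in for the kernel's dma_addr_t */

struct sw_iotlb {
    void *page[SW_IOTLB_SLOTS];   /* NULL marks a free slot */
};

/*
 * Map npages (non-NULL) pages into consecutive slots and return an
 * address that encodes (first slot + 1) in the upper bits and the
 * starting offset in the low bits. Returns 0 on failure, so 0 is never
 * a valid address.
 */
static sw_dma_addr_t sw_iotlb_map(struct sw_iotlb *tlb, void *const pages[],
                                  size_t npages, size_t offset)
{
    size_t start, i;

    if (npages == 0 || npages > SW_IOTLB_SLOTS || offset >= SW_IOTLB_PAGE_SIZE)
        return 0;

    /* First-fit search for npages consecutive free slots. */
    for (start = 0; start + npages <= SW_IOTLB_SLOTS; start++) {
        for (i = 0; i < npages && tlb->page[start + i] == NULL; i++)
            ;
        if (i == npages)
            break;
    }
    if (start + npages > SW_IOTLB_SLOTS)
        return 0;

    for (i = 0; i < npages; i++)
        tlb->page[start + i] = pages[i];

    return ((sw_dma_addr_t)(start + 1) << SW_IOTLB_PAGE_SHIFT) | offset;
}

/* Translate an address (possibly base + consumer offset) to a pointer. */
static void *sw_iotlb_lookup(const struct sw_iotlb *tlb, sw_dma_addr_t addr)
{
    uint64_t slot = addr >> SW_IOTLB_PAGE_SHIFT;
    uint64_t off  = addr & (SW_IOTLB_PAGE_SIZE - 1);

    if (slot == 0 || slot > SW_IOTLB_SLOTS || tlb->page[slot - 1] == NULL)
        return NULL;
    return (char *)tlb->page[slot - 1] + off;
}

/* Release the npages slots that start at the slot encoded in addr. */
static void sw_iotlb_unmap(struct sw_iotlb *tlb, sw_dma_addr_t addr,
                           size_t npages)
{
    uint64_t slot = addr >> SW_IOTLB_PAGE_SHIFT;
    size_t i;

    if (slot == 0)
        return;
    for (i = 0; i < npages && slot - 1 + i < SW_IOTLB_SLOTS; i++)
        tlb->page[slot - 1 + i] = NULL;
}
```

In a real driver the lookup step is presumably where the kmap/kunmap of
the page would go; the sketch simply stores ready-to-use pointers so that
the address arithmetic can be followed in isolation.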
From eitan at mellanox.co.il Wed Dec 6 03:35:21 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 6 Dec 2006 13:35:21 +0200 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/downrouting engines Message-ID: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com> Run against SVN. Will move to GIT today (hopefully - if I am able to git clone without password ...) Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Wednesday, December 06, 2006 1:12 PM > To: Eitan Zahavi > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and > up/downrouting engines > > On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote: > > Eitan Zahavi wrote: > > > Hal Rosenstock wrote: > > > > > >> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > > >> > > >> > > >>> This updates "file" and "updn" (up/down) routing engines which > > >>> should work properly now with changed LFT setup mechanism. > > >>> > > >>> Signed-off-by: Sasha Khapyorsky > > >>> > > >>> > > >> Thanks. Applied. > > >> > > >> > > > Are these patches inserted into SVN or GIT > > > > > Ignore this - just cloned GIT and its there > > Were your latest regressions run against svn or git clone ? 
> > -- Hal > > > > Eitan > > > > > >> -- Hal > > >> > > >> > > >> _______________________________________________ > > >> openib-general mailing list > > >> openib-general at openib.org > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> To unsubscribe, please visit > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > > > From ramachandra.kuchimanchi at qlogic.com Tue Dec 5 23:06:32 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi) Date: Wed, 6 Dec 2006 01:06:32 -0600 Subject: [openib-general] ib_send_cm_dreq() and cm_id doubt In-Reply-To: <4575D061.3010808@ichips.intel.com> References: <4575D061.3010808@ichips.intel.com> Message-ID: > > After sending a CM DREQ with ib_send_cm_dreq(), is it OK to destroy > > the cm_id without waiting for a DREP ? This is of course assuming > > that we are not really concerned if the DREQ reached the other end or not. > > Yes - you can even destroy the cm_id before calling ib_send_cm_dreq(), which > will result in sending a DREQ if the cm_id is still connected. > > - Sean Thanks for the info. Regards, Ram From halr at voltaire.com Wed Dec 6 03:57:53 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 06:57:53 -0500 Subject: [openib-general] [PATCH] osm: OpenSM exits on PathRecord query with zero LID In-Reply-To: <457699A2.9070206@mellanox.co.il> References: <457699A2.9070206@mellanox.co.il> Message-ID: <1165406233.25587.121329.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-12-06 at 05:21, Eitan Zahavi wrote: > Hi Hal, > > This is another catch from the nightly simulator based regression. 
> Simple: if OpenSM gets a PathRecord that eventually maps into a port
> with zero LID (either SRC or DST)
> it just asserts (in debug mode) on getting the LFT.
>
> The following patch catches this error.

Thanks. Applied (only to the management git repository).

A couple of related questions:
1. Is this needed as an OFED 1.1 patch ?
2. Is the same thing needed for SA MultiPathRecord ?

-- Hal

From eitan at mellanox.co.il Wed Dec 6 04:01:13 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 6 Dec 2006 14:01:13 +0200
Subject: [openib-general] [PATCH] osm: OpenSM exits on PathRecord query with zero LID
Message-ID: <6C2C79E72C305246B504CBA17B5500C96DF394@mtlexch01.mtl.com>

Hi Hal,

> 1. Is this needed as an OFED 1.1 patch ?

I would leave OFED 1.1 as it is for now. A wrong query can still crash
the SM, but I have not heard of such a case so far.

> 2. Is the same thing needed for SA MultiPathRecord ?

Probably yes.

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, December 06, 2006 1:58 PM
> To: Eitan Zahavi
> Cc: Sasha Khapyorsky; Yevgeny Kliteynik; OPENIB GENERAL
> Subject: Re: [PATCH] osm: OpenSM exits on PathRecord query with zero LID
>
> Hi Eitan,
>
> On Wed, 2006-12-06 at 05:21, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > This is another catch from the nightly simulator based regression.
> > Simple: if OpenSM gets a PathRecord that eventually maps into a port
> > with zero LID (either SRC or DST) it just asserts (in debug mode) on
> > getting the LFT.
> >
> > The following patch catches this error.
>
> Thanks. Applied (only to the management git repository).
>
> A couple of related questions:
> 1. Is this needed as an OFED 1.1 patch ?
> 2. Is the same thing needed for SA MultiPathRecord ?
>
> -- Hal

From eitan at mellanox.co.il Wed Dec 6 05:18:52 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 15:18:52 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch]
In-Reply-To: <457698BE.10907@mellanox.co.il>
References: <457698BE.10907@mellanox.co.il>
Message-ID: <4576C33C.7050204@mellanox.co.il>

Hi Hal,

Here is the same patch against GIT for your convenience.

Thanks

EZ

The simulated regression caught this:
The osm.mcfdbs have now the format:

Switch 0x0002c90000000006
LID : Out Port(s)
0xC000 : 0x003 0x004 0x005 0x006
0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 :0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 :0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B :0xC01C :0xC01D :0xC01E :0xC01F :

Which should probably just be:

Switch 0x0002c90000000006
LID : Out Port(s)
0xC000 : 0x003 0x004 0x005 0x006

Actually switches that do not have any MCG entry will not be included
in the dump file.

Signed-off-by: Eitan Zahavi

--- osm/opensm/osm_mcast_mgr.c 2006-12-06 12:39:13.018015000 +0200
+++ osm/opensm/osm_mcast_mgr.c 2006-12-06 12:06:29.602097000 +0200
@@ -1388,10 +1389,13 @@ mcast_mgr_dump_sw_routes(
   int16_t mlid_start_ho;
   uint8_t position = 0;
   int16_t block_num = 0;
-  boolean_t print_lid;
+  boolean_t first_mlid;
+  boolean_t first_port;
   const osm_node_t* p_node;
   uint16_t i, j;
   uint16_t mask_entry;
+  char sw_hdr[256];
+  char mlid_hdr[32];
 
   OSM_LOG_ENTER( p_mgr->p_log, mcast_mgr_dump_sw_routes );
 
@@ -1402,9 +1406,10 @@ mcast_mgr_dump_sw_routes(
 
   p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw );
 
-  fprintf( file, "\nSwitch 0x%016" PRIx64 "\n"
+  sprintf( sw_hdr, "\nSwitch 0x%016" PRIx64 "\n"
            "LID : Out Port(s)\n",
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
+           cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
+  first_mlid = TRUE;
   while ( block_num <= p_tbl->max_block_in_use )
   {
     mlid_start_ho = (uint16_t)(block_num * IB_MCAST_BLOCK_SIZE);
@@ -1412,8 +1417,8 @@ mcast_mgr_dump_sw_routes(
     {
       mlid_ho = mlid_start_ho + i;
       position = 0;
-      print_lid = FALSE;
-      fprintf( file, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
+      first_port = TRUE;
+      sprintf( mlid_hdr, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
       while ( position <= p_tbl->max_position )
       {
         mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]);
@@ -1422,17 +1427,27 @@ mcast_mgr_dump_sw_routes(
           position++;
           continue;
         }
-        print_lid = TRUE;
         for (j = 0 ; j < 16 ; j++)
         {
-          if ( (1 << j) & mask_entry )
-            fprintf( file, " 0x%03X ", j+(position*16) );
+          if ( (1 << j) & mask_entry ) {
+            if (first_mlid)
+            {
+              fprintf( file,"%s", sw_hdr );
+              first_mlid = FALSE;
+            }
+            if (first_port)
+            {
+              fprintf( file,"%s", mlid_hdr );
+              first_port = FALSE;
+            }
+            fprintf( file, " 0x%03X ", j+(position*16) );
+          }
         }
         position++;
       }
-      if (print_lid)
+      if (first_port == FALSE)
       {
-        fprintf( file, "\n" );
+        fprintf( file, "\n" );
       }
     }
     block_num++;

From eitan at mellanox.co.il Wed Dec 6 05:25:08 2006
From: eitan at mellanox.co.il
(Eitan Zahavi) Date: Wed, 06 Dec 2006 15:25:08 +0200 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/downrouting engines In-Reply-To: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com> Message-ID: <4576C4B4.9080608@mellanox.co.il> Hi Hal, I just run one iteration of the simulation regression against the git tree: The Multicast fails on the change of format of osm.mcfdbs The Stability flows failed on the change of subnet.lst to osm-subnet.lst ... doing another loop: run=0 cron=0 hour=14 OsmStress IS1-16.topo ... PASS LidMgr IS1-16.topo ... PASS LidMgr IS1-16.topo ... PASS LidMgr IS1-16.topo ... PASS LidMgr IS3-128.topo ... PASS Multicast IS1-16.topo ... FAIL (sleeping 10) Multicast IS1-16.topo ... FAIL (sleeping 10) Multicast IS1-16.topo ... FAIL (sleeping 10) Multicast IS3-128.topo ... FAIL (sleeping 10) Multicast IS3-loop.topo ... FAIL (sleeping 10) Stability IS1-16.topo ... FAIL (sleeping 10) Stability IS1-16.topo ... FAIL (sleeping 10) Stability IS1-16.topo ... FAIL (sleeping 10) Stability IS3-128.topo ... FAIL (sleeping 10) Stability IS3-loop.topo ... FAIL (sleeping 10) OsmTest IS1-16.topo ... PASS OsmTest IS1-16.topo ... PASS OsmTest IS1-16.topo ... PASS OsmTest IS3-128.topo ... PASS OsmTest IS3-loop.topo ... PASS Pkey IS1-16.topo ... PASS Pkey IS1-16.topo ... PASS Pkey IS1-16.topo ... PASS Pkey IS3-128.topo ... PASS OsmStress IS1-16.topo ... PASS OsmStress IS1-16.topo ... PASS OsmStress IS3-128.topo ... PASS Eitan Zahavi wrote: > Run against SVN. > Will move to GIT today (hopefully - if I am able to git clone without > password ...) > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > >> -----Original Message----- >> From: Hal Rosenstock [mailto:halr at voltaire.com] >> Sent: Wednesday, December 06, 2006 1:12 PM >> To: Eitan Zahavi >> Cc: openib-general at openib.org >> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and >> up/downrouting engines >> >> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote: >> >>> Eitan Zahavi wrote: >>> >>>> Hal Rosenstock wrote: >>>> >>>> >>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: >>>>> >>>>> >>>>> >>>>>> This updates "file" and "updn" (up/down) routing engines which >>>>>> should work properly now with changed LFT setup mechanism. >>>>>> >>>>>> Signed-off-by: Sasha Khapyorsky >>>>>> >>>>>> >>>>>> >>>>> Thanks. Applied. >>>>> >>>>> >>>>> >>>> Are these patches inserted into SVN or GIT >>>> >>>> >>> Ignore this - just cloned GIT and its there >>> >> Were your latest regressions run against svn or git clone ? >> >> -- Hal >> >> >>>> Eitan >>>> >>>> >>>>> -- Hal >>>>> >>>>> >>>>> _______________________________________________ >>>>> openib-general mailing list >>>>> openib-general at openib.org >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> >>>>> To unsubscribe, please visit >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> >>>>> >>>>> >>>> _______________________________________________ >>>> openib-general mailing list >>>> openib-general at openib.org >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Wed Dec 6 05:26:43 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin)
Date: Wed, 6 Dec 2006 15:26:43 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch]
In-Reply-To: <4576C33C.7050204@mellanox.co.il>
References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il>
Message-ID: <20061206132643.GR26787@mellanox.co.il>

>
> Actually switches that do not have any MCG entry will not be included
> in the dump file.
>
> Signed-off-by: Eitan Zahavi
>
> --- osm/opensm/osm_mcast_mgr.c 2006-12-06 12:39:13.018015000 +0200
> +++ osm/opensm/osm_mcast_mgr.c 2006-12-06 12:06:29.602097000 +0200

All, to make integrating patches easier, please try to actually use git diff
to generate patches, and put patches in the following format:

Subject: [PATCH anytext] short log
From: <>     <-------- optional author line if not same as person posting

Short explanation for commit log.

Signed-off-by: <>

---

arbitrary long explanation

patch

-- 
MST

From mst at mellanox.co.il Wed Dec 6 05:33:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 15:33:42 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/downrouting engines
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
Message-ID: <20061206133342.GS26787@mellanox.co.il>

> Run against SVN.
> Will move to GIT today (hopefully - if I am able to git clone without
> password ...)

Note that you do *not* need an ssh account just to clone a git tree.
That's why we are running git-daemon on staging.
-- 
MST

From halr at voltaire.com Wed Dec 6 05:40:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 08:40:44 -0500
Subject: [openib-general] [PATCH] opensm: switch lookups consolidation with osm_get_switch_by_guid()
In-Reply-To: <20061126233200.GA25110@sashak.voltaire.com>
References: <20061126233200.GA25110@sashak.voltaire.com>
Message-ID: <1165412405.25587.125366.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 18:32, Sasha Khapyorsky wrote:
> For switch object lookups, instead of code fragments repeated in
> many places like:
>
>   p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl;
>
>   p_sw = (osm_switch_t*)cl_qmap_get( p_sw_guid_tbl, node_guid );
>   if (p_sw == (osm_switch_t*)cl_qmap_end( p_sw_guid_tbl ) ) { ... }
>
> use already existing "centralized" osm_get_switch_by_guid() function.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From halr at voltaire.com Wed Dec 6 05:59:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 08:59:46 -0500
Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_inform.c: Removed unneeded memory clearing in osm_infr_construct
Message-ID: <1165413570.25587.126121.camel@hal.voltaire.com>

OpenSM/osm_inform.c: Removed unneeded memory clearing in
osm_infr_construct

Signed-off-by: Hal Rosenstock

diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index 92647ef..8d1a13a 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -70,7 +70,7 @@ void
 osm_infr_construct(
   IN osm_infr_t* const p_infr )
 {
-  memset( p_infr, 0, sizeof(osm_infr_t) );
+
 }
 
 /**********************************************************************

From halr at voltaire.com Wed Dec 6 06:07:53 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 09:07:53 -0500
Subject: [openib-general] [PATCHv2][MINOR] OpenSM/osm_inform.c: In osm_infr_new, remove unneeded call to osm_infr_construct
Message-ID: <1165414065.25587.126426.camel@hal.voltaire.com>

OpenSM/osm_inform.c: In osm_infr_new, remove unneeded call
to osm_infr_construct

This is safer in the long term than the previous patch to remove the
memset from osm_infr_construct.

Signed-off-by: Hal Rosenstock

diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index 92647ef..cd40e5d 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -110,7 +110,6 @@ osm_infr_new(
   p_infr = (osm_infr_t*)malloc( sizeof(osm_infr_t) );
   if( p_infr )
   {
-    osm_infr_construct( p_infr );
     osm_infr_init( p_infr, p_infr_rec );
   }
 

From halr at voltaire.com Wed Dec 6 07:08:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 10:08:19 -0500
Subject: [openib-general] [PATCH][MINOR} OpenSM/osm_sa_mcmember_record.c: Move some osm_log messages outside of holding lock
Message-ID: <1165417683.25587.128924.camel@hal.voltaire.com>

OpenSM/osm_sa_mcmember_record.c: Move some osm_log messages outside of
holding lock

Signed-off-by: Hal Rosenstock

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 4b06bab..31d1fb5 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1447,12 +1447,6 @@ __osm_mcmr_rcv_leave_mgrp(
         port_join_state & ~(p_recvd_mcmember_rec->scope_state & 0x0F);
       if (new_join_state)
       {
-        osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
-                 "__osm_mcmr_rcv_leave_mgrp: "
-                 "After update JoinState != 0. Updating from 0x%X to 0x%X\n",
-                 port_join_state,
-                 new_join_state
-                 );
         /* Just update the result JoinState */
         p_mcm_port->scope_state =
           new_join_state | (p_mcm_port->scope_state & 0xf0);
@@ -1460,6 +1454,13 @@ __osm_mcmr_rcv_leave_mgrp(
         mcmember_rec.scope_state = p_mcm_port->scope_state;
 
         CL_PLOCK_RELEASE( p_rcv->p_lock );
+
+        osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+                 "__osm_mcmr_rcv_leave_mgrp: "
+                 "After update JoinState != 0. Updating from 0x%X to 0x%X\n",
+                 port_join_state,
+                 new_join_state
+                 );
       }
       else
       {
@@ -1649,6 +1650,8 @@ __osm_mcmr_rcv_join_mgrp(
     }
     else
     {
+      CL_PLOCK_RELEASE( p_rcv->p_lock );
+
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_mcmr_rcv_join_mgrp: ERR 1B11: "
                "method = %s, "
@@ -1665,7 +1668,6 @@ __osm_mcmr_rcv_join_mgrp(
                cl_ntoh64( p_recvd_mcmember_rec->mgid.unicast.interface_id ),
                cl_ntoh64( portguid ) );
 
-      CL_PLOCK_RELEASE( p_rcv->p_lock );
       sa_status = IB_SA_MAD_STATUS_INSUF_COMPS;
       osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status );
       goto Exit;
@@ -1713,6 +1715,11 @@ __osm_mcmr_rcv_join_mgrp(
 
     if (!valid)
     {
+      /* since we might have created the new group we need to cleanup */
+      __cleanup_mgrp(p_rcv, mlid);
+
+      CL_PLOCK_RELEASE( p_rcv->p_lock );
+
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_mcmr_rcv_join_mgrp: ERR 1B12: "
                "__validate_more_comp_fields, __validate_port_caps, "
@@ -1720,11 +1727,6 @@ __osm_mcmr_rcv_join_mgrp(
                "sending IB_SA_MAD_STATUS_REQ_INVALID\n",
                cl_ntoh64( portguid ) );
 
-      /* since we might have created the new group we need to cleanup */
-      __cleanup_mgrp(p_rcv, mlid);
-
-      CL_PLOCK_RELEASE( p_rcv->p_lock );
-
       sa_status = IB_SA_MAD_STATUS_REQ_INVALID;
       osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status );
       goto Exit;
@@ -1746,13 +1748,13 @@ __osm_mcmr_rcv_join_mgrp(
                                &p_mcmr_port);
     if (!valid)
     {
+      CL_PLOCK_RELEASE( p_rcv->p_lock );
+
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_mcmr_rcv_join_mgrp: ERR 1B13: "
                "__validate_modify failed, "
                "sending IB_SA_MAD_STATUS_REQ_INVALID\n" );
 
-      CL_PLOCK_RELEASE( p_rcv->p_lock );
-
       sa_status = IB_SA_MAD_STATUS_REQ_INVALID;
       osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status );
       goto Exit;

From halr at voltaire.com Wed Dec 6 07:46:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 10:46:40 -0500
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/downrouting engines
In-Reply-To: <4576C4B4.9080608@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
<4576C4B4.9080608@mellanox.co.il> Message-ID: <1165419916.25587.130371.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-12-06 at 08:25, Eitan Zahavi wrote: > Hi Hal, > > I just run one iteration of the simulation regression against the git tree: > The Multicast fails on the change of format of osm.mcfdbs Is this change in OFED 1.1 too ? If so, can the validation be enhanced to handle the empty MLID case ? > The Stability flows failed on the change of subnet.lst to osm-subnet.lst ... Yes, this patch went out on the list on 11/29 and committed on 11/30. We had agreed this would be done after SC. Can the verification be changed to look for this file so this doesn't fail ? It also indicated that a similar change is needed to ibutils Has that been done ? -- Hal > doing another loop: run=0 cron=0 hour=14 > OsmStress IS1-16.topo ... PASS > LidMgr IS1-16.topo ... PASS > LidMgr IS1-16.topo ... PASS > LidMgr IS1-16.topo ... PASS > LidMgr IS3-128.topo ... PASS > Multicast IS1-16.topo ... FAIL (sleeping 10) > Multicast IS1-16.topo ... FAIL (sleeping 10) > Multicast IS1-16.topo ... FAIL (sleeping 10) > Multicast IS3-128.topo ... FAIL (sleeping 10) > Multicast IS3-loop.topo ... FAIL (sleeping 10) > Stability IS1-16.topo ... FAIL (sleeping 10) > Stability IS1-16.topo ... FAIL (sleeping 10) > Stability IS1-16.topo ... FAIL (sleeping 10) > Stability IS3-128.topo ... FAIL (sleeping 10) > Stability IS3-loop.topo ... FAIL (sleeping 10) > OsmTest IS1-16.topo ... PASS > OsmTest IS1-16.topo ... PASS > OsmTest IS1-16.topo ... PASS > OsmTest IS3-128.topo ... PASS > OsmTest IS3-loop.topo ... PASS > Pkey IS1-16.topo ... PASS > Pkey IS1-16.topo ... PASS > Pkey IS1-16.topo ... PASS > Pkey IS3-128.topo ... PASS > OsmStress IS1-16.topo ... PASS > OsmStress IS1-16.topo ... PASS > OsmStress IS3-128.topo ... PASS > > > Eitan Zahavi wrote: > > Run against SVN. > > Will move to GIT today (hopefully - if I am able to git clone without > > password ...) 
> >> > >> -- Hal > >> > >> > >>>> Eitan > >>>> > >>>> > >>>>> -- Hal > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> openib-general mailing list > >>>>> openib-general at openib.org > >>>>> http://openib.org/mailman/listinfo/openib-general > >>>>> > >>>>> To unsubscribe, please visit > >>>>> http://openib.org/mailman/listinfo/openib-general > >>>>> > >>>>> > >>>>> > >>>> _______________________________________________ > >>>> openib-general mailing list > >>>> openib-general at openib.org > >>>> http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> To unsubscribe, please visit > >>>> http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From sweitzen at cisco.com Wed Dec 6 08:43:47 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 6 Dec 2006 08:43:47 -0800 Subject: [openib-general] [PATCH] IPoIB CM Experimental support Message-ID: > d. Limitations > UDP multicast and UDP connections to IPoIB UD mode > currently don't work since we get packets that are too large to > send over a UD QP. > As a work around, one can now create separate interfaces > for use with CM and UD mode. You can't send UDP/multicast traffic at all between IPoIB CM and IPoIB UD? What about UDP/multicast between IPoIB CM hosts? 
Scott

From eitan at mellanox.co.il Wed Dec 6 09:42:47 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 19:42:47 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/downrouting engines
In-Reply-To: <1165419916.25587.130371.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com> <4576C4B4.9080608@mellanox.co.il> <1165419916.25587.130371.camel@hal.voltaire.com>
Message-ID: <45770117.2060306@mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Wed, 2006-12-06 at 08:25, Eitan Zahavi wrote:
>
>> Hi Hal,
>>
>> I just run one iteration of the simulation regression against the git tree:
>> The Multicast fails on the change of format of osm.mcfdbs
>>
>
> Is this change in OFED 1.1 too ? If so, can the validation be enhanced
> to handle the empty MLID case ?
>
The current (broken) format, where multiple MLIDs appear on one line, is
harder to manage. I will also need to change ibutils to generate the new
format. Whenever such a format changes, I have to chase it through
whatever utility out there breaks. I do not see any reason why it had to
change. I understand it was broken by the fix that eliminated the need for
opening the file and appending to it. Instead of modifying ibutils and the
simulator tests, I propose to fix it back to what it was, using the patch
I provided.
>
>> The Stability flows failed on the change of subnet.lst to osm-subnet.lst ...
>>
>
> Yes, this patch went out on the list on 11/29 and committed on 11/30.
> We had agreed this would be done after SC. Can the verification be
> changed to look for this file so this doesn't fail ?
>
Yes, this is a simple fix and it was already pushed into ibutils.
I missed the simulator tests and pushed that change today.
> It also indicated that a similar change is needed to ibutils
> Has that been done ?
>
Yes, ibutils has been modified to accommodate this change.
> -- Hal > > >> doing another loop: run=0 cron=0 hour=14 >> OsmStress IS1-16.topo ... PASS >> LidMgr IS1-16.topo ... PASS >> LidMgr IS1-16.topo ... PASS >> LidMgr IS1-16.topo ... PASS >> LidMgr IS3-128.topo ... PASS >> Multicast IS1-16.topo ... FAIL (sleeping 10) >> Multicast IS1-16.topo ... FAIL (sleeping 10) >> Multicast IS1-16.topo ... FAIL (sleeping 10) >> Multicast IS3-128.topo ... FAIL (sleeping 10) >> Multicast IS3-loop.topo ... FAIL (sleeping 10) >> Stability IS1-16.topo ... FAIL (sleeping 10) >> Stability IS1-16.topo ... FAIL (sleeping 10) >> Stability IS1-16.topo ... FAIL (sleeping 10) >> Stability IS3-128.topo ... FAIL (sleeping 10) >> Stability IS3-loop.topo ... FAIL (sleeping 10) >> OsmTest IS1-16.topo ... PASS >> OsmTest IS1-16.topo ... PASS >> OsmTest IS1-16.topo ... PASS >> OsmTest IS3-128.topo ... PASS >> OsmTest IS3-loop.topo ... PASS >> Pkey IS1-16.topo ... PASS >> Pkey IS1-16.topo ... PASS >> Pkey IS1-16.topo ... PASS >> Pkey IS3-128.topo ... PASS >> OsmStress IS1-16.topo ... PASS >> OsmStress IS1-16.topo ... PASS >> OsmStress IS3-128.topo ... PASS >> >> >> Eitan Zahavi wrote: >> >>> Run against SVN. >>> Will move to GIT today (hopefully - if I am able to git clone without >>> password ...) >>> >>> Eitan Zahavi >>> Senior Engineering Director, Software Architect >>> Mellanox Technologies LTD >>> Tel:+972-4-9097208 >>> Fax:+972-4-9593245 >>> P.O. 
Box 586 Yokneam 20692 ISRAEL >>> >>> >>> >>> >>>> -----Original Message----- >>>> From: Hal Rosenstock [mailto:halr at voltaire.com] >>>> Sent: Wednesday, December 06, 2006 1:12 PM >>>> To: Eitan Zahavi >>>> Cc: openib-general at openib.org >>>> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and >>>> up/downrouting engines >>>> >>>> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote: >>>> >>>> >>>>> Eitan Zahavi wrote: >>>>> >>>>> >>>>>> Hal Rosenstock wrote: >>>>>> >>>>>> >>>>>> >>>>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> This updates "file" and "updn" (up/down) routing engines which >>>>>>>> should work properly now with changed LFT setup mechanism. >>>>>>>> >>>>>>>> Signed-off-by: Sasha Khapyorsky >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Thanks. Applied. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> Are these patches inserted into SVN or GIT >>>>>> >>>>>> >>>>>> >>>>> Ignore this - just cloned GIT and its there >>>>> >>>>> >>>> Were your latest regressions run against svn or git clone ? 
>>>> >>>> -- Hal >>>> >>>> >>>> >>>>>> Eitan >>>>>> >>>>>> >>>>>> >>>>>>> -- Hal >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> openib-general mailing list >>>>>>> openib-general at openib.org >>>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>>> >>>>>>> To unsubscribe, please visit >>>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> openib-general mailing list >>>>>> openib-general at openib.org >>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>> >>>>>> To unsubscribe, please visit >>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>> >>>>>> >>>>>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Wed Dec 6 09:45:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 06 Dec 2006 09:45:00 -0800 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <20061206080317.GG26787@mellanox.co.il> References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <000601c714e0$d955fef0$92cc180a@amr.corp.intel.com> <20061206080317.GG26787@mellanox.co.il> Message-ID: <4577019C.7050900@ichips.intel.com> > Just to clarify this point - what connecton messages can be lost? > E.g. if the passive side does not get an RTU for a while, it will > retry the REP, won't it? 
Diagram 12.9.6 seems to indicate so:
> from REP Sent we should go to RTU timeout, Send REP and back to REP Sent.
> Is this implemented?

REP retries are already implemented in the ib_cm. The patch handles the case
where the RTU is repeatedly lost, but data is still received on the connection.

- Sean

From eitan at mellanox.co.il Wed Dec 6 10:02:01 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 20:02:01 +0200
Subject: [openib-general] osm: More simulation faiures on trunk
Message-ID: <45770599.7080005@mellanox.co.il>

Hi Hal,

Looks like the osm.fdbs file is now created with an "UNREACHABLE" mark when
opensm is invoked with the updn routing engine. I will be working on finding
what changed between OFED 1.1 and the trunk.

This is another cause for the failure of all osmMulticastRoutingTest and
osmStability test runs. Another one would be the change of the osm.mcfdbs
format, which is parsed by IBDM too.

Eitan

From elsen_david at yahoo.com Wed Dec 6 10:03:49 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 10:03:49 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <4570CAA8.5080806@cse.ohio-state.edu>
Message-ID: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>

Shaun / Steve,

To get past the "librdmacm.so: cannot open shared object file: No such file or
directory" error message, LD_RUN_PATH also needs to be set.

Anyway, now that I am able to run the mvapich2 0.9.8-Release, I am trying to
figure out how to run the various benchmark tests using this MPI tool.

Has anyone run the Pallas tool with the OSC MPI or OpenMPI? I also want to run
the OSC benchmark tests. Are any guidelines available for these, please?

Thanks,
David

Shaun Rowland wrote:
Steve Wise wrote:
> I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their
> dev team so maybe they can help.
>
> Steve.
>
>
>
> On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
>> Steve,
>>
>> I can run rping, rdma_lat etc on the Ammasso card but when I try to
>> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error.
>>
>> ./mpdboot -n 1
>> debug: starting
>> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
>> librdmacm.so: cannot open shared object file: No such file or
>> directory
>> running mpdallexit on ammasso1
>> LAUNCHED mpd on ammasso1 via
>> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
>> debug: mpd on ammasso1 on port 35352
>> RUNNING: mpd on ammasso1
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
>> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Hello David and Steve. We discussed this problem in detail on the
mvapich-discuss list recently. David, you indicated the following in your last
email about this to mvapich-discuss on 11/26/2006:

"For some reason, it is working in SuSE, and not working in Fedora."

Is this still the case? Were the libraries built specifically on the Fedora
Core 6 system, or are you using libraries that were built on SuSE? I assume
they were built on Fedora Core 6. Were you trying to run this as root or as a
regular user? I am not sure exactly how this might affect shared library
loading, but it is possible there is a difference.

In our previous discussion, your library path did indeed have a librdmacm.so
file, though it could not be loaded for an unknown reason. It is unclear to me
if this email thread indicates that you have tried to rebuild that and are
experiencing the same issue. Were you able to try running the test shared
library example I gave, and did it work? Did it work as the same user you are
trying to run MVAPICH as?

It seems clear this is a runtime loader problem on Fedora Core 6, or on your
particular configuration there. That is what cannot find the library.
It is similar to the libtest code I provided as an example:

[rowland at e14-oib libtest]$ ls
Makefile  test.c  test.h  test-program.c
[rowland at e14-oib libtest]$ make normal
gcc -c -fPIC test.c
gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o
ln -s libtest.so.1.0 libtest.so.1
ln -s libtest.so.1 libtest.so
gcc -c -o test-program.o test-program.c
gcc -o test-program test-program.o -L/home/7/rowland/libtest -ltest
[rowland at e14-oib libtest]$ ldd test-program
        libtest.so.1 => not found
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)
[rowland at e14-oib libtest]$ ./test-program
./test-program: error while loading shared libraries: libtest.so.1: cannot open shared object file: No such file or directory
[rowland at e14-oib libtest]$ export LD_LIBRARY_PATH=$PWD
[rowland at e14-oib libtest]$ ldd test-program
        libtest.so.1 => /home/7/rowland/libtest/libtest.so.1 (0x00002abbf9aee000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)
[rowland at e14-oib libtest]$ ./test-program
In shared library function...

In previous email your ldd output showed the library was being found:

Please see the output of ldd /usr/local/mvapich2/bin/mpdroot :

[root at ammasso1 ~]# ldd /usr/local/mvapich2/bin/mpdroot
        linux-gate.so.1 => (0xffffe000)
        librdmacm.so => /usr/local/lib/librdmacm.so (0xb7fec000)
        libibverbs.so.2 => /usr/local/lib/libibverbs.so.2 (0xb7fe5000)
        libibumad.so.1 => /usr/local/lib/libibumad.so.1 (0xb7fdc000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x0012a000)
        libc.so.6 => /lib/libc.so.6 (0x00ca7000)
        libsysfs.so.2 => /usr/lib/libsysfs.so.2 (0x00369000)
        libdl.so.2 => /lib/libdl.so.2 (0x00de6000)
        libibcommon.so.1 => /usr/local/lib/libibcommon.so.1 (0xb7fcb000)
        /lib/ld-linux.so.2 (0x002d8000)

But that path is different than the one you are quoting above.
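The difference between the two ldd runs in the libtest transcript comes down to whether any directory on the loader's search list contains the library. The following is a minimal C sketch of that one step, assuming only the LD_LIBRARY_PATH part of the search matters; the real runtime loader also consults RPATH/RUNPATH entries, the ld.so cache and the default directories, none of which is modelled here. The helper name find_in_search_path() is hypothetical and not part of any library.

```c
/*
 * Simplified model of the LD_LIBRARY_PATH step of the runtime loader's
 * library search. Assumption: only a colon-separated directory list is
 * searched. find_in_search_path() is a hypothetical helper for illustration.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Look for `name` in each directory of the colon-separated list `path`.
 * On success, store the full file name in `out` and return 1.
 * Return 0 (and an empty `out`) if `path` is NULL or no directory
 * contains a readable file called `name`.
 */
static int find_in_search_path(const char *path, const char *name,
                               char *out, size_t outlen)
{
    const char *dir = path;

    while (dir != NULL && *dir != '\0') {
        const char *end = strchr(dir, ':');
        size_t len = end ? (size_t)(end - dir) : strlen(dir);

        /* This sketch simply skips empty entries such as the one in "a::b". */
        if (len > 0) {
            int n = snprintf(out, outlen, "%.*s/%s", (int)len, dir, name);

            /* Accept the candidate only if it fit in `out` and is readable. */
            if (n > 0 && (size_t)n < outlen && access(out, R_OK) == 0)
                return 1;
        }
        dir = end ? end + 1 : NULL;
    }

    if (outlen > 0)
        out[0] = '\0';
    return 0;
}
```

Called as find_in_search_path(getenv("LD_LIBRARY_PATH"), "libtest.so.1", buf, sizeof buf), this would be expected to return 0 while the variable is unset and 1 once it names the libtest directory. Running it as the same user that starts mpdroot is the point of the exercise, since the environment may differ between users.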
Does an ldd on /root/0.9.8-RELEASE/bin/mpdroot find librdmacm.so too, as the
same user you are trying to run it as?

I have one more idea for you to try here. You can do the following:

export LD_DEBUG=all
/root/0.9.8-RELEASE/bin/mpdroot >&output
unset LD_DEBUG

Then take a look at the output file to see if there are any relevant error
messages. Don't forget to unset LD_DEBUG before doing anything else. Also, just
to be sure, if you run "file " what does it say? It should indicate that it is
a shared library, similar to:

[rowland at e14-oib libtest]$ file /usr/local/ofed/lib64/librdmacm.so*
/usr/local/ofed/lib64/librdmacm.so: symbolic link to `librdmacm.so.0.9.0'
/usr/local/ofed/lib64/librdmacm.so.0.9.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped

Unfortunately, we do not have any Fedora Core 6 systems to investigate this
problem on at this time, and I don't know anything about what might be there
that would cause a problem. As far as I know, there shouldn't be. However, it
seems there is some runtime issue on your Fedora Core 6 machine or with how
this is being run there. If it is in fact working on another distribution as
you indicated in your previous response to us, then that also strongly points
in this direction.

--
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mshefty at ichips.intel.com Wed Dec 6 09:51:03 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 06 Dec 2006 09:51:03 -0800 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: <457692E5.2050800@voltaire.com> References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com> <457692E5.2050800@voltaire.com> Message-ID: <45770307.5060101@ichips.intel.com> > Sean - as of the stability issues reported by Mellanox I understand you > have decided not to push the multicast code for 2.6.20 and I see that > the focus now is on finding the bug. Once this is solved I would like to > provide more feedback before you publish v3 - does it makes sense? The multicast code that I had has been added as a branch to my rdma-dev git tree that's available from the openfabrics server. A corresponding branch is in the librdmacm tree. I have not had time yet to update the multicast code based on the latest feedback. - Sean From shubbell at dbresearch.net Wed Dec 6 09:52:50 2006 From: shubbell at dbresearch.net (Sean Hubbell) Date: Wed, 06 Dec 2006 11:52:50 -0600 Subject: [openib-general] Multicast Group Routing Question Message-ID: <45770372.8010700@dbresearch.net> Hello, I was testing our code and noticed that when I send data using multicast over our ib0 interface, all of the infiniband switches route the data to each switch and each node instead of a node that has an application listening to the interface like Ethernet. Is this by design? 
Thanks in advance, Sean From rdreier at cisco.com Wed Dec 6 10:11:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 06 Dec 2006 10:11:31 -0800 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: <457692E5.2050800@voltaire.com> (Or Gerlitz's message of "Wed, 06 Dec 2006 11:52:37 +0200") References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com> <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com> <457692E5.2050800@voltaire.com> Message-ID: > + 5/5 is the CMA user space support. I only did a light review of it > but my understanding is that Sean used the in kernel ib_ucm > design/code as the base line for this driver so there should be no > special issues here. OK, I'll have to take a close look at this. ucm has known-broken object lifetime handling (probably oopsable from userspace) From halr at voltaire.com Wed Dec 6 10:14:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 13:14:10 -0500 Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/downrouting engines In-Reply-To: <45770117.2060306@mellanox.co.il> References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com> <4576C4B4.9080608@mellanox.co.il> <1165419916.25587.130371.camel@hal.voltaire.com> <45770117.2060306@mellanox.co.il> Message-ID: <1165428839.25587.136467.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-12-06 at 12:42, Eitan Zahavi wrote: > Hi Hal, > > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Wed, 2006-12-06 at 08:25, Eitan Zahavi wrote: > > > >> Hi Hal, > >> > >> I just run one iteration of the simulation regression against the git tree: > >> The Multicast fails on the change of format of osm.mcfdbs > >> > > > > Is this change in OFED 1.1 too ? If so, can the validation be enhanced > > to handle the empty MLID case ? > > > The current format (broken) where multiple MLIDs apear on one line is > harder to manage. Is it true for OFED 1.1 as well ? 
Also, I'm in the process of incorporating your change. > I will also need to change ibutils to generate the new format.Whenever > such a format change > I have to chase it through whatever utility is out there that breaks. > I do not see any reason why it had to change. I understand it was broken > by the fix that eliminate the need for > opening the file and appending to it. I'm not sure it was done by design or accident. Anyhow, the patches were out on the list for quite some time without comment. > Instead of modifying ibutils and the simulator tests I propose to fix it > back to what it was > using the patch I provided. I'm in the process of incorporating your patch. -- Hal > >> The Stability flows failed on the change of subnet.lst to osm-subnet.lst ... > >> > > > > Yes, this patch went out on the list on 11/29 and committed on 11/30. > > We had agreed this would be done after SC. Can the verification be > > changed to look for this file so this doesn't fail ? > > > Yes this is a simple fix and it was already pushed into ibutils. I > missed the simulator tests and pushed the change today. > > It also indicated that a similar change is needed to ibutils > > Has that been done ? > > > Yes ibutils modified to accommodate for this change. > > -- Hal > > > > > >> doing another loop: run=0 cron=0 hour=14 > >> OsmStress IS1-16.topo ... PASS > >> LidMgr IS1-16.topo ... PASS > >> LidMgr IS1-16.topo ... PASS > >> LidMgr IS1-16.topo ... PASS > >> LidMgr IS3-128.topo ... PASS > >> Multicast IS1-16.topo ... FAIL (sleeping 10) > >> Multicast IS1-16.topo ... FAIL (sleeping 10) > >> Multicast IS1-16.topo ... FAIL (sleeping 10) > >> Multicast IS3-128.topo ... FAIL (sleeping 10) > >> Multicast IS3-loop.topo ... FAIL (sleeping 10) > >> Stability IS1-16.topo ... FAIL (sleeping 10) > >> Stability IS1-16.topo ... FAIL (sleeping 10) > >> Stability IS1-16.topo ... FAIL (sleeping 10) > >> Stability IS3-128.topo ... FAIL (sleeping 10) > >> Stability IS3-loop.topo ... 
FAIL (sleeping 10) > >> OsmTest IS1-16.topo ... PASS > >> OsmTest IS1-16.topo ... PASS > >> OsmTest IS1-16.topo ... PASS > >> OsmTest IS3-128.topo ... PASS > >> OsmTest IS3-loop.topo ... PASS > >> Pkey IS1-16.topo ... PASS > >> Pkey IS1-16.topo ... PASS > >> Pkey IS1-16.topo ... PASS > >> Pkey IS3-128.topo ... PASS > >> OsmStress IS1-16.topo ... PASS > >> OsmStress IS1-16.topo ... PASS > >> OsmStress IS3-128.topo ... PASS > >> > >> > >> Eitan Zahavi wrote: > >> > >>> Run against SVN. > >>> Will move to GIT today (hopefully - if I am able to git clone without > >>> password ...) > >>> > >>> Eitan Zahavi > >>> Senior Engineering Director, Software Architect > >>> Mellanox Technologies LTD > >>> Tel:+972-4-9097208 > >>> Fax:+972-4-9593245 > >>> P.O. Box 586 Yokneam 20692 ISRAEL > >>> > >>> > >>> > >>> > >>>> -----Original Message----- > >>>> From: Hal Rosenstock [mailto:halr at voltaire.com] > >>>> Sent: Wednesday, December 06, 2006 1:12 PM > >>>> To: Eitan Zahavi > >>>> Cc: openib-general at openib.org > >>>> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and > >>>> up/downrouting engines > >>>> > >>>> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote: > >>>> > >>>> > >>>>> Eitan Zahavi wrote: > >>>>> > >>>>> > >>>>>> Hal Rosenstock wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> This updates "file" and "updn" (up/down) routing engines which > >>>>>>>> should work properly now with changed LFT setup mechanism. > >>>>>>>> > >>>>>>>> Signed-off-by: Sasha Khapyorsky > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> Thanks. Applied. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> Are these patches inserted into SVN or GIT > >>>>>> > >>>>>> > >>>>>> > >>>>> Ignore this - just cloned GIT and its there > >>>>> > >>>>> > >>>> Were your latest regressions run against svn or git clone ? 
> >>>> > >>>> -- Hal > >>>> > >>>> > >>>> > >>>>>> Eitan > >>>>>> > >>>>>> > >>>>>> > >>>>>>> -- Hal > >>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> openib-general mailing list > >>>>>>> openib-general at openib.org > >>>>>>> http://openib.org/mailman/listinfo/openib-general > >>>>>>> > >>>>>>> To unsubscribe, please visit > >>>>>>> http://openib.org/mailman/listinfo/openib-general > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> openib-general mailing list > >>>>>> openib-general at openib.org > >>>>>> http://openib.org/mailman/listinfo/openib-general > >>>>>> > >>>>>> To unsubscribe, please visit > >>>>>> http://openib.org/mailman/listinfo/openib-general > >>>>>> > >>>>>> > >>>>>> > >>> _______________________________________________ > >>> openib-general mailing list > >>> openib-general at openib.org > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >>> > >>> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From ralph.campbell at qlogic.com Wed Dec 6 10:16:34 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 06 Dec 2006 10:16:34 -0800 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <4576AA73.105@voltaire.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> 
 <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
 <1165359560.14800.210.camel@brick.pathscale.com>
 <4576AA73.105@voltaire.com>
Message-ID: <1165428994.14800.229.camel@brick.pathscale.com>

On Wed, 2006-12-06 at 13:33 +0200, Or Gerlitz wrote:
> Ralph Campbell wrote:
> > On Tue, 2006-12-05 at 23:21 +0200, Or Gerlitz wrote:
> >> On 12/5/06, Roland Dreier wrote:
> > I am not following what you two are saying.
>
> > The ib_dma_mapping_ops functions as implemented by ib_ipath,
> > are redefining dma_addr_t as a kernel virtual address.
> > When ib_dma_map_single() is called, this is a NOP.
> > When ib_dma_map_sg() is called, the dma_map_sg() replacement needs
> > to convert a struct page pointer into a kernel virtual address.
> > When CONFIG_HIGHMEM is defined, some pages may not be mapped
> > into the kernel virtual address space so the driver needs to
> > call kmap(). Since the driver can't use the struct scattergather
> > to store the kmap() result, a separate table needs to be used
> > so the value can be returned by ib_sg_dma_address().
>
> Indeed.
>
> > Doing kmap_atomic() at the point where the kernel virtual
> > address is used is not practical since the driver is not
> > mapping dma_addr_t to struct page * although it is
> > possible to write it that way. It would mean that
> > ib_map_single() would then be more complex in that a
> > kernel virtual address would need to be converted to a
> > struct page *.
>
> Basically what Roland suggests is that you need to implement SW IOTLB
> mapping from dma_addr_t (possibly offset-ed) to kv addr. And do the
> actual kmap/unmap calls before/after you must touch the data.
>
> Is this impossible?
>
> Or.

It is not impossible, just inefficient. Why add a mapping table when it isn't
needed? If I needed to implement HIGHMEM support, I would probably make
"dma_addr_t" be a physical memory address, convert to PFN, find the struct page
pointer, and call kmap_atomic() or page_address().
Why go through all that in the worst-case CPU path when doing the conversion to
a kernel virtual address outside the critical path is feasible?

From halr at voltaire.com Wed Dec 6 10:27:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 13:27:08 -0500
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <45770372.8010700@dbresearch.net>
References: <45770372.8010700@dbresearch.net>
Message-ID: <1165429589.25587.136986.camel@hal.voltaire.com>

Hi Sean,

On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
> Hello,
>
> I was testing our code and noticed that when I send data using
> multicast over our ib0 interface, all of the infiniband switches route
> the data to each switch and each node instead of a node that has an
> application listening to the interface like Ethernet. Is this by design?

Where the data is routed depends on which multicast group is being used and
which end nodes have registered for that group.

-- Hal

> Thanks in advance,
>
> Sean

From swise at opengridcomputing.com Wed Dec 6 10:27:32 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 06 Dec 2006 12:27:32 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
References: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
Message-ID: <1165429652.25183.16.camel@stevo-desktop>

On Wed, 2006-12-06 at 10:03 -0800, david elsen wrote:
> Shaun / Steve,
>
> To pass the "librdmacm.so: cannot open shared object file: No such file or
> directory" error message, LD_RUN_PATH also needs to be set.
>
> Anyway, after I am able to run the mvapich2 0.9.8-Release, I am trying
> to figure out how to run the various benchmark tests using this MPI
> tool.
>
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI. I also
> want to run the OSC benchmark tests. Any guideline available for
> these please?
> Thanks,
> David

I've run the IMB benchmarks (aka Pallas) on mvapich2 0.9.8 over iWARP. The
mvapich2 user guide explains how to start up the mpd daemons and use mpiexec.
It's fairly straightforward. You need ssh or rsh access and you need to set up
a few files. Then pull down IMB and build it. To run the two-node IMB-MPI1
tests, you do something like this:

$ mpdboot -n 2
$ mpiexec -n 2 /IMB-MPI1

This will run the entire MPI1 suite.

Steve.

From mshefty at ichips.intel.com Wed Dec 6 10:23:31 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 06 Dec 2006 10:23:31 -0800
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update - RDMA CM etc
In-Reply-To: <20061206101705.GP26787@mellanox.co.il>
References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com>
 <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com>
 <20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com>
 <20061206101705.GP26787@mellanox.co.il>
Message-ID: <45770AA3.2040505@ichips.intel.com>

>>>I gather the ucma bits are in rdma_ucm?

Yes. Basically, I reworked changes that were in svn into separate branches
based off of 2.6.19.

> 1st is probably to fix the mcast bits so that they don't crash the machine.
> OFED will be based on whatever is merged by Linus by that time + any number of patches
> and out of kernel modules.

Even if the kernel multicast support could make it into 2.6.20, I won't have
the multicast changes to the rdma_cm done by then.
>>3rd have Sean decide how he wants the multicast support to be integrated >>into OFED 1.2, my guess would be as a patch set over the >>ib_sa/ipoib/rdma_cm and rdma_ucm but its left for him to decide Does OFED want the multicast support in 1.2? > Maybe the right thing is to split the multicast stuff in a separate library, > or have a separate ABI version for multicast, I don't really know. My anticipation is that the multicast support will bump the ABI, but will allow backwards compatibility. The break from librdmacm ABI 2 to ABI 3 is a result of changing the event reporting. - Sean From ralphc at pathscale.com Wed Dec 6 10:31:59 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Wed, 06 Dec 2006 10:31:59 -0800 Subject: [openib-general] [PATCH v3 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose Message-ID: <1165429919.14800.238.camel@brick.pathscale.com> This version of the patch adds ib_dma_alloc_coherent() and ib_dma_free_coherent() to the list of wrapped DMA functions. The earlier V2 patches are the same since this addition doesn't affect them. The QLogic InfiniPath HCAs use programmed I/O instead of HW DMA. This patch allows a verbs device driver to interpose on DMA mapping function calls in order to avoid relying on bus_to_virt() and phys_to_virt() to undo the mappings created by dma_map_single(), dma_map_sg(), etc. 
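The interposing scheme described above boils down to an optional table of function pointers that is consulted first, with the generic DMA call as the fallback. Here is a minimal user-space C sketch of that dispatch, assuming a toy device model; every toy_* name is hypothetical and stands in for, rather than reproduces, the kernel's ib_device, ib_dma_mapping_ops and dma_map_single().

```c
/*
 * User-space model of "driver may interpose on DMA mapping calls".
 * All toy_* names are hypothetical; this is not kernel code.
 */
#include <stddef.h>
#include <stdint.h>

struct toy_device;

/* Optional per-driver overrides, in the spirit of a dma_ops table. */
struct toy_dma_ops {
    uint64_t (*map_single)(struct toy_device *dev, void *ptr, size_t size);
};

struct toy_device {
    const struct toy_dma_ops *dma_ops; /* NULL means: use the default path */
    uint64_t bus_offset;               /* stand-in for a bus address translation */
};

/* Default path: models a generic mapping that yields a bus address. */
static uint64_t toy_default_map_single(struct toy_device *dev,
                                       void *ptr, size_t size)
{
    (void)size;
    return (uint64_t)(uintptr_t)ptr + dev->bus_offset;
}

/*
 * Override for a programmed-I/O device: the "DMA address" is simply the
 * CPU virtual address, so the driver never has to undo a bus mapping.
 */
static uint64_t toy_pio_map_single(struct toy_device *dev,
                                   void *ptr, size_t size)
{
    (void)dev;
    (void)size;
    return (uint64_t)(uintptr_t)ptr;
}

static const struct toy_dma_ops toy_pio_ops = {
    .map_single = toy_pio_map_single,
};

/* Wrapper used by callers: the override if present, else the default. */
static uint64_t toy_dma_map_single(struct toy_device *dev,
                                   void *ptr, size_t size)
{
    return dev->dma_ops ? dev->dma_ops->map_single(dev, ptr, size)
                        : toy_default_map_single(dev, ptr, size);
}
```

Callers only ever use the wrapper, so a device that leaves dma_ops NULL behaves exactly as before, while a PIO device gets back an address it can dereference directly.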
From: Ralph Campbell diff -r c76ed2f1387b include/rdma/ib_verbs.h --- a/include/rdma/ib_verbs.h Wed Nov 29 13:28:14 2006 +0800 +++ b/include/rdma/ib_verbs.h Tue Dec 05 15:50:07 2006 -0800 @@ -43,6 +43,8 @@ #include #include +#include +#include #include #include @@ -846,6 +848,49 @@ struct ib_cache { struct ib_pkey_cache **pkey_cache; struct ib_gid_cache **gid_cache; u8 *lmc_cache; +}; + +struct ib_dma_mapping_ops { + int (*mapping_error)(struct ib_device *dev, + u64 dma_addr); + u64 (*map_single)(struct ib_device *dev, + void *ptr, size_t size, + enum dma_data_direction direction); + void (*unmap_single)(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction); + u64 (*map_page)(struct ib_device *dev, + struct page *page, unsigned long offset, + size_t size, + enum dma_data_direction direction); + void (*unmap_page)(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction); + int (*map_sg)(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction); + void (*unmap_sg)(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction); + u64 (*dma_address)(struct ib_device *dev, + struct scatterlist *sg); + unsigned int (*dma_len)(struct ib_device *dev, + struct scatterlist *sg); + void (*sync_single_for_cpu)(struct ib_device *dev, + u64 dma_handle, + size_t size, + enum dma_data_direction dir); + void (*sync_single_for_device)(struct ib_device *dev, + u64 dma_handle, + size_t size, + enum dma_data_direction dir); + void *(*alloc_coherent)(struct ib_device *dev, + size_t size, + u64 *dma_handle, + gfp_t flag); + void (*free_coherent)(struct ib_device *dev, + size_t size, void *cpu_addr, + u64 dma_handle); }; struct iw_cm_verbs; @@ -992,6 +1037,8 @@ struct ib_device { struct ib_mad *in_mad, struct ib_mad *out_mad); + struct ib_dma_mapping_ops *dma_ops; + struct module *owner; struct class_device class_dev; struct kobject ports_parent; @@ -1395,8 
+1442,214 @@ static inline int ib_req_ncomp_notif(str * usable for DMA. * @pd: The protection domain associated with the memory region. * @mr_access_flags: Specifies the memory access rights. + * + * Note that the ib_dma_*() functions defined below must be used + * to create/destroy addresses used with the Lkey or Rkey returned + * by ib_get_dma_mr(). */ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_dma_mapping_error - check a DMA addr for error + * @dev: The device for which the dma_addr was created + * @dma_addr: The DMA address to check + */ +static inline int ib_dma_mapping_error(struct ib_device *dev, u64 dma_addr) +{ + return dev->dma_ops ? + dev->dma_ops->mapping_error(dev, dma_addr) : + dma_mapping_error(dma_addr); +} + +/** + * ib_dma_map_single - Map a kernel virtual address to DMA address + * @dev: The device for which the dma_addr is to be created + * @cpu_addr: The kernel virtual address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline u64 ib_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dev->dma_ops ? + dev->dma_ops->map_single(dev, cpu_addr, size, direction) : + dma_map_single(dev->dma_device, cpu_addr, size, direction); +} + +/** + * ib_dma_unmap_single - Destroy a mapping created by ib_dma_map_single() + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline void ib_dma_unmap_single(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + dev->dma_ops ? 
+ dev->dma_ops->unmap_single(dev, addr, size, direction) : + dma_unmap_single(dev->dma_device, addr, size, direction); +} + +/** + * ib_dma_map_page - Map a physical page to DMA address + * @dev: The device for which the dma_addr is to be created + * @page: The page to be mapped + * @offset: The offset within the page + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline u64 ib_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction) +{ + return dev->dma_ops ? + dev->dma_ops->map_page(dev, page, offset, size, direction) : + dma_map_page(dev->dma_device, page, offset, size, direction); +} + +/** + * ib_dma_unmap_page - Destroy a mapping created by ib_dma_map_page() + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline void ib_dma_unmap_page(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + dev->dma_ops ? + dev->dma_ops->unmap_page(dev, addr, size, direction) : + dma_unmap_page(dev->dma_device, addr, size, direction); +} + +/** + * ib_dma_map_sg - Map a scatter/gather list to DMA addresses + * @dev: The device for which the DMA addresses are to be created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +static inline int ib_dma_map_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + return dev->dma_ops ? 
+ dev->dma_ops->map_sg(dev, sg, nents, direction) : + dma_map_sg(dev->dma_device, sg, nents, direction); +} + +/** + * ib_dma_unmap_sg - Unmap a scatter/gather list of DMA addresses + * @dev: The device for which the DMA addresses were created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +static inline void ib_dma_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + dev->dma_ops ? + dev->dma_ops->unmap_sg(dev, sg, nents, direction) : + dma_unmap_sg(dev->dma_device, sg, nents, direction); +} + +/** + * ib_sg_dma_address - Return the DMA address from a scatter/gather entry + * @dev: The device for which the DMA addresses were created + * @sg: The scatter/gather entry + */ +static inline u64 ib_sg_dma_address(struct ib_device *dev, + struct scatterlist *sg) +{ + return dev->dma_ops ? + dev->dma_ops->dma_address(dev, sg) : sg_dma_address(sg); +} + +/** + * ib_sg_dma_len - Return the DMA length from a scatter/gather entry + * @dev: The device for which the DMA addresses were created + * @sg: The scatter/gather entry + */ +static inline unsigned int ib_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg) +{ + return dev->dma_ops ? + dev->dma_ops->dma_len(dev, sg) : sg_dma_len(sg); +} + +/** + * ib_dma_sync_single_for_cpu - Prepare DMA region to be accessed by CPU + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static inline void ib_dma_sync_single_for_cpu(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ + dev->dma_ops ? 
+ dev->dma_ops->sync_single_for_cpu(dev, addr, size, dir) : + dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); +} + +/** + * ib_dma_sync_single_for_device - Prepare DMA region to be accessed by device + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static inline void ib_dma_sync_single_for_device(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ + dev->dma_ops ? + dev->dma_ops->sync_single_for_device(dev, addr, size, dir) : + dma_sync_single_for_device(dev->dma_device, addr, size, dir); +} + +/** + * ib_dma_alloc_coherent - Allocate memory and map it for DMA + * @dev: The device for which the DMA address is requested + * @size: The size of the region to allocate in bytes + * @dma_handle: A pointer for returning the DMA address of the region + * @flag: memory allocator flags + */ +static inline void *ib_dma_alloc_coherent(struct ib_device *dev, + size_t size, + u64 *dma_handle, + gfp_t flag) +{ + return dev->dma_ops ? + dev->dma_ops->alloc_coherent(dev, size, dma_handle, flag) : + dma_alloc_coherent(dev->dma_device, size, dma_handle, flag); +} + +/** + * ib_dma_free_coherent - Free memory allocated by ib_dma_alloc_coherent() + * @dev: The device for which the DMA addresses were allocated + * @size: The size of the region + * @cpu_addr: the address returned by ib_dma_alloc_coherent() + * @dma_handle: the DMA address returned by ib_dma_alloc_coherent() + */ +static inline void ib_dma_free_coherent(struct ib_device *dev, + size_t size, void *cpu_addr, + u64 dma_handle) +{ + dev->dma_ops ? 
+ dev->dma_ops->free_coherent(dev, size, cpu_addr, dma_handle) : + dma_free_coherent(dev->dma_device, size, cpu_addr, dma_handle); +} /** * ib_reg_phys_mr - Prepares a virtually addressed memory region for use diff -r c76ed2f1387b drivers/infiniband/core/mad.c --- a/drivers/infiniband/core/mad.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/core/mad.c Wed Nov 29 13:54:36 2006 -0800 @@ -999,16 +999,16 @@ int ib_send_mad(struct ib_mad_send_wr_pr mad_agent = mad_send_wr->send_buf.mad_agent; sge = mad_send_wr->sg_list; - sge[0].addr = dma_map_single(mad_agent->device->dma_device, - mad_send_wr->send_buf.mad, - sge[0].length, - DMA_TO_DEVICE); + sge[0].addr = ib_dma_map_single(mad_agent->device, + mad_send_wr->send_buf.mad, + sge[0].length, + DMA_TO_DEVICE); pci_unmap_addr_set(mad_send_wr, header_mapping, sge[0].addr); - sge[1].addr = dma_map_single(mad_agent->device->dma_device, - ib_get_payload(mad_send_wr), - sge[1].length, - DMA_TO_DEVICE); + sge[1].addr = ib_dma_map_single(mad_agent->device, + ib_get_payload(mad_send_wr), + sge[1].length, + DMA_TO_DEVICE); pci_unmap_addr_set(mad_send_wr, payload_mapping, sge[1].addr); spin_lock_irqsave(&qp_info->send_queue.lock, flags); @@ -1027,12 +1027,14 @@ int ib_send_mad(struct ib_mad_send_wr_pr } spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); if (ret) { - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, header_mapping), - sge[0].length, DMA_TO_DEVICE); - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, payload_mapping), - sge[1].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_agent->device, + pci_unmap_addr(mad_send_wr, + header_mapping), + sge[0].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_agent->device, + pci_unmap_addr(mad_send_wr, + payload_mapping), + sge[1].length, DMA_TO_DEVICE); } return ret; } @@ -1851,11 +1853,11 @@ static void ib_mad_recv_done_handler(str mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, 
mad_list); recv = container_of(mad_priv_hdr, struct ib_mad_private, header); - dma_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - DMA_FROM_DEVICE); + ib_dma_unmap_single(port_priv->device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); /* Setup MAD receive work completion from "normal" work completion */ recv->header.wc = *wc; @@ -2081,12 +2083,12 @@ static void ib_mad_send_done_handler(str qp_info = send_queue->qp_info; retry: - dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, header_mapping), - mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); - dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, payload_mapping), - mad_send_wr->sg_list[1].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_send_wr->send_buf.mad_agent->device, + pci_unmap_addr(mad_send_wr, header_mapping), + mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_send_wr->send_buf.mad_agent->device, + pci_unmap_addr(mad_send_wr, payload_mapping), + mad_send_wr->sg_list[1].length, DMA_TO_DEVICE); queued_send_wr = NULL; spin_lock_irqsave(&send_queue->lock, flags); list_del(&mad_list->list); @@ -2527,12 +2529,11 @@ static int ib_mad_post_receive_mads(stru break; } } - sg_list.addr = dma_map_single(qp_info->port_priv-> - device->dma_device, - &mad_priv->grh, - sizeof *mad_priv - - sizeof mad_priv->header, - DMA_FROM_DEVICE); + sg_list.addr = ib_dma_map_single(qp_info->port_priv->device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; mad_priv->header.mad_list.mad_queue = recv_queue; @@ -2548,12 +2549,12 @@ static int 
ib_mad_post_receive_mads(stru list_del(&mad_priv->header.mad_list.list); recv_queue->count--; spin_unlock_irqrestore(&recv_queue->lock, flags); - dma_unmap_single(qp_info->port_priv->device->dma_device, - pci_unmap_addr(&mad_priv->header, - mapping), - sizeof *mad_priv - - sizeof mad_priv->header, - DMA_FROM_DEVICE); + ib_dma_unmap_single(qp_info->port_priv->device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); break; @@ -2585,11 +2586,11 @@ static void cleanup_recv_queue(struct ib /* Remove from posted receive MAD list */ list_del(&mad_list->list); - dma_unmap_single(qp_info->port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - DMA_FROM_DEVICE); + ib_dma_unmap_single(qp_info->port_priv->device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, recv); } diff -r c76ed2f1387b drivers/infiniband/core/uverbs_mem.c --- a/drivers/infiniband/core/uverbs_mem.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/core/uverbs_mem.c Wed Nov 29 13:54:36 2006 -0800 @@ -52,8 +52,8 @@ static void __ib_umem_release(struct ib_ int i; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { - dma_unmap_sg(dev->dma_device, chunk->page_list, - chunk->nents, DMA_BIDIRECTIONAL); + ib_dma_unmap_sg(dev, chunk->page_list, + chunk->nents, DMA_BIDIRECTIONAL); for (i = 0; i < chunk->nents; ++i) { if (umem->writable && dirty) set_page_dirty_lock(chunk->page_list[i].page); @@ -136,10 +136,10 @@ int ib_umem_get(struct ib_device *dev, s chunk->page_list[i].length = PAGE_SIZE; } - chunk->nmap = dma_map_sg(dev->dma_device, - &chunk->page_list[0], - chunk->nents, - DMA_BIDIRECTIONAL); + chunk->nmap = ib_dma_map_sg(dev, 
+ &chunk->page_list[0], + chunk->nents, + DMA_BIDIRECTIONAL); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) put_page(chunk->page_list[i].page); From ralph.campbell at qlogic.com Wed Dec 6 10:35:56 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 06 Dec 2006 10:35:56 -0800 Subject: [openib-general] [PATCH v3 2/7] IB/ipath - Implement new verbs DMA mapping functions Message-ID: <1165430156.14800.243.camel@brick.pathscale.com> This version of the patch adds support for ib_dma_alloc_coherent() and ib_dma_free_coherent(). It also fixes the bug Or found in ipath_sync_single_for_cpu() and ipath_sync_single_for_device(). This patch implements the interposing DMA mapping functions to allow support for IOMMUs and remove the dependence on phys_to_virt(). From: Ralph Campbell diff -r c76ed2f1387b drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/Makefile Wed Nov 29 13:54:36 2006 -0800 @@ -6,6 +6,7 @@ ib_ipath-y := \ ib_ipath-y := \ ipath_cq.o \ ipath_diag.o \ + ipath_dma.o \ ipath_driver.o \ ipath_eeprom.o \ ipath_file_ops.o \ diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Wed Nov 29 13:54:36 2006 -0800 @@ -134,7 +134,7 @@ int ipath_lkey_ok(struct ipath_qp *qp, s */ if (sge->lkey == 0) { isge->mr = NULL; - isge->vaddr = bus_to_virt(sge->addr); + isge->vaddr = (void *) sge->addr; isge->length = sge->length; isge->sge_length = sge->length; ret = 1; @@ -202,12 +202,12 @@ int ipath_rkey_ok(struct ipath_qp *qp, s int ret; /* - * We use RKEY == zero for physical addresses - * (see ipath_get_dma_mr). + * We use RKEY == zero for kernel virtual addresses + * (see ipath_get_dma_mr and ipath_dma.c). 
*/ if (rkey == 0) { sge->mr = NULL; - sge->vaddr = phys_to_virt(vaddr); + sge->vaddr = (void *) vaddr; sge->length = len; sge->sge_length = len; ss->sg_list = NULL; diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_mr.c --- a/drivers/infiniband/hw/ipath/ipath_mr.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_mr.c Wed Nov 29 13:54:37 2006 -0800 @@ -54,6 +54,8 @@ static inline struct ipath_fmr *to_ifmr( * @acc: access flags * * Returns the memory region on success, otherwise returns an errno. + * Note that all DMA addresses should be created via the + * struct ib_dma_mapping_ops functions (see ipath_dma.c). */ struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc) { @@ -149,8 +151,7 @@ struct ib_mr *ipath_reg_phys_mr(struct i m = 0; n = 0; for (i = 0; i < num_phys_buf; i++) { - mr->mr.map[m]->segs[n].vaddr = - phys_to_virt(buffer_list[i].addr); + mr->mr.map[m]->segs[n].vaddr = (void *) buffer_list[i].addr; mr->mr.map[m]->segs[n].length = buffer_list[i].size; mr->mr.length += buffer_list[i].size; n++; @@ -347,7 +348,7 @@ int ipath_map_phys_fmr(struct ib_fmr *ib n = 0; ps = 1 << fmr->page_shift; for (i = 0; i < list_len; i++) { - fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]); + fmr->mr.map[m]->segs[n].vaddr = (void *) page_list[i]; fmr->mr.map[m]->segs[n].length = ps; if (++n == IPATH_SEGSZ) { m++; diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Nov 29 13:54:37 2006 -0800 @@ -1599,6 +1599,7 @@ int ipath_register_ib_device(struct ipat dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; dev->mmap = ipath_mmap; + dev->dma_ops = &ipath_dma_mapping_ops; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s", init_utsname()->nodename); diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h 
Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Wed Nov 29 13:54:37 2006 -0800 @@ -812,4 +812,6 @@ extern unsigned int ib_ipath_max_srq_wrs extern const u32 ib_ipath_rnr_table[]; +extern struct ib_dma_mapping_ops ipath_dma_mapping_ops; + #endif /* IPATH_VERBS_H */ diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Tue Dec 05 16:04:53 2006 -0800 @@ -0,0 +1,262 @@ +/* + * Copyright (c) 2006 QLogic, Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include "ipath_verbs.h" + +#define BAD_DMA_ADDRESS ((u64) 0) + +/** + * ipath_dma_mapping_error - check a DMA address for error + * @dev: The device for which the dma_addr was created + * @dma_addr: The DMA address to check + */ +static int ipath_mapping_error(struct ib_device *dev, u64 dma_addr) +{ + return dma_addr == BAD_DMA_ADDRESS; +} + +/** + * ipath_dma_map_single - Map a kernel virtual address to DMA address + * @dev: The device for which the dma_addr is to be created + * @cpu_addr: The kernel virtual address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static u64 ipath_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); + return (u64) cpu_addr; +} + +/** + * ipath_dma_unmap_single - Destroy a mapping created by ipath_dma_map_single() + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static void ipath_dma_unmap_single(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); +} + +/** + * ipath_dma_map_page - Map a physical page to DMA address + * @dev: The device for which the dma_addr is to be created + * @page: The page to be mapped + * @offset: The offset within the page + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static u64 ipath_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction) +{ + u64 addr; + + BUG_ON(!valid_dma_direction(direction)); + + if (offset + size > PAGE_SIZE) { + addr = BAD_DMA_ADDRESS; + goto done; + } + + addr = (u64) page_address(page); + if (addr) + addr += offset; + /* TODO: handle highmem pages */ + +done: + return addr; +} + +/** + * 
ipath_dma_unmap_page - Destroy a mapping created by ipath_dma_map_page() + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static void ipath_dma_unmap_page(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); +} + +/** + * ipath_map_sg - Map a scatter/gather list to DMA addresses + * @dev: The device for which the DMA addresses are to be created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + u64 addr; + int i; + int ret = nents; + + BUG_ON(!valid_dma_direction(direction)); + + for (i = 0; i < nents; i++) { + addr = (u64) page_address(sg[i].page); + /* TODO: handle highmem pages */ + if (!addr) { + ret = 0; + break; + } + } + return ret; +} + +/** + * ipath_unmap_sg - Unmap a scatter/gather list of DMA addresses + * @dev: The device for which the DMA addresses were created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +static void ipath_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); +} + +/** + * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry + * @dev: The device for which the DMA addresses were created + * @sg: The scatter/gather entry + */ +static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg) +{ + return (u64) page_address(sg->page); +} + +/** + * ipath_sg_dma_len - Return the DMA length from a scatter/gather entry + * @dev: The device for which the DMA addresses were created + * @sg: The scatter/gather entry 
+ */ +static unsigned int ipath_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg) +{ + return sg->length; +} + +/** + * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static void ipath_sync_single_for_cpu(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ +} + +/** + * ipath_sync_single_for_device - Prepare DMA region to be accessed by device + * @dev: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static void ipath_sync_single_for_device(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ +} + +/** + * ipath_dma_alloc_coherent - Allocate memory and map it for DMA + * @dev: The device for which the DMA address is requested + * @size: The size of the region to allocate in bytes + * @dma_handle: A pointer for returning the DMA address of the region + * @flag: memory allocator flags + */ +static void *ipath_dma_alloc_coherent(struct ib_device *dev, size_t size, + u64 *dma_handle, gfp_t flag) +{ + struct page *p; + void *addr = NULL; + + p = alloc_pages(flag, get_order(size)); + if (p) + addr = page_address(p); + if (dma_handle) + *dma_handle = (u64) addr; + return addr; +} + +/** + * ipath_dma_free_coherent - Free memory allocated by ib_dma_alloc_coherent() + * @dev: The device for which the DMA addresses were allocated + * @size: The size of the region + * @cpu_addr: the address returned by ib_dma_alloc_coherent() + * @dma_handle: the DMA address returned by ib_dma_alloc_coherent() + */ +static void ipath_dma_free_coherent(struct ib_device *dev, size_t size, + void *cpu_addr, dma_addr_t dma_handle) +{ + free_pages((unsigned long) cpu_addr, get_order(size)); +} + +struct ib_dma_mapping_ops 
ipath_dma_mapping_ops = {
+	ipath_mapping_error,
+	ipath_dma_map_single,
+	ipath_dma_unmap_single,
+	ipath_dma_map_page,
+	ipath_dma_unmap_page,
+	ipath_map_sg,
+	ipath_unmap_sg,
+	ipath_sg_dma_address,
+	ipath_sg_dma_len,
+	ipath_sync_single_for_cpu,
+	ipath_sync_single_for_device,
+	ipath_dma_alloc_coherent,
+	ipath_dma_free_coherent
+};

From shubbell at dbresearch.net Wed Dec 6 10:48:47 2006
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Wed, 06 Dec 2006 12:48:47 -0600
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <1165429589.25587.136986.camel@hal.voltaire.com>
References: <45770372.8010700@dbresearch.net> <1165429589.25587.136986.camel@hal.voltaire.com>
Message-ID: <4577108F.9080308@dbresearch.net>

Hal Rosenstock wrote:
> Hi Sean,
>
> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
>
>> Hello,
>>
>> I was testing our code and noticed that when I send data using
>> multicast over our ib0 interface, all of the infiniband switches route
>> the data to each switch and each node instead of a node that has an
>> application listening to the interface like Ethernet. Is this by design?
>>
>
> It depends on what multicast group is being used and which end nodes
> have registered for that group as to where the data is routed.
>
> -- Hal
>
Hey Hal,

The multicast group I am sending data to is 224.10.10.x (not 224.0.0.x) and I have no clients / nodes listening but the data is still being sent. I am using wwtop from warewulf to view the network load for each node. Does this make sense?

Sean

From xma at us.ibm.com Wed Dec 6 11:06:31 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 6 Dec 2006 11:06:31 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

Hi, Roland,

We have found missing interrupts in the ehca driver's non-scaling code. We are testing the patch now. I will let you know as soon as we pass the test.
Does your patch use netif_receive_skb() or netif_rx_ni() in the IPoIB receive path? I haven't looked at your most recent git tree yet. If it's netif_rx_ni(), that's wrong: NAPI should avoid the IP backlog queue.

As we discussed before, I suggested using return (unlikely(missed_event) && netif_rx_reschedule()) instead of going back to polling the CQ again and again. ehca delivers packets very quickly: according to my debug output, I could see up to 58 missed_events between notify_cq and netif_rx_reschedule() before exiting the NAPI poll.

Sorry for blocking your NAPI patch for so long. Are you still planning to use NAPI as the default or as a configuration option? As Michael pointed out, in some situations (like heavy load), NAPI might not be a good choice.

Thanks
Shirley Ma

From elsen_david at yahoo.com Wed Dec 6 11:17:53 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 11:17:53 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165429652.25183.16.camel@stevo-desktop>
Message-ID: <742095.47380.qm@web58014.mail.re3.yahoo.com>

Steve,

Thanks a lot for the reply.

I could run the cpi from the example directory.

But I see some error message when trying to run the IMB-MPI1. I am using 219297_IMB_2.3. Which version are you using?

David

Steve Wise wrote:
On Wed, 2006-12-06 at 10:03 -0800, david elsen wrote:
> Shaun / Steve,
>
> To pass the "librdmacm.so: cannot open shared object file: No such
> file or directory" error message, LD_RUN_PATH also needs to be set.
>
> Anyway, after I am able to run the mvapich2 0.9.8-Release, I am trying
> to figure out how to run the various benchmark tests using this MPI
> tool.
>
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI. I also
> want to run the OSC benchmark tests. Any guideline available for
> these please?
> Thanks,
> David

I've run IMB benchmarks (aka pallas) on mvapich2 0.9.8 over iwarp.
The mvapich2 user guide explains how to start up mpd daemons and use mpiexec. It's fairly straightforward. You need ssh or rsh access and you need to set up a few files. Then pull down IMB and build it. To run 2-node IMB-MPI1 tests, you do something like this:

$ mpdboot -n 2
$ mpiexec -n 2 /IMB-MPI1

This will run the entire MPI1 suite.

Steve.

From swise at opengridcomputing.com Wed Dec 6 11:23:19 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 06 Dec 2006 13:23:19 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <742095.47380.qm@web58014.mail.re3.yahoo.com>
References: <742095.47380.qm@web58014.mail.re3.yahoo.com>
Message-ID: <1165432999.25183.26.camel@stevo-desktop>

On Wed, 2006-12-06 at 11:17 -0800, david elsen wrote:
> Steve,
>
> Thanks a lot for the reply.
>
> I could run the cpi from the example directory.
>
> But I see some error message when trying to run the IMB-MPI1. I am
> using 219297_IMB_2.3. Which version are you using?

I'm running the same release.

Steve.

From rowland at cse.ohio-state.edu Wed Dec 6 11:16:46 2006
From: rowland at cse.ohio-state.edu (Shaun Rowland)
Date: Wed, 06 Dec 2006 14:16:46 -0500
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
References: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
Message-ID: <4577171E.6060205@cse.ohio-state.edu>

david elsen wrote:
> Shaun / Steve,
>
> To pass the "librdmacm.so: cannot open shared object file: No such file or
> directory" error message, LD_RUN_PATH also needs to be set.
>
> Anyway, after I am able to run the mvapich2 0.9.8-Release, I am trying
> to figure out how to run the various benchmark tests using this MPI tool.
>
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI.
> I also want to run the OSC benchmark tests. Any guideline available for
> these please?
> Thanks,

We run these tests. For IMB (Pallas), you can look at the doc/ReadMe_IMB.txt in the source to see more details. For the OSU benchmarks, you can simply build them with mpicc and run them on 2 nodes or 1 node.

-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/

From elsen_david at yahoo.com Wed Dec 6 11:30:52 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 11:30:52 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165432999.25183.26.camel@stevo-desktop>
Message-ID: <957376.73440.qm@web58010.mail.re3.yahoo.com>

Steve,

Somehow I get the following error message:

[0] Abort: [] Got completion with error 5, vendor code=a, dest rank=1 at line 479 in file ibv_channel_manager.c
[1] Abort: ibv_post_recv err with 22 at line 1420 in file rdma_iba_priv.c
rank 1 in job 1 ammasso1_50414 caused collective abort of all ranks
exit status of rank 1: killed by signal 9

For details, please see the following:

[root at ammasso1 0.9.8-RELEASE]# vi /etc/hosts
[root at ammasso1 0.9.8-RELEASE]# cd bin
[root at ammasso1 bin]# ./mpdboot -n 2
debug: starting
mpdroot: perror msg: Connection refused
running mpdallexit on ammasso1
LAUNCHED mpd on ammasso1 via
debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
debug: mpd on ammasso1 on port 50414
RUNNING: mpd on ammasso1
debug: info for running mpd: {'ncpus': 1, 'list_port': 50414, 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on ammasso2 via ammasso1
debug: launch cmd= ssh -x -n ammasso2.
'/root/0.9.8-RELEASE/bin/mpd.py -h ammasso1 -p 50414 --ncpus=1 -e -d'
root at ammasso2.'s password:
debug: mpd on ammasso2 on port 59327
RUNNING: mpd on ammasso2
debug: info for running mpd: {'entry_port': 50414, 'ncpus': 1, 'list_port': 59327, 'pid': 2997, 'host': 'ammasso2., 'entry_host': 'ammasso1', 'ifhn': ''}
[root at ammasso1 bin]# ./mpiexec -n 2 /root/IMB_2.3/src/IMB-MPI1
secretword=
#---------------------------------------------------
# Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
#---------------------------------------------------
# Date : Wed Dec 6 13:25:59 2006
# Machine : i686
# System : Linux
# Release : 2.6.17.13
# Version : #1 SMP Wed Nov 8 17:34:14 PST 2006
#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier
recv desc error, 128
[0] Abort: [] Got completion with error 5, vendor code=a, dest rank=1 at line 479 in file ibv_channel_manager.c
[1] Abort: ibv_post_recv err with 22 at line 1420 in file rdma_iba_priv.c
rank 1 in job 1 ammasso1_50414 caused collective abort of all ranks
exit status of rank 1: killed by signal 9

David

Steve Wise wrote:
On Wed, 2006-12-06 at 11:17 -0800, david elsen wrote:
> Steve,
>
> Thanks a lot for the reply.
>
> I could run the cpi from the example directory.
>
> But I see some error message when trying to run the IMB-MPI1. I am
> using 219297_IMB_2.3. Which version are you using?

I'm running the same release.

Steve.
From elsen_david at yahoo.com Wed Dec 6 11:40:55 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 11:40:55 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <4577171E.6060205@cse.ohio-state.edu>
Message-ID: <340888.34199.qm@web58004.mail.re3.yahoo.com>

Shaun,

I tried this and am getting some error messages:

Please see following:
[root at ammasso2 osu_benchmarks]# mpicc osu_latency.c
[root at ammasso2 osu_benchmarks]# ls
a.out osu_bw.c osu_latency.c osu_put_bw.c
osu_acc_latency.c osu_get_bw.c osu_latency_mt.c osu_put_latency.c
osu_bibw.c osu_get_latency.c osu_put_bibw.c
[root at ammasso2 osu_benchmarks]# ./a.out
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_0 key=0 :
system msg for write_line failure : Bad file descriptor
[unset]: got unexpected response to get :cmd=get kvsname=singinit_kvs_0 key=0 :
[0] Abort: PMI Lookup name failed at line 519 in file rdma_cm.c

I get a similar error message for all the tests:

[root at ammasso2 pt2pt]# /root/0.9.8-RELEASE/test/mpi/pt2pt/pingping
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_0 key=0 :
system msg for write_line failure : Bad file descriptor
[unset]: got unexpected response to get :cmd=get kvsname=singinit_kvs_0 key=0 :
[0] Abort: PMI Lookup name failed at line 519 in file rdma_cm.c
[root at ammasso2 pt2pt]# /root/0.9.8-RELEASE/test/mpi/pt2pt/bsend
bash: /root/0.9.8-RELEASE/test/mpi/pt2pt/bsend: No such file or directory
[root at ammasso2 pt2pt]# /root/0.9.8-RELEASE/test/mpi/pt2pt/bsend1
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_0 key=0 :
system msg for write_line failure : Bad file descriptor
[unset]: got unexpected response to get :cmd=get kvsname=singinit_kvs_0 key=0 :
[0] Abort: PMI Lookup name failed at line 519 in file rdma_cm.c
[root at ammasso2 pt2pt]#

David

Shaun Rowland wrote:
david elsen wrote:
> Shaun / Steve,
>
> To pass the "librdmacm.so: cannot open shared object file: No such file
> or directory" error message, LD_RUN_PATH also needs to be set.
>
> Anyway, after I am able to run the mvapich2 0.9.8-Release, I am trying
> to figure out how to run the various benchmark tests using this MPI tool.
>
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI. I also want
> to run the OSC benchmark tests. Any guideline available for these please?
> Thanks,

We run these tests. For IMB (Pallas), you can look at the doc/ReadMe_IMB.txt in the source to see more details. For the OSU benchmarks, you can simply build them with mpicc and run them on 2 nodes or 1 node.

-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/

From steve.apo at googlemail.com Wed Dec 6 11:45:54 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Wed, 6 Dec 2006 19:45:54 +0000
Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could be wrong?
In-Reply-To: <2cfcf21e0612051128k59f32e99u42cd7e761063786f@mail.gmail.com>
References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com> <4575B9CB.5070507@ichips.intel.com> <2cfcf21e0612051128k59f32e99u42cd7e761063786f@mail.gmail.com>
Message-ID: <2cfcf21e0612061145i346b99e8n9074218547947aec@mail.gmail.com>

Hi Sean,

Thanks for the tip. I wasn't setting the QP type properly. Fixed now.

Cheers,
Steve.

On 05/12/06, Steven Wooding wrote:
>
> Hi Sean,
>
> Yeah, in my second post I said that errno was EINVAL just after the
> ib_cm_send_req() call, which I assume was from the write() call. Or did you
> mean something else?
>
> Steve.
>
> On 05/12/06, Sean Hefty wrote:
> >
> > > In my application I keep getting -1 returned by a call to
> > > ib_cm_send_req() function.
> > > The cmpost example application works fine, so
> > > I can rule out system set-up issues.
> >
> > This is probably an error being returned from the kernel. Does errno
> > give any more insight?
> >
> > - Sean
> >

From halr at voltaire.com Wed Dec 6 11:52:07 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 14:52:07 -0500
Subject: [openib-general] osm: More simulation faiures on trunk
In-Reply-To: <45770599.7080005@mellanox.co.il>
References: <45770599.7080005@mellanox.co.il>
Message-ID: <1165434717.25587.140668.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-12-06 at 13:02, Eitan Zahavi wrote:
> Hi Hal,
>
> Looks like the osm.fdbs file is now created with "UNREACHABLE" mark when
> opensm
> is invoked with updn routing engine.

Are you referring to certain LIDs being UNREACHABLE like this:

LID : Port : Hops : Optimal
0x0001 : UNREACHABLE
0x0002 : UNREACHABLE
0x0003 : 000 : 00 : yes
0x0004 : 001 : 02 : yes
0x0005 : 003 : 02 : yes
0x0006 : 001 : 01 : yes
0x0007 : UNREACHABLE
0x0008 : UNREACHABLE
0x0009 : UNREACHABLE
0x000A : 001 : 02 : yes
0x000B : 003 : 02 : yes

So should UNREACHABLE LIDs just not be put into the file? Or is it something else?

> I will be working on finding what changed between OFED 1.1 and the trunk.

It was likely introduced by the changes to the routing engines committed yesterday and sent to the list in late November. git-bisect can help isolate exactly which change.

> This is another cause for the failure of all osmMulticastRoutingTest and
> osmStability tests runs.
> Another one would be the change of the osm.mcfdbs which is parsed by
> IBDM too.

Are you asking about the other patch again?
-- Hal > Eitan > From halr at voltaire.com Wed Dec 6 12:04:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 15:04:07 -0500 Subject: [openib-general] Multicast Group Routing Question In-Reply-To: <4577108F.9080308@dbresearch.net> References: <45770372.8010700@dbresearch.net> <1165429589.25587.136986.camel@hal.voltaire.com> <4577108F.9080308@dbresearch.net> Message-ID: <1165435407.25587.141052.camel@hal.voltaire.com> On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote: > Hal Rosenstock wrote: > > Hi Sean, > > > > On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote: > > > >> Hello, > >> > >> I was testing our code and noticed that when I send data using > >> multicast over our ib0 interface, all of the infiniband switches route > >> the data to each switch and each node instead of a node that has an > >> application listening to the interface like Ethernet. Is this by design? > >> > > > > It depends on what multicast group is being used and which end nodes > > have registered for that group as to where the data is routed. > > > > -- Hal > > > Hey Hal, > > The multicast group I am sending data to is 224.10.10.x (not > 224.0.0.x) and I have no clients / nodes listening but the data is still > being sent. Yes, if there is only a sender, the data should not be routed anywhere. > I am using wwtop from warewulf to view the network load for > each node. I'm not familiar with those tools. > Does this make sense? Nope. To state the obvious, something is not as it seems... Can you state which SM you are using ? Also, can you do the following: saquery -g saquery -m and send me the output. I may have some more experiments once I get that level of info. 
-- Hal > Sean From rowland at cse.ohio-state.edu Wed Dec 6 12:42:10 2006 From: rowland at cse.ohio-state.edu (Shaun Rowland) Date: Wed, 06 Dec 2006 15:42:10 -0500 Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <340888.34199.qm@web58004.mail.re3.yahoo.com> References: <340888.34199.qm@web58004.mail.re3.yahoo.com> Message-ID: <45772B22.2090702@cse.ohio-state.edu> david elsen wrote: > Shaun, > > I tried this and am getting some error messages: > > Please see following: > [root at ammasso2 osu_benchmarks]# mpicc osu_latency.c > [root at ammasso2 osu_benchmarks]# ls > a.out osu_bw.c osu_latency.c osu_put_bw.c > osu_acc_latency.c osu_get_bw.c osu_latency_mt.c osu_put_latency.c > osu_bibw.c osu_get_latency.c osu_put_bibw.c > [root at ammasso2 osu_benchmarks]# ./a.out You need to execute these with mpiexec after starting mpdboot, so the process would be something like: mpicc -o osu_lat osu_latency.c mpdboot -n 2 -f hosts mpiexec -n 2 ./osu_lat .... mpdallexit As detailed in the User Guide: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-170005.2 You should also see this section of the User Guide if you have problems with iWARP: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-400007.3 Also, this section describes using iWARP with MVAPICH2: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-230005.8 Have you set up everything (like /etc/mv2.conf)? Are you using the environment variable MV2_USE_RDMA_CM as described above? With the mpiexec command, it should be enough to export this variable to a value of 1 in the same environment in which you execute mpiexec - this will automatically propagate to the processes on remote machines. 
-- Shaun Rowland rowland at cse.ohio-state.edu http://www.cse.ohio-state.edu/~rowland/ From shubbell at dbresearch.net Wed Dec 6 13:06:13 2006 From: shubbell at dbresearch.net (Sean Hubbell) Date: Wed, 06 Dec 2006 15:06:13 -0600 Subject: [openib-general] Multicast Group Routing Question In-Reply-To: <1165435407.25587.141052.camel@hal.voltaire.com> References: <45770372.8010700@dbresearch.net> <1165429589.25587.136986.camel@hal.voltaire.com> <4577108F.9080308@dbresearch.net> <1165435407.25587.141052.camel@hal.voltaire.com> Message-ID: <457730C5.9000902@dbresearch.net> Hal Rosenstock wrote: > On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote: > >> Hal Rosenstock wrote: >> >>> Hi Sean, >>> >>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote: >>> >>> >>>> Hello, >>>> >>>> I was testing our code and noticed that when I send data using >>>> multicast over our ib0 interface, all of the infiniband switches route >>>> the data to each switch and each node instead of a node that has an >>>> application listening to the interface like Ethernet. Is this by design? >>>> >>>> >>> It depends on what multicast group is being used and which end nodes >>> have registered for that group as to where the data is routed. >>> >>> -- Hal >>> >>> >> Hey Hal, >> >> The multicast group I am sending data to is 224.10.10.x (not >> 224.0.0.x) and I have no clients / nodes listening but the data is still >> being sent. >> > > Yes, if there is only a sender, the data should not be routed anywhere. > > >> I am using wwtop from warewulf to view the network load for >> each node. >> > > I'm not familiar with those tools. > > >> Does this make sense? >> > > Nope. To state the obvious, something is not as it seems... > > Can you state which SM you are using ? > > Also, can you do the following: > saquery -g > saquery -m > and send me the output. > > I may have some more experiments once I get that level of info. > > -- Hal > We have a Voltaire HW subnet manager. I do not have the saquery command. 
I'll have to find this and install it. Would the web interface help? Sean From vishal at endace.com Wed Dec 6 13:23:17 2006 From: vishal at endace.com (vishal) Date: Thu, 07 Dec 2006 10:23:17 +1300 Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem Message-ID: <1165440197.2894.5.camel@julia.et.endace.com> Hi, Was trying to install IBGOLD on Red Hat 4 (x86_64), and the following is the 'error' part from a log file. I couldn't find the -Xcompiler option in the gcc manual. Am I missing something ? configure:2466: $? = 0 configure:2468: gcc -v &5 Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.6/specs Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-java-awt=gtk --host=x86_64-redhat-linux Thread model: posix gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) configure:2471: $? = 0 configure:2473: gcc -V &5 gcc: `-V' option must have argument configure:2476: $? = 1 configure:2499: checking for C compiler default output file name configure:2502: gcc -m32 -m32 -Xcompiler -m32 conftest.c >&5 gcc: unrecognized option `-Xcompiler' /usr/bin/ld: crt1.o: No such file: No such file or directory collect2: ld returned 1 exit status Thanks! 
Vishal From halr at voltaire.com Wed Dec 6 13:38:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 16:38:54 -0500 Subject: [openib-general] Multicast Group Routing Question In-Reply-To: <457730C5.9000902@dbresearch.net> References: <45770372.8010700@dbresearch.net> <1165429589.25587.136986.camel@hal.voltaire.com> <4577108F.9080308@dbresearch.net> <1165435407.25587.141052.camel@hal.voltaire.com> <457730C5.9000902@dbresearch.net> Message-ID: <1165441086.25587.144751.camel@hal.voltaire.com> On Wed, 2006-12-06 at 16:06, Sean Hubbell wrote: > Hal Rosenstock wrote: > > On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote: > > > >> Hal Rosenstock wrote: > >> > >>> Hi Sean, > >>> > >>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote: > >>> > >>> > >>>> Hello, > >>>> > >>>> I was testing our code and noticed that when I send data using > >>>> multicast over our ib0 interface, all of the infiniband switches route > >>>> the data to each switch and each node instead of a node that has an > >>>> application listening to the interface like Ethernet. Is this by design? > >>>> > >>>> > >>> It depends on what multicast group is being used and which end nodes > >>> have registered for that group as to where the data is routed. > >>> > >>> -- Hal > >>> > >>> > >> Hey Hal, > >> > >> The multicast group I am sending data to is 224.10.10.x (not > >> 224.0.0.x) and I have no clients / nodes listening but the data is still > >> being sent. > >> > > > > Yes, if there is only a sender, the data should not be routed anywhere. > > > > > >> I am using wwtop from warewulf to view the network load for > >> each node. > >> > > > > I'm not familiar with those tools. > > > > > >> Does this make sense? > >> > > > > Nope. To state the obvious, something is not as it seems... > > > > Can you state which SM you are using ? > > > > Also, can you do the following: > > saquery -g > > saquery -m > > and send me the output. 
> > > > I may have some more experiments once I get that level of info. > > > > -- Hal > > > We have a Voltaire HW subnet manager. I do not have the saquery command. > I'll have to find this and install it. What is running on your end nodes ? Is it OpenIB/OFED or something else ? If it is OpenIB/OFED, saquery should be there. I think OFED 1.2 supports the options I mentioned. > Would the web interface help? Not sure whether there is anything there for this. -- Hal > > Sean From elsen_david at yahoo.com Wed Dec 6 13:44:49 2006 From: elsen_david at yahoo.com (david elsen) Date: Wed, 6 Dec 2006 13:44:49 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <45772B22.2090702@cse.ohio-state.edu> Message-ID: <336295.54543.qm@web58008.mail.re3.yahoo.com> oops, sorry, my fault. I will try it again. Shaun Rowland wrote: david elsen wrote: > Shaun, > > I tried this and am getting some error messages: > > Please see following: > [root at ammasso2 osu_benchmarks]# mpicc osu_latency.c > [root at ammasso2 osu_benchmarks]# ls > a.out osu_bw.c osu_latency.c osu_put_bw.c > osu_acc_latency.c osu_get_bw.c osu_latency_mt.c osu_put_latency.c > osu_bibw.c osu_get_latency.c osu_put_bibw.c > [root at ammasso2 osu_benchmarks]# ./a.out You need to execute these with mpiexec after starting mpdboot, so the process would be something like: mpicc -o osu_lat osu_latency.c mpdboot -n 2 -f hosts mpiexec -n 2 ./osu_lat .... mpdallexit As detailed in the User Guide: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-170005.2 You should also see this section of the User Guide if you have problems with iWARP: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-400007.3 Also, this section describes using iWARP with MVAPICH2: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-230005.8 Have you set up everything (like /etc/mv2.conf)? 
Are you using the environment variable MV2_USE_RDMA_CM as described above? With the mpiexec command, it should be enough to export this variable to a value of 1 in the same environment in which you execute mpiexec - this will automatically propagate to the processes on remote machines. -- Shaun Rowland rowland at cse.ohio-state.edu http://www.cse.ohio-state.edu/~rowland/ From elsen_david at yahoo.com Wed Dec 6 13:50:26 2006 From: elsen_david at yahoo.com (david elsen) Date: Wed, 6 Dec 2006 13:50:26 -0800 (PST) Subject: [openib-general] openMPI for 2.6.17.10 kernel In-Reply-To: <336295.54543.qm@web58008.mail.re3.yahoo.com> Message-ID: <24626.54501.qm@web58007.mail.re3.yahoo.com> Shaun/Steve, I could run the osu_lat on my set-up with two Ammasso cards. Thanks a lot for the help, David david elsen wrote: oops, sorry, my fault. I will try it again. Shaun Rowland wrote: david elsen wrote: > Shaun, > > I tried this and am getting some error messages: > > Please see following: > [root at ammasso2 osu_benchmarks]# mpicc osu_latency.c > [root at ammasso2 osu_benchmarks]# ls > a.out osu_bw.c osu_latency.c osu_put_bw.c > osu_acc_latency.c osu_get_bw.c osu_latency_mt.c osu_put_latency.c > osu_bibw.c osu_get_latency.c osu_put_bibw.c > [root at ammasso2 osu_benchmarks]# ./a.out You need to execute these with mpiexec after starting mpdboot, so the process would be something like: mpicc -o osu_lat osu_latency.c mpdboot -n 2 -f hosts mpiexec -n 2 ./osu_lat ....
mpdallexit As detailed in the User Guide: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-170005.2 You should also see this section of the User Guide if you have problems with iWARP: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-400007.3 Also, this section describes using iWARP with MVAPICH2: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-230005.8 Have you set up everything (like /etc/mv2.conf)? Are you using the environment variable MV2_USE_RDMA_CM as described above? With the mpiexec command, it should be enough to export this variable to a value of 1 in the same environment in which you execute mpiexec - this will automatically propagate to the processes on remote machines. -- Shaun Rowland rowland at cse.ohio-state.edu http://www.cse.ohio-state.edu/~rowland/ _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From eitan at mellanox.co.il Wed Dec 6 13:56:25 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 06 Dec 2006 23:56:25 +0200 Subject: [openib-general] osm: More simulation faiures on trunk In-Reply-To: <1165434717.25587.140668.camel@hal.voltaire.com> References: <45770599.7080005@mellanox.co.il> <1165434717.25587.140668.camel@hal.voltaire.com> Message-ID: <45773C89.9060901@mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Eitan, > > On Wed, 2006-12-06 at 13:02, Eitan Zahavi wrote: > >> Hi Hal, >> >> Looks like the osm.fdbs file is now created with "UNREACHABLE" mark when >> opensm >> is invoked with updn routing engine. >> > > Are you referring to certain LIDs being UNREACHABLE like this: > LID : Port : Hops : Optimal > 0x0001 : UNREACHABLE > 0x0002 : UNREACHABLE > 0x0003 : 000 : 00 : yes > 0x0004 : 001 : 02 : yes > 0x0005 : 003 : 02 : yes > 0x0006 : 001 : 01 : yes > 0x0007 : UNREACHABLE > 0x0008 : UNREACHABLE > 0x0009 : UNREACHABLE > 0x000A : 001 : 02 : yes > 0x000B : 003 : 02 : yes > > So should UNREACHABLE LIDs just not be put into the file ? Or is it > something else ? > > The UNREACHABLE is fine. The problem is that ALL LFTs are full of UNREACHABLE. Actually there are no reachable nodes ... >> I will be working on finding what changed between OFED 1.1 and the trunk. >> > > It was likely introduced by the changes to the routing engines committed > yesterday and sent on the list in late November. git-bisect can help > isolate exactly which change. > Thanks I will follow that trail > >> This is another cause for the failure of all osmMulticastRoutingTest and >> osmStability tests runs. >> > > >> Another one would be the change of the osm.mcfdbs which is parsed by >> IBDM too. >> > > Are you asking about the other patch again ? > Do you have an estimate for when my patch will be merged ?
> -- Hal > > >> Eitan >> >> > From halr at voltaire.com Wed Dec 6 14:15:44 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 17:15:44 -0500 Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_sa_informinfo.c: Conformance changes for subscribe component Message-ID: <1165443320.25587.146153.camel@hal.voltaire.com> OpenSM/osm_sa_informinfo.c: Conformance changes for subscribe component Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index ad705b5..5d81b84 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -339,9 +339,6 @@ __osm_infr_rcv_respond( p_resp_infr = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); - /* confirm success */ - p_resp_infr->subscribe = 1; - status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); if ( status != IB_SUCCESS ) @@ -754,6 +751,20 @@ osm_infr_rcv_process_set_method( goto Exit; } + /* Subscribe values above 1 are undefined */ + if (p_recvd_inform_info->subscribe > 1) + { + cl_plock_release( p_rcv->p_lock ); + + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_infr_rcv_process_set_method: ERR 4308 " + "Invalid subscribe: %d\n", + p_recvd_inform_info->subscribe + ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID ); + goto Exit; + } + /* * MODIFICATIONS DONE ON INCOMING REQUEST: * From halr at voltaire.com Wed Dec 6 14:46:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 17:46:14 -0500 Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_ucast_updn.c: In updn_init, add routine exit osm_log message for an error case Message-ID: <1165445137.25587.147342.camel@hal.voltaire.com> OpenSM/osm_ucast_updn.c: In updn_init, add
routine exit osm_log message for an error case Also, some cosmetic changes Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index 7e6a6d5..b0ea721 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Implementation of Up Down Algorithm using ranking & Min Hop @@ -272,7 +271,7 @@ __updn_bfs_by_node( "__updn_bfs_by_node:" "Update Min Hop Table of GUID 0x%" PRIx64 "\n", cl_ntoh64(p_port->guid) ); - osm_switch_set_hops(p_self_node, root_lid , 0, 0); + osm_switch_set_hops(p_self_node, root_lid, 0, 0); } else { @@ -598,7 +597,7 @@ updn_init( if (!p_list) { status = IB_ERROR; - goto Exit_Bad; + goto Exit; } cl_list_construct( p_list ); @@ -630,7 +629,7 @@ updn_init( { if (strcspn(line, " ,;.") == strlen(line)) { - /* Skip empty Lines anywhere in the file - only one char means the Null termination */ + /* Skip empty lines anywhere in the file - only one char means the Null termination */ if (strlen(line) > 1) { p_tmp = malloc(sizeof(uint64_t)); @@ -670,12 +669,8 @@ updn_init( } /* If auto mode detection required - will be executed in main b4 the assignment of UI Ucast */ - goto Exit; - - Exit_Bad : - return 1; - Exit : - OSM_LOG_EXIT( &p_osm->log ); +Exit : + OSM_LOG_EXIT( &p_osm->log ); return (status); } From halr at voltaire.com Wed Dec 6 15:20:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Dec 2006 18:20:41 -0500 Subject: [openib-general] osm: More simulation faiures on trunk In-Reply-To: <45773C89.9060901@mellanox.co.il> References: <45770599.7080005@mellanox.co.il> <1165434717.25587.140668.camel@hal.voltaire.com> <45773C89.9060901@mellanox.co.il> Message-ID: <1165447232.25587.148638.camel@hal.voltaire.com> Hi Eitan, On Wed, 2006-12-06 at 16:56, Eitan Zahavi wrote: > Hi Hal, > > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Wed, 2006-12-06 at 13:02, Eitan Zahavi wrote: > > > >> Hi Hal, > >> > >> Looks like the osm.fdbs file is now created 
with "UNREACHABLE" mark when > >> opensm > >> is invoked with updn routing engine. > >> > > > > Are you referring to certain LIDs being UNREACHABLE like this: > > LID : Port : Hops : Optimal > > 0x0001 : UNREACHABLE > > 0x0002 : UNREACHABLE > > 0x0003 : 000 : 00 : yes > > 0x0004 : 001 : 02 : yes > > 0x0005 : 003 : 02 : yes > > 0x0006 : 001 : 01 : yes > > 0x0007 : UNREACHABLE > > 0x0008 : UNREACHABLE > > 0x0009 : UNREACHABLE > > 0x000A : 001 : 02 : yes > > 0x000B : 003 : 02 : yes > > > > So should UNREACHABLE LIDs just not be put into the file ? Or is it > > something else ? > > > > > The UNREACHABLE is fine. > The problem is that ALL LFTs are full of UNREACHABLE. Actually there are > no reachable nodes ... Weird. It works on my topology (with UPDN). > >> I will be working on finding what changed between OFED 1.1 and the trunk. > >> > > > > It was likely introduced by the changes to the routing engines committed > > yesterday and sent on the list in late November. git-bisect can help > > isolate exactly which change. > > > Thanks I will follow that trail > > > >> This is another cause for the failure of all osmMulticastRoutingTest and > >> osmStability tests runs. > >> > > > > > >> Another one would be the change of the osm.mcfdbs which is parsed by > >> IBDM too. > >> > > > > Are you asking about the other patch again ? > > > Do you have an estimate for when my patch will be merged ? I already answered this in an earlier email.
-- Hal > > -- Hal > > > > > >> Eitan > >> > >> > > > From boris at mellanox.com Wed Dec 6 15:54:16 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Wed, 6 Dec 2006 15:54:16 -0800 Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem Message-ID: <1E3DCD1C63492545881FACB6063A57C16E4132@mtiexch01.mti.com> What IBGD version are you using ? Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of vishal Sent: Wednesday, December 06, 2006 1:23 PM To: openib-general at openib.org Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem Hi, Was trying to install IBGOLD on Red Hat 4 (x86_64), and the following is the 'error' part from a log file. I couldn't find the -Xcompiler option in the gcc manual. Am I missing something ? configure:2466: $? = 0 configure:2468: gcc -v &5 Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.6/specs Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-java-awt=gtk --host=x86_64-redhat-linux Thread model: posix gcc version 3.4.6 20060404 (Red Hat 3.4.6-3) configure:2471: $? = 0 configure:2473: gcc -V &5 gcc: `-V' option must have argument configure:2476: $?
= 1 configure:2499: checking for C compiler default output file name configure:2502: gcc -m32 -m32 -Xcompiler -m32 conftest.c >&5 gcc: unrecognized option `-Xcompiler' /usr/bin/ld: crt1.o: No such file: No such file or directory collect2: ld returned 1 exit status Thanks! Vishal From gregkh at suse.de Wed Dec 6 22:12:01 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Wed, 06 Dec 2006 22:12:01 -0800 Subject: [openib-general] patch pci-only-check-the-ht-capability-bits-in-mpic.c.patch added to gregkh-2.6 tree In-Reply-To: <20061122072626.94B1A67C3C@ozlabs.org> Message-ID: <20061207061210.04E90A609F3@imap.suse.de> This is a note to let you know that I've just added the patch titled Subject: PCI: Only check the HT capability bits in mpic.c to my gregkh-2.6 tree. Its filename is pci-only-check-the-ht-capability-bits-in-mpic.c.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From michael at ozlabs.org Tue Nov 21 23:26:32 2006 From: Michael Ellerman To: linux-pci at atrey.karlin.mff.cuni.cz CC: Greg Kroah-Hartman , Benjamin Herrenschmidt , Eric W. Biederman , Segher Boessenkool , , , Date: Wed, 22 Nov 2006 18:26:22 +1100 Subject: PCI: Only check the HT capability bits in mpic.c Message-Id: <20061122072626.94B1A67C3C at ozlabs.org> Only compare the exact HT capability bits against HT_CAPTYPE_IRQ, this is a little paranoid, but doesn't hurt.
Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman --- arch/powerpc/sysdev/mpic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- gregkh-2.6.orig/arch/powerpc/sysdev/mpic.c +++ gregkh-2.6/arch/powerpc/sysdev/mpic.c @@ -390,7 +390,7 @@ static void __init mpic_scan_ht_pic(stru u8 id = readb(devbase + pos + PCI_CAP_LIST_ID); if (id == PCI_CAP_ID_HT) { id = readb(devbase + pos + 3); - if (id == HT_CAPTYPE_IRQ) + if ((id & HT_5BIT_CAP_MASK) == HT_CAPTYPE_IRQ) break; } } Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are From gregkh at suse.de Wed Dec 6 22:12:04 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Wed, 06 Dec 2006 22:12:04 -0800 Subject: [openib-general] patch pci-use-pci_find_ht_capability-in-drivers-pci-htirq.c.patch added to gregkh-2.6 tree In-Reply-To: <20061122072623.ECBFE67C38@ozlabs.org> Message-ID: <20061207061213.4F333A60A6D@imap.suse.de> This is a note to let you know that I've just added the patch titled Subject: PCI: Use pci_find_ht_capability() in drivers/pci/htirq.c to my gregkh-2.6 tree. Its filename is pci-use-pci_find_ht_capability-in-drivers-pci-htirq.c.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From michael at ozlabs.org Tue Nov 21 23:26:32 2006 From: Michael Ellerman To: linux-pci at atrey.karlin.mff.cuni.cz CC: Greg Kroah-Hartman , Benjamin Herrenschmidt , Eric W. 
Biederman , Segher Boessenkool , , , Date: Wed, 22 Nov 2006 18:26:19 +1100 Subject: PCI: Use pci_find_ht_capability() in drivers/pci/htirq.c Message-Id: <20061122072623.ECBFE67C38 at ozlabs.org> Use pci_find_ht_capability() in drivers/pci/htirq.c Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman --- drivers/pci/htirq.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) --- gregkh-2.6.orig/drivers/pci/htirq.c +++ gregkh-2.6/drivers/pci/htirq.c @@ -99,14 +99,7 @@ int __ht_create_irq(struct pci_dev *dev, int pos; int irq; - pos = pci_find_capability(dev, PCI_CAP_ID_HT); - while (pos) { - u8 subtype; - pci_read_config_byte(dev, pos + 3, &subtype); - if (subtype == HT_CAPTYPE_IRQ) - break; - pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); - } + pos = pci_find_ht_capability(dev, HT_CAPTYPE_IRQ); if (!pos) return -EINVAL; Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are From gregkh at suse.de Wed Dec 6 22:11:57 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Wed, 06 Dec 2006 22:11:57 -0800 Subject: [openib-general] patch pci-create-__pci_bus_find_cap_start-from-__pci_bus_find_cap.patch added to gregkh-2.6 tree In-Reply-To: <20061122072621.BC4B967C35@ozlabs.org> Message-ID: <20061207061206.7DDE6A606D8@imap.suse.de> This is a note to let you know that I've just added the patch titled Subject: PCI: Create __pci_bus_find_cap_start() from __pci_bus_find_cap() to my gregkh-2.6 tree. 
Its filename is pci-create-__pci_bus_find_cap_start-from-__pci_bus_find_cap.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From owner-linux-pci at atrey.karlin.mff.cuni.cz Tue Nov 21 23:26:32 2006 From: Michael Ellerman To: linux-pci at atrey.karlin.mff.cuni.cz Cc: Greg Kroah-Hartman , Benjamin Herrenschmidt , Eric W.Biederman , Segher Boessenkool , , , Date: Wed, 22 Nov 2006 18:26:16 +1100 Subject: PCI: Create __pci_bus_find_cap_start() from __pci_bus_find_cap() Message-Id: <20061122072621.BC4B967C35 at ozlabs.org> The current implementation of __pci_bus_find_cap() does two things, first it determines the start of the capability chain for the device, and then it trys to find the requested capability. Split these out, so that we can use the two parts independantly in a subsequent patch. Externally visible behaviour should be unchanged. Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman --- drivers/pci/pci.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) --- gregkh-2.6.orig/drivers/pci/pci.c +++ gregkh-2.6/drivers/pci/pci.c @@ -96,10 +96,10 @@ int pci_find_next_capability(struct pci_ } EXPORT_SYMBOL_GPL(pci_find_next_capability); -static int __pci_bus_find_cap(struct pci_bus *bus, unsigned int devfn, u8 hdr_type, int cap) +static int __pci_bus_find_cap_start(struct pci_bus *bus, + unsigned int devfn, u8 hdr_type) { u16 status; - u8 pos; pci_bus_read_config_word(bus, devfn, PCI_STATUS, &status); if (!(status & PCI_STATUS_CAP_LIST)) @@ -108,15 +108,14 @@ static int __pci_bus_find_cap(struct pci switch (hdr_type) { case PCI_HEADER_TYPE_NORMAL: case PCI_HEADER_TYPE_BRIDGE: - pos = PCI_CAPABILITY_LIST; - break; + return PCI_CAPABILITY_LIST; case PCI_HEADER_TYPE_CARDBUS: - pos = PCI_CB_CAPABILITY_LIST; - break; + return PCI_CB_CAPABILITY_LIST; default: return 0; } - return __pci_find_next_cap(bus, devfn, pos, cap); + + return 0; } /** @@ -140,7 +139,13 @@ static 
int __pci_bus_find_cap(struct pci */ int pci_find_capability(struct pci_dev *dev, int cap) { - return __pci_bus_find_cap(dev->bus, dev->devfn, dev->hdr_type, cap); + int pos; + + pos = __pci_bus_find_cap_start(dev->bus, dev->devfn, dev->hdr_type); + if (pos) + pos = __pci_find_next_cap(dev->bus, dev->devfn, pos, cap); + + return pos; } /** @@ -158,11 +163,16 @@ int pci_find_capability(struct pci_dev * */ int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap) { + int pos; u8 hdr_type; pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, &hdr_type); - return __pci_bus_find_cap(bus, devfn, hdr_type & 0x7f, cap); + pos = __pci_bus_find_cap_start(bus, devfn, hdr_type & 0x7f); + if (pos) + pos = __pci_find_next_cap(bus, devfn, pos, cap); + + return pos; } /** Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are From gregkh at suse.de Wed Dec 6 22:12:08 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Wed, 06 Dec 2006 22:12:08 -0800 Subject: [openib-general] patch pci-use-pci_find_ht_capability-in-drivers-pci-quirks.c.patch added to gregkh-2.6 tree In-Reply-To: <20061122072625.8B07767C3B@ozlabs.org> Message-ID: <20061207061216.C71179A88F5@imap.suse.de> This is a note to let you know that I've just added the patch titled Subject: PCI: Use pci_find_ht_capability() in drivers/pci/quirks.c to my gregkh-2.6 tree. Its filename is pci-use-pci_find_ht_capability-in-drivers-pci-quirks.c.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From michael at ozlabs.org Tue Nov 21 23:26:32 2006 From: Michael Ellerman To: linux-pci at atrey.karlin.mff.cuni.cz CC: Greg Kroah-Hartman , Benjamin Herrenschmidt , Eric W. Biederman , Segher Boessenkool , , , Date: Wed, 22 Nov 2006 18:26:21 +1100 Subject: PCI: Use pci_find_ht_capability() in drivers/pci/quirks.c Message-Id: <20061122072625.8B07767C3B at ozlabs.org> Use pci_find_ht_capability() in drivers/pci/quirks.c. 
I'm pretty sure the logic is unchanged here, but someone please eye-ball it for me. I've changed the message to be a little shorter, it's now: PCI: Found (enabled|disabled) HT MSI mapping on xxxx:xx:xx.x Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman --- drivers/pci/quirks.c | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) --- gregkh-2.6.orig/drivers/pci/quirks.c +++ gregkh-2.6/drivers/pci/quirks.c @@ -1644,19 +1644,23 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AM * return 1 if a HT MSI capability is found and enabled */ static int __devinit msi_ht_cap_enabled(struct pci_dev *dev) { - u8 pos; - int ttl; - for (pos = pci_find_capability(dev, PCI_CAP_ID_HT), ttl = 48; - pos && ttl; - pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT), ttl--) { - u32 cap_hdr; - /* MSI mapping section according to Hypertransport spec */ - if (pci_read_config_dword(dev, pos, &cap_hdr) == 0 - && (cap_hdr & 0xf8000000) == 0xa8000000 /* MSI mapping */) { - printk(KERN_INFO "PCI: Found HT MSI mapping on %s with capability %s\n", - pci_name(dev), cap_hdr & 0x10000 ? "enabled" : "disabled"); - return (cap_hdr & 0x10000) != 0; /* MSI mapping cap enabled */ + int pos, ttl = 48; + + pos = pci_find_ht_capability(dev, HT_CAPTYPE_MSI_MAPPING); + while (pos && ttl--) { + u8 flags; + + if (pci_read_config_byte(dev, pos + HT_MSI_FLAGS, + &flags) == 0) + { + printk(KERN_INFO "PCI: Found %s HT MSI Mapping on %s\n", + flags & HT_MSI_FLAGS_ENABLE ? 
+ "enabled" : "disabled", pci_name(dev)); + return (flags & HT_MSI_FLAGS_ENABLE) != 0; } + + pos = pci_find_next_ht_capability(dev, pos, + HT_CAPTYPE_MSI_MAPPING); } return 0; } Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are From gregkh at suse.de Wed Dec 6 22:11:54 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Wed, 06 Dec 2006 22:11:54 -0800 Subject: [openib-general] patch pci-add-pci_find_ht_capability-for-finding-hypertransport-capabilities.patch added to gregkh-2.6 tree In-Reply-To: <20061122072622.E2A8967C37@ozlabs.org> Message-ID: <20061207061202.977658A4C6C@imap.suse.de> This is a note to let you know that I've just added the patch titled Subject: PCI: Add pci_find_ht_capability() for finding Hypertransport capabilities to my gregkh-2.6 tree. Its filename is pci-add-pci_find_ht_capability-for-finding-hypertransport-capabilities.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From owner-linux-pci at atrey.karlin.mff.cuni.cz Tue Nov 21 23:26:37 2006 From: Michael Ellerman To: linux-pci at atrey.karlin.mff.cuni.cz Cc: Greg Kroah-Hartman , Benjamin Herrenschmidt , Eric W.Biederman , Segher Boessenkool , , , Date: Wed, 22 Nov 2006 18:26:18 +1100 Subject: PCI: Add pci_find_ht_capability() for finding Hypertransport capabilities Message-Id: <20061122072622.E2A8967C37 at ozlabs.org> From: Michael Ellerman There are already several places in the kernel that want to search a PCI device for a given Hypertransport capability. Although this is possible using pci_find_capability() etc., it makes sense to encapsulate that logic in a helper - pci_find_ht_capability(). To cater for searching exhaustively for a capability, we also provide pci_find_next_ht_capability(). We also need to cater for the fact that the HT capability fields may be either 3 or 5 bits wide. 
pci_find_ht_capability() deals with this for you, but callers using the #defines directly must handle that themselves. Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman --- drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++-- include/linux/pci.h | 2 + include/linux/pci_regs.h | 12 ++++++ 3 files changed, 94 insertions(+), 4 deletions(-) --- gregkh-2.6.orig/drivers/pci/pci.c +++ gregkh-2.6/drivers/pci/pci.c @@ -68,12 +68,14 @@ pci_max_busnr(void) #endif /* 0 */ -static int __pci_find_next_cap(struct pci_bus *bus, unsigned int devfn, u8 pos, int cap) +#define PCI_FIND_CAP_TTL 48 + +static int __pci_find_next_cap_ttl(struct pci_bus *bus, unsigned int devfn, + u8 pos, int cap, int *ttl) { u8 id; - int ttl = 48; - while (ttl--) { + while ((*ttl)--) { pci_bus_read_config_byte(bus, devfn, pos, &pos); if (pos < 0x40) break; @@ -89,6 +91,14 @@ static int __pci_find_next_cap(struct pc return 0; } +static int __pci_find_next_cap(struct pci_bus *bus, unsigned int devfn, + u8 pos, int cap) +{ + int ttl = PCI_FIND_CAP_TTL; + + return __pci_find_next_cap_ttl(bus, devfn, pos, cap, &ttl); +} + int pci_find_next_capability(struct pci_dev *dev, u8 pos, int cap) { return __pci_find_next_cap(dev->bus, dev->devfn, @@ -224,6 +234,74 @@ int pci_find_ext_capability(struct pci_d } EXPORT_SYMBOL_GPL(pci_find_ext_capability); +static int __pci_find_next_ht_cap(struct pci_dev *dev, int pos, int ht_cap) +{ + int rc, ttl = PCI_FIND_CAP_TTL; + u8 cap, mask; + + if (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST) + mask = HT_3BIT_CAP_MASK; + else + mask = HT_5BIT_CAP_MASK; + + pos = __pci_find_next_cap_ttl(dev->bus, dev->devfn, pos, + PCI_CAP_ID_HT, &ttl); + while (pos) { + rc = pci_read_config_byte(dev, pos + 3, &cap); + if (rc != PCIBIOS_SUCCESSFUL) + return 0; + + if ((cap & mask) == ht_cap) + return pos; + + pos = __pci_find_next_cap_ttl(dev->bus, dev->devfn, pos, + PCI_CAP_ID_HT, &ttl); + } + + return 0; +} +/** + * pci_find_next_ht_capability - 
query a device's Hypertransport capabilities + * @dev: PCI device to query + * @pos: Position from which to continue searching + * @ht_cap: Hypertransport capability code + * + * To be used in conjunction with pci_find_ht_capability() to search for + * all capabilities matching @ht_cap. @pos should always be a value returned + * from pci_find_ht_capability(). + * + * NB. To be 100% safe against broken PCI devices, the caller should take + * steps to avoid an infinite loop. + */ +int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap) +{ + return __pci_find_next_ht_cap(dev, pos + PCI_CAP_LIST_NEXT, ht_cap); +} +EXPORT_SYMBOL_GPL(pci_find_next_ht_capability); + +/** + * pci_find_ht_capability - query a device's Hypertransport capabilities + * @dev: PCI device to query + * @ht_cap: Hypertransport capability code + * + * Tell if a device supports a given Hypertransport capability. + * Returns an address within the device's PCI configuration space + * or 0 in case the device does not support the request capability. + * The address points to the PCI capability, of type PCI_CAP_ID_HT, + * which has a Hypertransport capability matching @ht_cap. 
+ */ +int pci_find_ht_capability(struct pci_dev *dev, int ht_cap) +{ + int pos; + + pos = __pci_bus_find_cap_start(dev->bus, dev->devfn, dev->hdr_type); + if (pos) + pos = __pci_find_next_ht_cap(dev, pos, ht_cap); + + return pos; +} +EXPORT_SYMBOL_GPL(pci_find_ht_capability); + /** * pci_find_parent_resource - return resource region of parent bus of given region * @dev: PCI device structure contains resources to be searched --- gregkh-2.6.orig/include/linux/pci.h +++ gregkh-2.6/include/linux/pci.h @@ -454,6 +454,8 @@ struct pci_dev *pci_find_slot (unsigned int pci_find_capability (struct pci_dev *dev, int cap); int pci_find_next_capability (struct pci_dev *dev, u8 pos, int cap); int pci_find_ext_capability (struct pci_dev *dev, int cap); +int pci_find_ht_capability (struct pci_dev *dev, int ht_cap); +int pci_find_next_ht_capability (struct pci_dev *dev, int pos, int ht_cap); struct pci_bus *pci_find_next_bus(const struct pci_bus *from); struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device, --- gregkh-2.6.orig/include/linux/pci_regs.h +++ gregkh-2.6/include/linux/pci_regs.h @@ -475,9 +475,19 @@ #define PCI_PWR_CAP 12 /* Capability */ #define PCI_PWR_CAP_BUDGET(x) ((x) & 1) /* Included in system budget */ -/* Hypertransport sub capability types */ +/* + * Hypertransport sub capability types + * + * Unfortunately there are both 3 bit and 5 bit capability types defined + * in the HT spec, catering for that is a little messy. You probably don't + * want to use these directly, just use pci_find_ht_capability() and it + * will do the right thing for you. 
+ */ +#define HT_3BIT_CAP_MASK 0xE0 #define HT_CAPTYPE_SLAVE 0x00 /* Slave/Primary link configuration */ #define HT_CAPTYPE_HOST 0x20 /* Host/Secondary link configuration */ + +#define HT_5BIT_CAP_MASK 0xF8 #define HT_CAPTYPE_IRQ 0x80 /* IRQ Configuration */ #define HT_CAPTYPE_REMAPPING_40 0xA0 /* 40 bit address remapping */ #define HT_CAPTYPE_REMAPPING_64 0xA2 /* 64 bit address remapping */ Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are From gregkh at suse.de Wed Dec 6 22:11:50 2006 From: gregkh at suse.de (gregkh at suse.de) Date: Wed, 06 Dec 2006 22:11:50 -0800 Subject: [openib-general] patch pci-add-defines-for-hypertransport-msi-fields.patch added to gregkh-2.6 tree In-Reply-To: <20061122072624.B86A167C3A@ozlabs.org> Message-ID: <20061207061158.F2300A60904@imap.suse.de> This is a note to let you know that I've just added the patch titled Subject: PCI: Add #defines for Hypertransport MSI fields to my gregkh-2.6 tree. Its filename is pci-add-defines-for-hypertransport-msi-fields.patch This tree can be found at http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/ >From owner-linux-pci at atrey.karlin.mff.cuni.cz Tue Nov 21 23:26:46 2006 From: Michael Ellerman To: linux-pci at atrey.karlin.mff.cuni.cz Cc: Greg Kroah-Hartman , Benjamin Herrenschmidt , Eric W.Biederman , Segher Boessenkool , , , Date: Wed, 22 Nov 2006 18:26:20 +1100 Subject: PCI: Add #defines for Hypertransport MSI fields Message-Id: <20061122072624.B86A167C3A at ozlabs.org> Add a few #defines for grabbing and working with the address fields in a HT_CAPTYPE_MSI_MAPPING capability. All from the HT spec v3.00. 
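As a rough illustration of how these fields might be used together, here is a minimal user-space sketch. It is not part of the patch. It assumes the capability has already been located at offset `pos` (for example by pci_find_ht_capability()) and that the configuration space is available as a little-endian byte array. The helpers cfg_read8(), cfg_read32(), ht_msi_mapping_enabled() and ht_msi_mapping_addr() are hypothetical names invented for this example; in the kernel the reads would go through pci_read_config_byte() and pci_read_config_dword().

```c
#include <stdint.h>

/* Values copied from the #defines added by this patch. */
#define HT_MSI_FLAGS		0x02	/* Offset to flags */
#define HT_MSI_FLAGS_ENABLE	0x1	/* Mapping enable */
#define HT_MSI_FLAGS_FIXED	0x2	/* Fixed mapping only */
#define HT_MSI_FIXED_ADDR	0x00000000FEE00000ULL	/* Fixed addr */
#define HT_MSI_ADDR_LO		0x04	/* Offset to low addr bits */
#define HT_MSI_ADDR_LO_MASK	0xFFF00000	/* Low address bit mask */
#define HT_MSI_ADDR_HI		0x08	/* Offset to high addr bits */

/* Hypothetical stand-in for pci_read_config_byte(). */
static uint8_t cfg_read8(const uint8_t *cfg, int off)
{
	return cfg[off];
}

/* Hypothetical stand-in for pci_read_config_dword() (little endian). */
static uint32_t cfg_read32(const uint8_t *cfg, int off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/* Return 1 if the MSI mapping capability at pos is enabled, else 0. */
static int ht_msi_mapping_enabled(const uint8_t *cfg, int pos)
{
	uint8_t flags = cfg_read8(cfg, pos + HT_MSI_FLAGS);

	return (flags & HT_MSI_FLAGS_ENABLE) != 0;
}

/* Return the MSI mapping address described by the capability at pos. */
static uint64_t ht_msi_mapping_addr(const uint8_t *cfg, int pos)
{
	uint8_t flags = cfg_read8(cfg, pos + HT_MSI_FLAGS);
	uint32_t lo, hi;

	/* A fixed mapping has no programmable address registers to read. */
	if (flags & HT_MSI_FLAGS_FIXED)
		return HT_MSI_FIXED_ADDR;

	lo = cfg_read32(cfg, pos + HT_MSI_ADDR_LO) & HT_MSI_ADDR_LO_MASK;
	hi = cfg_read32(cfg, pos + HT_MSI_ADDR_HI);

	return ((uint64_t)hi << 32) | lo;
}
```

Checking the fixed flag first is deliberate in this sketch: only when it is clear are the low (masked) and high address words combined into a 64-bit address.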
Signed-off-by: Michael Ellerman Signed-off-by: Greg Kroah-Hartman --- include/linux/pci_regs.h | 7 +++++++ 1 file changed, 7 insertions(+) --- gregkh-2.6.orig/include/linux/pci_regs.h +++ gregkh-2.6/include/linux/pci_regs.h @@ -494,6 +494,13 @@ #define HT_CAPTYPE_UNITID_CLUMP 0x90 /* Unit ID clumping */ #define HT_CAPTYPE_EXTCONF 0x98 /* Extended Configuration Space Access */ #define HT_CAPTYPE_MSI_MAPPING 0xA8 /* MSI Mapping Capability */ +#define HT_MSI_FLAGS 0x02 /* Offset to flags */ +#define HT_MSI_FLAGS_ENABLE 0x1 /* Mapping enable */ +#define HT_MSI_FLAGS_FIXED 0x2 /* Fixed mapping only */ +#define HT_MSI_FIXED_ADDR 0x00000000FEE00000ULL /* Fixed addr */ +#define HT_MSI_ADDR_LO 0x04 /* Offset to low addr bits */ +#define HT_MSI_ADDR_LO_MASK 0xFFF00000 /* Low address bit mask */ +#define HT_MSI_ADDR_HI 0x08 /* Offset to high addr bits */ #define HT_CAPTYPE_DIRECT_ROUTE 0xB0 /* Direct routing configuration */ #define HT_CAPTYPE_VCSET 0xB8 /* Virtual Channel configuration */ #define HT_CAPTYPE_ERROR_RETRY 0xC0 /* Retry on error configuration */ Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are From monil at voltaire.com Wed Dec 6 22:17:39 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 7 Dec 2006 08:17:39 +0200 Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update - RDMA CM etc In-Reply-To: <45770AA3.2040505@ichips.intel.com> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com> <20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com> <20061206101705.GP26787@mellanox.co.il> <45770AA3.2040505@ichips.intel.com> Message-ID: <6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com> Sean, On 12/6/06, Sean Hefty wrote: > >>>I gather the ucma bits are in rdma_ucm? > > Yes. > > Basically, I reworked changes that were in svn into separate branches based off > of 2.6.19. 
> > > 1st is probably to fix the mcast bits so that they don't crash the machine. > > OFED will be based on whatever is merged by Linus by that time + any number of patches > > and out of kernel modules. > > Even if the kernel multicast support could make it into 2.6.20, I won't have the > multicast changes to the rdma_cm done by then. > > >>3rd have Sean decide how he wants the multicast support to be integrated > >>into OFED 1.2, my guess would be as a patch set over the > >>ib_sa/ipoib/rdma_cm and rdma_ucm but its left for him to decide > > Does OFED want the multicast support in 1.2? We definitely want the multicast support in 1.2. It's on the wiki ( OFED 1.2 release plan and features) and I understood that this was also agreed on at SC06. -- Moni > > > Maybe the right thing is to split the multicast stuff in a separate library, > > or have a separate ABI version for multicast, I don't really know. > > My anticipation is that the multicast support will bump the ABI, but will allow > backwards compatibility. The break from librdmacm ABI 2 to ABI 3 is a result of > changing the event reporting. > > - Sean > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > From ogerlitz at voltaire.com Wed Dec 6 23:22:49 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 07 Dec 2006 09:22:49 +0200 Subject: [openib-general] [PATCH v3 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <1165430156.14800.243.camel@brick.pathscale.com> References: <1165430156.14800.243.camel@brick.pathscale.com> Message-ID: <4577C149.3050900@voltaire.com> Ralph Campbell wrote: > This version of the patch adds support for ib_dma_alloc_coherent() > and ib_dma_free_coherent(). It also fixes the bug Or found in > ipath_sync_single_for_cpu() and ipath_sync_single_for_device(). 
> This patch implements the interposing DMA mapping functions to allow > support for IOMMUs and remove the dependence on phys_to_virt(). Haven't you said that the ipath driver uses bus_to_virt ? > diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c > --- /dev/null Thu Jan 01 00:00:00 1970 +0000 > +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Tue Dec 05 16:04:53 2006 -0800 > +/** > + * ipath_dma_map_single - Map a kernel virtual address to DMA address > + * @dev: The device for which the dma_addr is to be created > + * @cpu_addr: The kernel virtual address > + * @size: The size of the region in bytes > + * @direction: The direction of the DMA > + */ > +static u64 ipath_dma_map_single(struct ib_device *dev, > + void *cpu_addr, size_t size, > + enum dma_data_direction direction) > +{ > + BUG_ON(!valid_dma_direction(direction)); > + return (u64) cpu_addr; > +} The documentation is both over kill in its volume and worse, simply tells a whole different story then what this code is doing. It does not generate DMA address, it does not care about the ib device nor the size or dma direction. Same for all the documentation below. > +/** > + * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry > + * @dev: The device for which the DMA addresses were created > + * @sg: The scatter/gather entry > + */ > +static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg) > +{ > + return (u64) page_address(sg->page); > +} this is a bug, you need to add sg->offset Or. From mst at mellanox.co.il Wed Dec 6 23:29:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 7 Dec 2006 09:29:38 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061207072938.GB26107@mellanox.co.il> > Quoting r. Scott Weitzenkamp (sweitzen) : > Subject: RE: [openib-general] [PATCH] IPoIB CM Experimental support > > > d. 
Limitations > > UDP multicast and UDP connections to IPoIB UD mode > > currently don't work since we get packets that are too large to > > send over a UD QP. > > As a work around, one can now create separate interfaces > > for use with CM and UD mode. > > You can't send UDP/multicast traffic at all between IPoIB CM and IPoIB > UD? With my experimental code, this currently works only if you manually limit the MTU for multicast/UD addresses. The simplest way to do this is to set up separate interfaces for CM and UD modes. > What about UDP/multicast between IPoIB CM hosts? As above. -- MST From mst at mellanox.co.il Wed Dec 6 23:30:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 7 Dec 2006 09:30:38 +0200 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <4577019C.7050900@ichips.intel.com> References: <4577019C.7050900@ichips.intel.com> Message-ID: <20061207073038.GC26107@mellanox.co.il> > > Just to clarify this point - what connecton messages can be lost? > > E.g. if the passive side does not get an RTU for a while, it will > > retry the REP, won't it? Diagram 12.9.6 seems to indicate so: > > from REP Sent we should go to RTU timeout, Send REP and back to REP Sent. > > Is this implemented? > > REP retries are already implemented in the ib_cm. This handles the case where > the RTU is repeatedly lost, but data is still received on the connection. Yes, I've even observed this with SDP, but I'm not sure why this happens. It seems that MADs are sometimes lost even in back to back configurations. Any idea why? 
-- MST From ogerlitz at voltaire.com Wed Dec 6 23:35:39 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 07 Dec 2006 09:35:39 +0200 Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1165428994.14800.229.camel@brick.pathscale.com> References: <1164910957.14800.71.camel@brick.pathscale.com> <1164918691.14800.101.camel@brick.pathscale.com> <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com> <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com> <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com> <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com> <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com> <1165359560.14800.210.camel@brick.pathscale.com> <4576AA73.105@voltaire.com> <1165428994.14800.229.camel@brick.pathscale.com> Message-ID: <4577C44B.20305@voltaire.com> Ralph Campbell wrote: > On Wed, 2006-12-06 at 13:33 +0200, Or Gerlitz wrote: >> Basically what Roland suggest is that you need to implement SW IOTLB >> mapping from dma_addr_t (possibly offset-ed) to kv addr. And do the >> actual kmap/unmap calls before/after you must touch the data. >> Is this impossible? > It is not impossible, just inefficient. Why add a mapping > table when it isn't needed? If I needed to implement HIGMEM > support, I would probably make "dma_addr_t" be a physical > memory address, convert to PFN, find the struct page pointer, > and call kmap_atomic() or page_address(). Why go though all > that in the worst case CPU path when doing the conversion > to kernel virtual address outside the critical path is > feasible? 
As i wrote you earlier on this thread, calling kmap_atomic **outside** the critical path (ie not when the low level ipath driver does an actual write/read to/from the page) is problematic b/c is means you hold a kmap atomic slots for long time which is something should not be done - eg see LDD 3rd edition pp 418 "your code must not sleep while holding a atomic kmap", on the other hand you can't just call kmap since you might be in non sleepable context (eg SCSI LLD such as SRP/iSER calling ib_dma_map_sg etc). So you might be able to follow your approach of the physical --> pfn --> page --> kmap_atomic (I think you don't need to bother checking if page_address is NULL since kmap is a NO OP when the page is mapped), but do it when you actually need the map. Or. From sean.hefty at intel.com Thu Dec 7 00:45:49 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 7 Dec 2006 00:45:49 -0800 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <20061207073038.GC26107@mellanox.co.il> Message-ID: <000001c719dc$185ac590$30cc180a@amr.corp.intel.com> >Yes, I've even observed this with SDP, but I'm not sure why this >happens. It seems that MADs are sometimes lost even in back to back >configurations. Any idea why? I have no idea why MADs would be lost. In our scale up testing, we *never* saw lost or dropped MADs to the SA node, even when hitting it with 500,000 queries. The fact that you're seeing lost MADs is something that we should probably look into more, someday, hopefully, when I have more time available... We didn't notice any issues with the CM messages in our testing, so we didn't examine that traffic in more detail. Are there counters for QP0/1 that can let us know whether drops are occurring on the send or receive side? 
- Sean From boris at lfbs.RWTH-Aachen.DE Thu Dec 7 01:26:28 2006 From: boris at lfbs.RWTH-Aachen.DE (Boris Bierbaum) Date: Thu, 07 Dec 2006 10:26:28 +0100 Subject: [openib-general] Status of DAT conformance test Message-ID: <4577DE44.5030308@lfbs.rwth-aachen.de> Hi, I'm looking for ways to test the standard conformance of a uDAPL provider. I had a look at the DAT conformance test contained in the DAPL reference implementation, release version gamma 3.2. This test doesn't seem to be in a state in which it can be used to test a uDAPL version 1.2 provider. Is anybody working to fix this? Which test programs can be recommended for this purpose? Thanks Boris -- | _ RWTH | Boris Bierbaum |_|_`_ | Lehrstuhl fuer Betriebssysteme | |_) _ | RWTH Aachen D-52056 Aachen |_)(_` | Tel: +49-241-80-27805 ._) | Fax: +49-241-80-22339 From mst at mellanox.co.il Thu Dec 7 01:54:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 7 Dec 2006 11:54:18 +0200 Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages In-Reply-To: <000001c719dc$185ac590$30cc180a@amr.corp.intel.com> References: <20061207073038.GC26107@mellanox.co.il> <000001c719dc$185ac590$30cc180a@amr.corp.intel.com> Message-ID: <20061207095418.GA2614@mellanox.co.il> > Quoting r. Sean Hefty : > Subject: Re: [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages > > >Yes, I've even observed this with SDP, but I'm not sure why this > >happens. It seems that MADs are sometimes lost even in back to back > >configurations. Any idea why? > > I have no idea why MADs would be lost. In our scale up testing, we *never* saw > lost or dropped MADs to the SA node, even when hitting it with 500,000 queries. > The fact that you're seeing lost MADs is something that we should probably look > into more, someday, hopefully, when I have more time available...
We didn't > notice any issues with the CM messages in our testing, so we didn't examine that > traffic in more detail. Note I only see CM message drops. I had to use rdma_establish and send an extra send after start in SDP to trigger it, but path resolution was always working fine. > Are there counters for QP0/1 that can let us know whether drops are occurring on > the send or receive side? Not sure what you mean. Let's just count the send/receive completions in the MAD layer. -- MST From poknam at gmail.com Thu Dec 7 02:02:00 2006 From: poknam at gmail.com (Lai Dragonfly) Date: Thu, 7 Dec 2006 18:02:00 +0800 Subject: [openib-general] Automatically connect to SRP target Message-ID: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com> Hi all, I'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 on the clients and IBGD-1.8.2-srpt on the targets. I found that even if I run "modprobe ib_srp" or set SRP_LOAD=yes in openib.conf, I cannot find the SRP targets. Only after I execute "srp_daemon -e -o" do all the targets appear as /dev/sdX. Since I want to export the targets to other nodes, is there a way to connect to the targets automatically on each reboot, without typing "srp_daemon -e -o" each time? Thanks in advance. PN From eitan at mellanox.co.il Thu Dec 7 02:28:38 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 07 Dec 2006 12:28:38 +0200 Subject: [openib-general] osm: osmtest new flow of informinfo fails Message-ID: <4577ECD6.2050906@mellanox.co.il> Hi Hal, All osmtest flows fail for me with the following error: I start the log from the first inform info related message to give you the context. Dec 07 11:48:17 656752 [B7FD48E0] -> osmtest_get_node_rec_by_lid: Getting node record for LID 0xFFFF Dec 07 11:48:17 663592 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: Remote error:0x0C00 .
Dec 07 11:48:17 663642 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Dec 07 11:48:17 663694 [B7FD48E0] -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Dec 07 11:48:17 663729 [B7FD48E0] -> osmtest_informinfo_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Dec 07 11:48:17 663759 [B7FD48E0] -> osmtest_informinfo_request: InformInfoRecord IS EXPECTED ERROR ^^^^ Dec 07 11:48:17 667671 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: Remote error:0x0C00 . Dec 07 11:48:17 667705 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Dec 07 11:48:17 667756 [B7FD48E0] -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Dec 07 11:48:17 667789 [B7FD48E0] -> osmtest_informinfo_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Dec 07 11:48:17 667820 [B7FD48E0] -> osmtest_informinfo_request: InformInfo IS EXPECTED ERROR ^^^^ Dec 07 11:48:17 669403 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: Remote error:0x0002 . Dec 07 11:48:17 669436 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Dec 07 11:48:17 669489 [B7FD48E0] -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Dec 07 11:48:17 669561 [B7FD48E0] -> osmtest_informinfo_request: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Dec 07 11:48:17 669590 [B7FD48E0] -> osmtest_informinfo_request: InformInfo UnSubscribe IS EXPECTED ERROR ^^^^ Dec 07 11:48:17 672731 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: Remote error:0x0002 . 
Dec 07 11:48:17 672772 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Dec 07 11:48:17 672826 [B7FD48E0] -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Dec 07 11:48:17 672859 [B7FD48E0] -> osmtest_informinfo_request: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Dec 07 11:48:17 672894 [B7FD48E0] -> osmtest_run: ERR 0146: SA validation database failure (IB_INSUFFICIENT_MEMORY) OpenSM log says: Dec 07 11:48:17 668513 [B57DABB0] -> osm_infr_rcv_process_set_method: ERR 4307: Failed to UnSubscribe to non existin g inform object Dec 07 11:48:17 671896 [B75DDBB0] -> osm_infr_rcv_process_set_method: ERR 4307: Failed to UnSubscribe to non existin g inform object Please let me know if you want me to debug it. Eitan From ramachandra.kuchimanchi at qlogic.com Thu Dec 7 03:02:48 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 07 Dec 2006 16:32:48 +0530 Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover from secondary path back to primary path Message-ID: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com> This fixes a bug due to which failover from secondary path back to primary path was not working. 
Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_ib.c | 4 +++- drivers/infiniband/ulp/vnic/vnic_main.c | 9 +++++---- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_ib.c b/drivers/infiniband/ulp/vnic/vnic_ib.c index 6196e20..56ae9f7 100644 --- a/drivers/infiniband/ulp/vnic/vnic_ib.c +++ b/drivers/infiniband/ulp/vnic/vnic_ib.c @@ -303,10 +303,12 @@ int vnic_ib_get_path(struct netpath *net " path record query\n", config->path_info.status); - netpath_timer(netpath, vnic->config->no_path_timeout); ret = config->path_info.status; } out: + if (ret) + netpath_timer(netpath, vnic->config->no_path_timeout); + return ret; } diff --git a/drivers/infiniband/ulp/vnic/vnic_main.c b/drivers/infiniband/ulp/vnic/vnic_main.c index fca2b90..e15d3f9 100644 --- a/drivers/infiniband/ulp/vnic/vnic_main.c +++ b/drivers/infiniband/ulp/vnic/vnic_main.c @@ -710,17 +710,18 @@ static struct vnic * vnic_handle_npevent case VNIC_PRINP_TIMEREXPIRED: netpath = &vnic->primary_path; netpath->timer_state = NETPATH_TS_EXPIRED; - if (netpath->carrier) + if (!netpath->carrier) update_path_and_reconnect(netpath, vnic); break; case VNIC_SECNP_TIMEREXPIRED: netpath = &vnic->secondary_path; netpath->timer_state = NETPATH_TS_EXPIRED; - if (netpath->carrier) { + if (!netpath->carrier) + update_path_and_reconnect(netpath, vnic); + else { if (vnic->state == VNIC_UNINITIALIZED) vnic_npevent_register(vnic, netpath); - } else - update_path_and_reconnect(netpath, vnic); + } break; case VNIC_PRINP_LINKUP: vnic->primary_path.carrier = 1; From ramachandra.kuchimanchi at qlogic.com Thu Dec 7 03:03:30 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 07 Dec 2006 16:33:30 +0530 Subject: [openib-general] [PATCH 2/2 vex branch] IB/VNIC Fix failover delay issue Message-ID: <4578425A.27226.250CE6A4@ramachandra.kuchimanchi.qlogic.com> This reduces the delay in failover from one path to another. 
When a path is lost, the control and data connections of that path are cleaned up. As part of this a CM DREQ was being sent and we waited for a DREP. During this time the viport thread was blocked which delayed sending of a CONFIG_LINK request to the VEx for the other path. Due to this, there was considerable delay in the failover path becoming active. To fix this, send a DREQ but do not wait for a DREP from the VEx. We need not worry about a DREQ being lost because the VEx will anyway terminate a connection if it does not receive heartbeats. Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_control.c | 4 +--- drivers/infiniband/ulp/vnic/vnic_data.c | 3 --- 2 files changed, 1 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_control.c b/drivers/infiniband/ulp/vnic/vnic_control.c index b6a3e7f..2c55540 100644 --- a/drivers/infiniband/ulp/vnic/vnic_control.c +++ b/drivers/infiniband/ulp/vnic/vnic_control.c @@ -1450,12 +1450,10 @@ void control_cleanup(struct control *con { CONTROL_FUNCTION("%s: control_disconnect()\n", control_ifcfg_name(control)); - init_completion(&control->ib_conn.done); if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0)) printk(KERN_DEBUG "control CM DREQ sending failed\n"); - else - wait_for_completion(&control->ib_conn.done); + control_timer_stop(control); ib_destroy_cm_id(control->ib_conn.cm_id); ib_destroy_qp(control->ib_conn.qp); diff --git a/drivers/infiniband/ulp/vnic/vnic_data.c b/drivers/infiniband/ulp/vnic/vnic_data.c index 0ce81f3..c1d056a 100644 --- a/drivers/infiniband/ulp/vnic/vnic_data.c +++ b/drivers/infiniband/ulp/vnic/vnic_data.c @@ -666,11 +666,8 @@ void data_disconnect(struct data *data) void data_cleanup(struct data *data) { - init_completion(&data->ib_conn.done); if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0)) printk(KERN_DEBUG "data CM DREQ sending failed\n"); - else - wait_for_completion(&data->ib_conn.done); ib_destroy_cm_id(data->ib_conn.cm_id); 
ib_destroy_qp(data->ib_conn.qp); From eeb at bartonsoftware.com Thu Dec 7 03:04:22 2006 From: eeb at bartonsoftware.com (Eric Barton) Date: Thu, 7 Dec 2006 11:04:22 GMT Subject: [openib-general] version #defines for the kernel Message-ID: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com> Hi, I found out there has been a change in the kernel ib_fmr_pool_map_phys() to pass the last parameter by address rather than value. I can cope with either version with by coding... +#if IB_USER_VERBS_ABI_VERSION < 6 fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool, tx->tx_pages, npages, &rd->rd_addr); +#else + fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool, + tx->tx_pages, npages, + rd->rd_addr); +#endif ...but is this the right thing to do? It's the "USER" in IB_USER_VERBS_ABI_VERSION that's making me nervous since this is kernel code. Actually a single OFED version #define would most probably suit my purposes - is that controversial? -- Cheers, Eric From ogerlitz at voltaire.com Thu Dec 7 03:43:27 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 07 Dec 2006 13:43:27 +0200 Subject: [openib-general] version #defines for the kernel In-Reply-To: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com> References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com> Message-ID: <4577FE5F.90407@voltaire.com> Eric Barton wrote: > Hi, > > I found out there has been a change in the kernel ib_fmr_pool_map_phys() to pass the > last parameter by address rather than value. I can cope with either version > with by coding... > > +#if IB_USER_VERBS_ABI_VERSION < 6 > fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool, > tx->tx_pages, npages, > &rd->rd_addr); > +#else > + fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool, > + tx->tx_pages, npages, > + rd->rd_addr); > +#endif > > ...but is this the right thing to do? It's the "USER" in > IB_USER_VERBS_ABI_VERSION that's making me nervous since this is kernel code. 
Indeed, it has nothing to do with user/kernel ABI, the FMR verbs are only exposed to kernel space consumers same for the FMR pool. The ib_fmr_pool_map_phys api change was done in the 2.6.18 cycle Or. From halr at voltaire.com Thu Dec 7 04:07:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2006 07:07:43 -0500 Subject: [openib-general] osm: osmtest new flow of informinfo fails In-Reply-To: <4577ECD6.2050906@mellanox.co.il> References: <4577ECD6.2050906@mellanox.co.il> Message-ID: <1165493190.25587.182601.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-12-07 at 05:28, Eitan Zahavi wrote: > Hi Hal, > > All osmtest flows fail for me with the following error: By all flows, you mean osmtest -a (the all flows test). > I start the log from the first inform info related message to give you > the context. > > Dec 07 11:48:17 656752 [B7FD48E0] -> osmtest_get_node_rec_by_lid: > Getting node record for LID 0xFFFF > Dec 07 11:48:17 663592 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: > Remote error:0x0C00 . > Dec 07 11:48:17 663642 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: > Error on query (IB_REMOTE_ERROR) > Dec 07 11:48:17 663694 [B7FD48E0] -> osmtest_informinfo_request: ERR > 008F: ib_query failed (IB_REMOTE_ERROR) > Dec 07 11:48:17 663729 [B7FD48E0] -> osmtest_informinfo_request: Remote > error = IB_MAD_STATUS_UNSUP_METHOD_ATTR > Dec 07 11:48:17 663759 [B7FD48E0] -> osmtest_informinfo_request: > InformInfoRecord IS EXPECTED ERROR ^^^^ > Dec 07 11:48:17 667671 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: > Remote error:0x0C00 . 
> Dec 07 11:48:17 667705 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: > Error on query (IB_REMOTE_ERROR) > Dec 07 11:48:17 667756 [B7FD48E0] -> osmtest_informinfo_request: ERR > 008F: ib_query failed (IB_REMOTE_ERROR) > Dec 07 11:48:17 667789 [B7FD48E0] -> osmtest_informinfo_request: Remote > error = IB_MAD_STATUS_UNSUP_METHOD_ATTR > Dec 07 11:48:17 667820 [B7FD48E0] -> osmtest_informinfo_request: > InformInfo IS EXPECTED ERROR ^^^^ > Dec 07 11:48:17 669403 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: > Remote error:0x0002 . > Dec 07 11:48:17 669436 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: > Error on query (IB_REMOTE_ERROR) > Dec 07 11:48:17 669489 [B7FD48E0] -> osmtest_informinfo_request: ERR > 008F: ib_query failed (IB_REMOTE_ERROR) > Dec 07 11:48:17 669561 [B7FD48E0] -> osmtest_informinfo_request: Remote > error = IB_SA_MAD_STATUS_REQ_INVALID > Dec 07 11:48:17 669590 [B7FD48E0] -> osmtest_informinfo_request: > InformInfo UnSubscribe IS EXPECTED ERROR ^^^^ > Dec 07 11:48:17 672731 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: > Remote error:0x0002 . > Dec 07 11:48:17 672772 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: > Error on query (IB_REMOTE_ERROR) > Dec 07 11:48:17 672826 [B7FD48E0] -> osmtest_informinfo_request: ERR > 008F: ib_query failed (IB_REMOTE_ERROR) > Dec 07 11:48:17 672859 [B7FD48E0] -> osmtest_informinfo_request: Remote > error = IB_SA_MAD_STATUS_REQ_INVALID > Dec 07 11:48:17 672894 [B7FD48E0] -> osmtest_run: ERR 0146: SA > validation database failure (IB_INSUFFICIENT_MEMORY) This is a failure of the first subscribe. > OpenSM log says: > Dec 07 11:48:17 668513 [B57DABB0] -> osm_infr_rcv_process_set_method: > ERR 4307: Failed to UnSubscribe to non existin > g inform object > Dec 07 11:48:17 671896 [B75DDBB0] -> osm_infr_rcv_process_set_method: > ERR 4307: Failed to UnSubscribe to non existin > g inform object The first one is correct. The second one is due to bad treatment on the valid subscribe. 
Evidently, it is now somehow being treated as an unsubscribe rather than a subscribe. Can you run opensm with -V to see all the log messages which will give a better indication of what path it is taking in osm_infr_rcv_process_set_method. Thanks. > Please let me know if you want me to debug it. This works for me. Not sure what is different. -- Hal > Eitan From shubbell at dbresearch.net Thu Dec 7 04:08:37 2006 From: shubbell at dbresearch.net (Sean Hubbell) Date: Thu, 07 Dec 2006 06:08:37 -0600 Subject: [openib-general] Multicast Group Routing Question In-Reply-To: <1165441086.25587.144751.camel@hal.voltaire.com> References: <45770372.8010700@dbresearch.net> <1165429589.25587.136986.camel@hal.voltaire.com> <4577108F.9080308@dbresearch.net> <1165435407.25587.141052.camel@hal.voltaire.com> <457730C5.9000902@dbresearch.net> <1165441086.25587.144751.camel@hal.voltaire.com> Message-ID: <45780445.5010200@dbresearch.net> Hal Rosenstock wrote: > On Wed, 2006-12-06 at 16:06, Sean Hubbell wrote: > >> Hal Rosenstock wrote: >> >>> On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote: >>> >>> >>>> Hal Rosenstock wrote: >>>> >>>> >>>>> Hi Sean, >>>>> >>>>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote: >>>>> >>>>> >>>>> >>>>>> Hello, >>>>>> >>>>>> I was testing our code and noticed that when I send data using >>>>>> multicast over our ib0 interface, all of the infiniband switches route >>>>>> the data to each switch and each node instead of a node that has an >>>>>> application listening to the interface like Ethernet. Is this by design? >>>>>> >>>>>> >>>>>> >>>>> It depends on what multicast group is being used and which end nodes >>>>> have registered for that group as to where the data is routed. >>>>> >>>>> -- Hal >>>>> >>>>> >>>>> >>>> Hey Hal, >>>> >>>> The multicast group I am sending data to is 224.10.10.x (not >>>> 224.0.0.x) and I have no clients / nodes listening but the data is still >>>> being sent. 
>>>> >>>> >>> Yes, if there is only a sender, the data should not be routed anywhere. >>> >>> >>> >>>> I am using wwtop from warewulf to view the network load for >>>> each node. >>>> >>>> >>> I'm not familiar with those tools. >>> >>> >>> >>>> Does this make sense? >>>> >>>> >>> Nope. To state the obvious, something is not as it seems... >>> >>> Can you state which SM you are using ? >>> >>> Also, can you do the following: >>> saquery -g >>> saquery -m >>> and send me the output. >>> >>> I may have some more experiments once I get that level of info. >>> >>> -- Hal >>> >>> >> We have a Voltaire HW subnet manager. I do not have the saquery command. >> I'll have to find this and install it. >> > > What is running on your end nodes ? Is it OpenIB/OFED or something else > ? If it is OpenIB/OFED, saquery should be there. I think OFED 1.2 > supports the options I mentioned. > > >> Would the web interface help? >> > > Not sure whether there is anything there for this. > > -- Hal > > >> Sean >> > > > Hal, Here are the results: The result of saquery -g on our head node: [root at neptune ~]# saquery -g MCMemberRecord group dump: MGID....................0xff12401bffff0000 : 0x00000000ffffffff Mlid....................0xC000 Mtu.....................0x4 pkey....................0xFFFF Rate....................0x3 MCMemberRecord group dump: MGID....................0xff12401bffff0000 : 0x0000000000000001 Mlid....................0xC001 Mtu.....................0x4 pkey....................0xFFFF Rate....................0x3 The result of saquery -m on our root node: Query SA failed: IB_TIMEOUT Running package openib-diags-1.1.0-0 Sean From halr at voltaire.com Thu Dec 7 04:24:01 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2006 07:24:01 -0500 Subject: [openib-general] Multicast Group Routing Question In-Reply-To: <45780445.5010200@dbresearch.net> References: <45770372.8010700@dbresearch.net> <1165429589.25587.136986.camel@hal.voltaire.com> 
<4577108F.9080308@dbresearch.net> <1165435407.25587.141052.camel@hal.voltaire.com> <457730C5.9000902@dbresearch.net> <1165441086.25587.144751.camel@hal.voltaire.com> <45780445.5010200@dbresearch.net> Message-ID: <1165494213.25587.183122.camel@hal.voltaire.com> On Thu, 2006-12-07 at 07:08, Sean Hubbell wrote: > Hal Rosenstock wrote: > > On Wed, 2006-12-06 at 16:06, Sean Hubbell wrote: > > > >> Hal Rosenstock wrote: > >> > >>> On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote: > >>> > >>> > >>>> Hal Rosenstock wrote: > >>>> > >>>> > >>>>> Hi Sean, > >>>>> > >>>>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote: > >>>>> > >>>>> > >>>>> > >>>>>> Hello, > >>>>>> > >>>>>> I was testing our code and noticed that when I send data using > >>>>>> multicast over our ib0 interface, all of the infiniband switches route > >>>>>> the data to each switch and each node instead of a node that has an > >>>>>> application listening to the interface like Ethernet. Is this by design? > >>>>>> > >>>>>> > >>>>>> > >>>>> It depends on what multicast group is being used and which end nodes > >>>>> have registered for that group as to where the data is routed. > >>>>> > >>>>> -- Hal > >>>>> > >>>>> > >>>>> > >>>> Hey Hal, > >>>> > >>>> The multicast group I am sending data to is 224.10.10.x (not > >>>> 224.0.0.x) and I have no clients / nodes listening but the data is still > >>>> being sent. > >>>> > >>>> > >>> Yes, if there is only a sender, the data should not be routed anywhere. > >>> > >>> > >>> > >>>> I am using wwtop from warewulf to view the network load for > >>>> each node. > >>>> > >>>> > >>> I'm not familiar with those tools. > >>> > >>> > >>> > >>>> Does this make sense? > >>>> > >>>> > >>> Nope. To state the obvious, something is not as it seems... > >>> > >>> Can you state which SM you are using ? > >>> > >>> Also, can you do the following: > >>> saquery -g > >>> saquery -m > >>> and send me the output. > >>> > >>> I may have some more experiments once I get that level of info. 
> >>> > >>> -- Hal > >>> > >>> > >> We have a Voltaire HW subnet manager. I do not have the saquery command. > >> I'll have to find this and install it. > >> > > > > What is running on your end nodes ? Is it OpenIB/OFED or something else > > ? If it is OpenIB/OFED, saquery should be there. I think OFED 1.2 > > supports the options I mentioned. > > > > > >> Would the web interface help? > >> > > > > Not sure whether there is anything there for this. > > > > -- Hal > > > > > >> Sean > >> > > > > > > > Hal, > > Here are the results: > > The result of saquery -g on our head node: > > [root at neptune ~]# saquery -g > > MCMemberRecord group dump: > > MGID....................0xff12401bffff0000 : 0x00000000ffffffff > Mlid....................0xC000 > Mtu.....................0x4 > pkey....................0xFFFF > Rate....................0x3 > > MCMemberRecord group dump: > > MGID....................0xff12401bffff0000 : 0x0000000000000001 > Mlid....................0xC001 > Mtu.....................0x4 > pkey....................0xFFFF > Rate....................0x3 I don't see the mgrp for 224.10.10.x here. > The result of saquery -m on our root node: > > Query SA failed: IB_TIMEOUT This failure can be valid and is SM dependent. -- Hal > Running package openib-diags-1.1.0-0 > > Sean From mst at mellanox.co.il Thu Dec 7 05:29:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 7 Dec 2006 15:29:56 +0200 Subject: [openib-general] potential multicast module issue (was Fwd: FW: IPoIB on c0-6 and c0-7: problem creating mcgroup) Message-ID: <20061207132956.GC2614@mellanox.co.il> Sean, FYI. ----- Forwarded message from Yohad Dickman ----- Subject: FW: IPoIB on c0-6 and c0-7: problem creating mcgroup Date: Thu, 7 Dec 2006 10:07:01 +0200 From: Yohad Dickman Hi Michael, Yesterday, when I ran regression on the gen2_devel driver with the multicast patches, the opensm got an errors on multicast join (described below). Can you check it? 
Thx, Yohad -----Original Message----- From: Yevgeny Kliteynik Sent: Wednesday, December 06, 2006 7:03 PM To: Yohad Dickman; Yevgeny Kliteynik Subject: IPoIB on c0-6 and c0-7: problem creating mcgroup c0-7 (port 2) is trying to create mgroup, but the component mask is missing some bits: __osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 : 0x0000000000000002 from port 0x0002c90200209622 Missing bits in component mask for creating mcgroup: IB_MCR_COMPMASK_QKEY IB_MCR_COMPMASK_TCLASS IB_MCR_COMPMASK_SL IB_MCR_COMPMASK_FLOW Regards, Yevgeny Kliteynik Mellanox Technologies LTD Tel: +972-4-909-7200 ext: 394 Fax: +972-4-959-3245 P.O. Box 586 Yokneam 20692 ISRAEL ----- End forwarded message ----- -- MST From halr at voltaire.com Thu Dec 7 05:33:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2006 08:33:39 -0500 Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch] In-Reply-To: <4576C33C.7050204@mellanox.co.il> References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il> Message-ID: <1165498375.25587.185801.camel@hal.voltaire.com> On Wed, 2006-12-06 at 08:18, Eitan Zahavi wrote: > Hi Hal, > > Here is the same patch against GIT for your convenience. > > Thanks > > EZ > > The simulated regression caught this: > The osm.mcfdbs have now the format: > Switch 0x0002c90000000006 > LID : Out Port(s) > 0xC000 : 0x003 0x004 0x005 0x006 > 0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 > :0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 > :0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B > :0xC01C :0xC01D :0xC01E :0xC01F : > > Which should probably just be: > Switch 0x0002c90000000006 > LID : Out Port(s) > 0xC000 : 0x003 0x004 0x005 0x006 > > Actually switches that do not have any MCG entry will not be included > in the dump file. 
> > Signed-off-by: Eitan Zahavi Thanks. Applied. -- Hal From mst at mellanox.co.il Thu Dec 7 07:03:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 7 Dec 2006 17:03:04 +0200 Subject: [openib-general] [PATCH untested] mthca: map all MTTs/MPTs for FMR on 64 bit Message-ID: <20061207150304.GD2614@mellanox.co.il> We currently reserve separate MPT and MTT space for FMRs so avoid abusing the vmalloc space on 32 bit systems. No such problem exists on 64 bit systems so let's not do it there. This mapping will also make writing MTTs for regular regions directly from driver easier in the future. Signed-off-by: Michael S. Tsirkin --- Roland, this is untested. Could you take a look please? diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c index f71ffa8..3064002 100644 --- a/drivers/infiniband/hw/mthca/mthca_mr.c +++ b/drivers/infiniband/hw/mthca/mthca_mr.c @@ -761,7 +761,7 @@ void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr) int mthca_init_mr_table(struct mthca_dev *dev) { unsigned long addr; - int err, i; + int mpts, mtts, err, i; err = mthca_alloc_init(&dev->mr_table.mpt_alloc, dev->limits.num_mpts, @@ -789,19 +789,26 @@ int mthca_init_mr_table(struct mthca_dev *dev) if (dev->limits.fmr_reserved_mtts) { i = fls(dev->limits.fmr_reserved_mtts - 1); - if (i >= 31) { mthca_warn(dev, "Unable to reserve 2^31 FMR MTTs.\n"); err = -EINVAL; goto err_fmr_mpt; } + mpts = mtts = 1 << i; + } else { + mpts = dev->limits.num_mtt_segs; + mtts = dev->limits.num_mpts; + } + + if (!mthca_is_memfree(dev) && + (dev->mthca_flags & MTHCA_FLAG_FMR)) { addr = pci_resource_start(dev->pdev, 4) + ((pci_resource_len(dev->pdev, 4) - 1) & dev->mr_table.mpt_base); dev->mr_table.tavor_fmr.mpt_base = - ioremap(addr, (1 << i) * sizeof(struct mthca_mpt_entry)); + ioremap(addr, mpts * sizeof(struct mthca_mpt_entry)); if (!dev->mr_table.tavor_fmr.mpt_base) { mthca_warn(dev, "MPT ioremap for FMR failed.\n"); @@ -814,19 +821,21 @@ 
int mthca_init_mr_table(struct mthca_dev *dev) dev->mr_table.mtt_base); dev->mr_table.tavor_fmr.mtt_base = - ioremap(addr, (1 << i) * MTHCA_MTT_SEG_SIZE); + ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE); if (!dev->mr_table.tavor_fmr.mtt_base) { mthca_warn(dev, "MTT ioremap for FMR failed.\n"); err = -ENOMEM; goto err_fmr_mtt; } + } - err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i); + if (dev->limits.fmr_reserved_mtts) { + err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, mtts); if (err) goto err_fmr_mtt_buddy; /* Prevent regular MRs from using FMR keys */ - err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i); + err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, mtts); if (err) goto err_reserve_fmr; diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c index 58d44aa..26bf86d 100644 --- a/drivers/infiniband/hw/mthca/mthca_profile.c +++ b/drivers/infiniband/hw/mthca/mthca_profile.c @@ -277,7 +277,7 @@ u64 mthca_make_profile(struct mthca_dev *dev, * out of the MR pool. They don't use additional memory, but * we assign them as part of the HCA profile anyway. */ - if (mthca_is_memfree(dev)) + if (mthca_is_memfree(dev) || BITS_PER_LONG == 64) dev->limits.fmr_reserved_mtts = 0; else dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; -- MST From eitan at mellanox.co.il Thu Dec 7 07:06:57 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 07 Dec 2006 17:06:57 +0200 Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch] In-Reply-To: <1165498375.25587.185801.camel@hal.voltaire.com> References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il> <1165498375.25587.185801.camel@hal.voltaire.com> Message-ID: <45782E11.5010400@mellanox.co.il> Hi Hal, Great thanks. Applying the patch helps. Now I am stuck behind another issue introduced by the latest patch of incremental Set(LFT). 
I will describe it in separate mail Eitan Hal Rosenstock wrote: > On Wed, 2006-12-06 at 08:18, Eitan Zahavi wrote: > >> Hi Hal, >> >> Here is the same patch against GIT for your convenience. >> >> Thanks >> >> EZ >> >> The simulated regression caught this: >> The osm.mcfdbs have now the format: >> Switch 0x0002c90000000006 >> LID : Out Port(s) >> 0xC000 : 0x003 0x004 0x005 0x006 >> 0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 >> :0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 >> :0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B >> :0xC01C :0xC01D :0xC01E :0xC01F : >> >> Which should probably just be: >> Switch 0x0002c90000000006 >> LID : Out Port(s) >> 0xC000 : 0x003 0x004 0x005 0x006 >> >> Actually switches that do not have any MCG entry will not be included >> in the dump file. >> >> Signed-off-by: Eitan Zahavi >> > > Thanks. Applied. > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Thu Dec 7 07:12:59 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 07 Dec 2006 17:12:59 +0200 Subject: [openib-general] [PATCH] osm: Routing Tables are full of UNREACHABLE instead of real route Message-ID: <45782F7B.1010408@mellanox.co.il> Hi Hal, I resolved the mystery behind the osm.fdbs that is now full of UNREACHABLE instead of correct out ports. The problem is a consequence of the new code that does not use the switch LFT blocks for the intermediate LFT assignments: The idea of having incremental updates only relies on temporary buffer that the routing algorithm fills. Then it is sent to the wire only if there is a diff between the switch LFT tables (from the SMDB) and the temporary buffer. 
So the switch LFT tables are not being directly updated by the routing algorithm - but only by the GetResp obtained as reply to the setting. Until this stage of the description - everything looks right. But what is wrong is that the dump of LFT tables is invoked before the GetResp is obtained. So if only a single sweep is invoked the resulting osm.fdbs show the original state of the SMDB tables, which are full of 0xFF = UNREACHABLE. The patch below is taking the easy way and should probably be revisited. Instead of having a separate algorithm step for dumping out the resulting GetResp data after all LFT responses were obtained it just copies the sent LFT blocks to the SMDB. I think we need to have at least this simple patch until the dump is moved to a new algorithm step. Thanks Eitan Signed-off-by: Eitan Zahavi ===================================================================== diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index 5a55da8..3a62c7f 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table( "osm_ucast_mgr_set_fwd_table: ERR 3A05: " "Sending linear fwd. tbl. block failed (%s)\n", ib_get_err_str( status ) ); - } + } else { + /* + HACK: for now we will assume we succeeded to send + and set the local DB based on it. 
This should allow + us to immediatly dump out our routing + */ + osm_switch_set_ft_block( + p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho); + } } OSM_LOG_EXIT( p_mgr->p_log ); From tziporet at dev.mellanox.co.il Thu Dec 7 07:35:59 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 07 Dec 2006 17:35:59 +0200 Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update - RDMA CM etc In-Reply-To: <6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com> <20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com> <20061206101705.GP26787@mellanox.co.il> <45770AA3.2040505@ichips.intel.com> <6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com> Message-ID: <457834DF.7030400@dev.mellanox.co.il> Moni Levy wrote: >> >> Does OFED want the multicast support in 1.2? >> > > We definitely want the multicast support in 1.2. It's on the wiki ( > OFED 1.2 release plan and features) and I understood that this was > also agreed on at SC06. > > -- Moni > > We want it but it must work properly before we can integrate it. Moni - you suggested help in debugging - can you take the test that Dotan submitted and debug to understand the failure? Thanks, Tziporet From tziporet at mellanox.co.il Thu Dec 7 08:16:03 2006 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Thu, 7 Dec 2006 18:16:03 +0200 Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Message-ID: <6C2C79E72C305246B504CBA17B5500C9521A21@mtlexch01.mtl.com> openib-general at openib.org -----Original Message----- From: Brian Sparks Sent: Thursday, December 07, 2006 6:01 PM To: Tziporet Koren Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Which list... 
Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com -----Original Message----- From: Tziporet Koren Sent: Thursday, December 07, 2006 8:01 AM To: Brian Sparks Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test I can't help - you can send an email to openib mailing list - maybe someone there can help -----Original Message----- From: Brian Sparks Sent: Thursday, December 07, 2006 5:46 PM To: Tziporet Koren Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test I still get an invalid cert Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com -----Original Message----- From: Tziporet Koren Sent: Thursday, December 07, 2006 1:05 AM To: Brian Sparks; Boris Shpolyansky; Aviram Gutman; Sujal Das Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test That's strange. Can you try the following instead: Go to https://openib.org/tiki/tiki-index.php and then press the link to OFED support. Or - instead of clicking the link do copy & paste to the web browser. Tziporet -----Original Message----- From: Brian Sparks Sent: Wednesday, December 06, 2006 6:12 PM To: Tziporet Koren; Boris Shpolyansky; Aviram Gutman; Sujal Das Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Just going to that link gives me a certificate error. 
Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com -----Original Message----- From: Tziporet Koren Sent: Wednesday, December 06, 2006 12:44 AM To: Brian Sparks; Boris Shpolyansky; Aviram Gutman; Sujal Das Cc: FAE; Thad Omura; Hani Salloum Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test What have you tried to do - just read or edit? If you want to edit the file you need to register first, and after you have an account you can login and edit the page. Tziporet -----Original Message----- From: Brian Sparks Sent: Tuesday, December 05, 2006 9:50 PM To: Tziporet Koren; Boris Shpolyansky; Aviram Gutman; Sujal Das Cc: FAE; Thad Omura; Hani Salloum Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test I get a certificate error Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com -----Original Message----- From: Tziporet Koren Sent: Tuesday, December 05, 2006 11:48 AM To: Boris Shpolyansky; Brian Sparks; Aviram Gutman; Sujal Das Cc: FAE; Thad Omura; Hani Salloum Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Boris, The support page can be improved by anyone. So you are welcome to edit it and make it better. (all you need is to login to the Wiki and it's pretty intuitive to edit) If you can't please send me the input and I will try to improve it. 
All - please review the support page and suggest what should be added/changed: https://openib.org/tiki/tiki-index.php?page=OFED+Support Thanks, Tziporet -----Original Message----- From: Boris Shpolyansky Sent: Tuesday, December 05, 2006 9:29 PM To: Brian Sparks; Tziporet Koren; Aviram Gutman; Sujal Das Cc: FAE; Thad Omura Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test I still strongly believe we should have those instructions available. I'm perfectly fine with having them on OpenFabrics web site with us providing a link to them from our web site. We should drive this and if needed put together this page and maintain it as a "service to open source community" - without taking sole support responsibility. Will appreciate everybody's comments. Boris. -----Original Message----- From: Brian Sparks Sent: Tuesday, December 05, 2006 8:06 AM To: Boris Shpolyansky; Tziporet Koren; Aviram Gutman; Sujal Das Cc: FAE; Thad Omura Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test The OFED support pages should be managed on the OF site. Because it's a community source stack, we should not take sole support responsibility and have it delegated to our site. Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com -----Original Message----- From: Boris Shpolyansky Sent: Monday, December 04, 2006 5:37 PM To: Tziporet Koren; Aviram Gutman; Sujal Das; Brian Sparks Cc: FAE Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Hi, As far as I could check we do not have a good support page for OFED stack (as we used to have for IBGD). The one I found on OpenIB wiki is crappy and doesn't look professional at all. I believe we need to set up such page either on our web-site or on OpenIB with a link to it from the main (home) page of both sites. 
It should have links to all relevant documents, code download, known issues and recent patches with clear instructions how to apply those. I'm adding an example of the patch instructions I just sent to Sun. Please, comment. Boris. Here is the procedure you need to follow in order to apply the patch I sent you earlier: 1. Patch the source code tar xvfz OFED-1.1.tgz package cd OFED-1.1/SOURCES tar xvfz mpi_osu-0.9.7-mlx2.2.0.tgz cd mpi_osu-0.9.7-mlx2.2.0 cp /smpi_cancel.patch . echo "smpi_cancel.patch" >> patch.lst cd .. tar cvfz mpi_osu-0.9.7-mlx2.2.0.tgz mpi_osu-0.9.7-mlx2.2.0 cd .. 2. Build OSU MPI (MVAPICH) RPM ./build.sh - choose option 2 "Build InfiniBand Software RPMs" - then choose option 4 "Customize" - then answer yes on "mpi_osu" The new RPM will go to OFED/RPMS directory. 3. Install newly built RPM - either with ./install.sh script - you'll have to make sure to mark all needed components or to install with "-c" option using correct ofed.conf file - or using "rpm -Uhv" command -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Boris Shpolyansky Sent: Monday, December 04, 2006 2:30 PM To: Tziporet Koren Cc: David Costa; Robert Houk; Thomas Babbit; Anthony Vinciguerra; openib-general at openib.org Subject: Re: [openib-general] HPCC benchmark aborts at MPIRandomAccess test I guess we need to have all our recent MPI fixes to be added to the support page. Pasha should keep track of those, including the one I sent to Sun. By the way, where is this support page exactly - on our web site ? Boris. 
-----Original Message----- From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] Sent: Sunday, December 03, 2006 5:50 AM To: Boris Shpolyansky Cc: David Costa; openib-general at openib.org; Robert Houk; Anthony Vinciguerra; Thomas Babbit Subject: Re: [openib-general] HPCC benchmark aborts at MPIRandomAccess test Boris Shpolyansky wrote: > Hi David, > > If you are using OFED-1.1 stack and OSU MVAPICH provided with the > OFED-1.1 package as your MPI layer, > the attached patch should solve your problem. > > Please, let me know if that helped. > > Regards, > Boris, Please add this to OFED 1.1 support page Thanks, Tziporet From Brian at Mellanox.com Thu Dec 7 08:23:51 2006 From: Brian at Mellanox.com (Brian Sparks) Date: Thu, 7 Dec 2006 08:23:51 -0800 Subject: [openib-general] Certification Error Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F510DBB@mtiexch01.mti.com> FYI: To whom it may concern, I'm seeing a certification error on the following link: https://openib.org/tiki/tiki-index.php Regards, Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com From jsquyres at cisco.com Thu Dec 7 08:34:21 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 7 Dec 2006 11:34:21 -0500 Subject: [openib-general] Certification Error In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F510DBB@mtiexch01.mti.com> References: <9FA59C95FFCBB34EA5E42C1A8573784F510DBB@mtiexch01.mti.com> Message-ID: Correct. I think it's simply because OFA didn't purchase an SSL certificate from a well-known CA (such as Verisign). 
On Dec 7, 2006, at 11:23 AM, Brian Sparks wrote: > > > FYI: To whom it may concern, I’m seeing a certification error on > the following link: > > > > https://openib.org/tiki/tiki-index.php > > > > > > Regards, > > > > Brian Sparks > Marketing Communications Manager > > Mellanox Technologies > 2900 Stender Way > Santa Clara, CA 95054 > 408-916-0008 direct - 408-802-2775 cell > http://www.mellanox.com > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From sweitzen at cisco.com Thu Dec 7 08:38:48 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 7 Dec 2006 08:38:48 -0800 Subject: [openib-general] [PATCH] IPoIB CM Experimental support Message-ID: > > You can't send UDP/multicast traffic at all between IPoIB > CM and IPoIB > > UD? > > With my experimental code, this currently works only if you > manually limit the MTU > for multicast/UD addresses. > The simplest way to do this is to set up separate interfaces > for CM and UD modes. Separate interfaces as in ib0 vs ib1? Thus I can use IPoIB HA or IPoIB CM but not both, which is not very useful. Speaking of IPoIB CM, will it work with the OFED IPoIB HA? Scott From Brian at Mellanox.com Thu Dec 7 08:36:45 2006 From: Brian at Mellanox.com (Brian Sparks) Date: Thu, 7 Dec 2006 08:36:45 -0800 Subject: [openib-general] Certification Error Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F510DBD@mtiexch01.mti.com> Jeff, Do you know if/when this problem will be fixed? 
Brian Sparks Marketing Communications Manager Mellanox Technologies 2900 Stender Way Santa Clara, CA 95054 408-916-0008 direct - 408-802-2775 cell http://www.mellanox.com -----Original Message----- From: Jeff Squyres [mailto:jsquyres at cisco.com] Sent: Thursday, December 07, 2006 8:34 AM To: Brian Sparks Cc: openib-general at openib.org Subject: Re: [openib-general] Certification Error Correct. I think it's simply because OFA didn't purchase an SSL certificate from a well-know CA (such as Verisign). On Dec 7, 2006, at 11:23 AM, Brian Sparks wrote: > > > FYI: To whom it may concern, I'm seeing a certification error on > the following link: > > > > https://openib.org/tiki/tiki-index.php > > > > > > Regards, > > > > Brian Sparks > Marketing Communications Manager > > Mellanox Technologies > 2900 Stender Way > Santa Clara, CA 95054 > 408-916-0008 direct - 408-802-2775 cell > http://www.mellanox.com > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From Brian.Cain at ge.com Thu Dec 7 08:33:22 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Thu, 7 Dec 2006 11:33:22 -0500 Subject: [openib-general] website certificate issues In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9521A21@mtlexch01.mtl.com> Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301AC53BA@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- ... > > -----Original Message----- > From: Brian Sparks > Sent: Wednesday, December 06, 2006 6:12 PM > To: Tziporet Koren; Boris Shpolyansky; Aviram Gutman; Sujal Das > Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess > test > > Just going to that link gives me a certificate error. > > > Brian Sparks ... I get the same error. 
I just "Accept for this session" and grumble in silence.

For the folks who aren't getting this error, you might have imported the certificate into your trust store the first time you visited it.

The website maintainer should probably get a certificate signed by one of the root CAs. If one of the national labs can't afford a certificate, consider getting your cert signed by CACert.org (I've got their root cert in my store). As a quick workaround, you can stop redirecting http:// traffic to https://.

BTW, putting the expiration in 2010 is too far into the future. I think two years is a good max.

-Brian

From jsquyres at cisco.com Thu Dec 7 08:40:28 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 7 Dec 2006 11:40:28 -0500
Subject: [openib-general] Certification Error
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F510DBD@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784F510DBD@mtiexch01.mti.com>
Message-ID: <4240A8B1-1219-4A48-87B4-0C3014A9EE40@cisco.com>

I am unaware of any plans to purchase a certificate, but I'm certainly not the authority (hah) on this issue.

I suppose that someone could purchase a certificate (does OFA have funds for this kind of thing? certificates need to be renewed on a periodic basis), but if they do, my $0.02 is that it should be done only for the new server.

On Dec 7, 2006, at 11:36 AM, Brian Sparks wrote:

> Jeff,
> Do you know if/when this problem will be fixed?
>
> Brian Sparks
> Marketing Communications Manager
>
> Mellanox Technologies
> 2900 Stender Way
> Santa Clara, CA 95054
> 408-916-0008 direct - 408-802-2775 cell
> http://www.mellanox.com
>
>
> -----Original Message-----
> From: Jeff Squyres [mailto:jsquyres at cisco.com]
> Sent: Thursday, December 07, 2006 8:34 AM
> To: Brian Sparks
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Certification Error
>
> Correct. I think it's simply because OFA didn't purchase an SSL
> certificate from a well-known CA (such as Verisign).
> > > On Dec 7, 2006, at 11:23 AM, Brian Sparks wrote: > >> >> >> FYI: To whom it may concern, I'm seeing a certification error on >> the following link: >> >> >> >> https://openib.org/tiki/tiki-index.php >> >> >> >> >> >> Regards, >> >> >> >> Brian Sparks >> Marketing Communications Manager >> >> Mellanox Technologies >> 2900 Stender Way >> Santa Clara, CA 95054 >> 408-916-0008 direct - 408-802-2775 cell >> http://www.mellanox.com >> >> >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >> openib-general > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From monil at voltaire.com Thu Dec 7 08:44:49 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 7 Dec 2006 18:44:49 +0200 Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update - RDMA CM etc In-Reply-To: <457834DF.7030400@dev.mellanox.co.il> References: <45759B8C.8010408@dev.mellanox.co.il> <4575BB05.7040106@ichips.intel.com> <4575CD94.8070608@dev.mellanox.co.il> <4575D0A8.7080501@ichips.intel.com> <20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com> <20061206101705.GP26787@mellanox.co.il> <45770AA3.2040505@ichips.intel.com> <6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com> <457834DF.7030400@dev.mellanox.co.il> Message-ID: <6a122cc00612070844g577c50c6p39e2394936ffd794@mail.gmail.com> On 12/7/06, Tziporet Koren wrote: > Moni Levy wrote: > >> > >> Does OFED want the multicast support in 1.2? > >> > > > > We definitely want the multicast support in 1.2. It's on the wiki ( > > OFED 1.2 release plan and features) and I understood that this was > > also agreed on at SC06. > > > > -- Moni > > > > > > We want it but it must work properly before we can integrate it. 
> Moni - you suggested help in debugging - can you take the test that
> Dotan submitted and debug to understand the failure?

Sure

--Moni

>
> Thanks,
> Tziporet
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>

From jlentini at netapp.com Thu Dec 7 08:59:37 2006
From: jlentini at netapp.com (James Lentini)
Date: Thu, 7 Dec 2006 11:59:37 -0500 (EST)
Subject: [openib-general] Status of DAT conformance test
In-Reply-To: <4577DE44.5030308@lfbs.rwth-aachen.de>
References: <4577DE44.5030308@lfbs.rwth-aachen.de>
Message-ID:

On Thu, 7 Dec 2006, Boris Bierbaum wrote:

> I'm looking for ways to test the standard conformance of a uDAPL
> provider. I had a look at the DAT conformance test contained in the
> DAPL reference implementation, release version gamma 3.2.
>
> This test doesn't seem to be in a state in which it can be used to
> test a uDAPL version 1.2 provider, is anybody working to fix this?

I don't know of anyone working on this.

> Which test programs can be recommended for this purpose?

While there is no official uDAPL 1.2 conformance test, the tests in the "test" directory (dapltest, etc.) are fairly comprehensive.

From mshefty at ichips.intel.com Thu Dec 7 09:29:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 07 Dec 2006 09:29:41 -0800 Subject: [openib-general] potential multicast module issue (was Fwd: FW: IPoIB on c0-6 and c0-7: problem creating mcgroup) In-Reply-To: <20061207132956.GC2614@mellanox.co.il> References: <20061207132956.GC2614@mellanox.co.il> Message-ID: <45784F85.9040304@ichips.intel.com> > __osm_mcmr_rcv_join_mgrp: ERR 1B11: > method = SubnAdmSet, > scope_state = 0x1, > component mask = 0x0000000000010083, > expected comp mask = 0x00000000000130c7, > MGID: 0xff12601bffff0000 : 0x0000000000000002 > from port 0x0002c90200209622 > > Missing bits in component mask for creating mcgroup: > > IB_MCR_COMPMASK_QKEY > IB_MCR_COMPMASK_TCLASS > IB_MCR_COMPMASK_SL > IB_MCR_COMPMASK_FLOW The multicast module uses the component mask given to it by the user when sending join requests to the SA. Was this a join request coming from ipoib (an IPv6?)? There are definite cases where ipoib does a join without doing a create that I tried to leave as they were. - Sean From ralph.campbell at qlogic.com Thu Dec 7 10:05:19 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 07 Dec 2006 10:05:19 -0800 Subject: [openib-general] [PATCH v3 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <4577C149.3050900@voltaire.com> References: <1165430156.14800.243.camel@brick.pathscale.com> <4577C149.3050900@voltaire.com> Message-ID: <1165514719.14800.273.camel@brick.pathscale.com> On Thu, 2006-12-07 at 09:22 +0200, Or Gerlitz wrote: > Ralph Campbell wrote: > > This version of the patch adds support for ib_dma_alloc_coherent() > > and ib_dma_free_coherent(). It also fixes the bug Or found in > > ipath_sync_single_for_cpu() and ipath_sync_single_for_device(). > > > This patch implements the interposing DMA mapping functions to allow > > support for IOMMUs and remove the dependence on phys_to_virt(). 
>
> Haven't you said that the ipath driver uses bus_to_virt ?

It did, this patch removes that too.

> > diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c
> > --- /dev/null Thu Jan 01 00:00:00 1970 +0000
> > +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Tue Dec 05 16:04:53 2006 -0800
> > +/**
> > + * ipath_dma_map_single - Map a kernel virtual address to DMA address
> > + * @dev: The device for which the dma_addr is to be created
> > + * @cpu_addr: The kernel virtual address
> > + * @size: The size of the region in bytes
> > + * @direction: The direction of the DMA
> > + */
> > +static u64 ipath_dma_map_single(struct ib_device *dev,
> > +				void *cpu_addr, size_t size,
> > +				enum dma_data_direction direction)
> > +{
> > +	BUG_ON(!valid_dma_direction(direction));
> > +	return (u64) cpu_addr;
> > +}
>
> The documentation is both overkill in its volume and, worse, simply
> tells a whole different story than what this code is doing. It does not
> generate a DMA address, it does not care about the ib device nor the size
> or dma direction. Same for all the documentation below.

OK. I have removed the comments and added the following at the top:

/*
 * The following functions implement driver specific replacements
 * for the ib_dma_*() functions.
 *
 * These functions return kernel virtual addresses instead of
 * device bus addresses since the driver uses the CPU to copy
 * data instead of using hardware DMA.
 */

> > +/**
> > + * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry
> > + * @dev: The device for which the DMA addresses were created
> > + * @sg: The scatter/gather entry
> > + */
> > +static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
> > +{
> > +	return (u64) page_address(sg->page);
> > +}
>
> this is a bug, you need to add sg->offset
>
> Or.

Thanks, applied.

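The sg->offset point above can be illustrated outside the kernel. What follows is only a minimal userspace sketch, not the ipath driver code: `struct fake_sg`, `sg_addr_buggy()` and `sg_addr_fixed()` are hypothetical names, and a plain buffer pointer stands in for what page_address() would return. It assumes, as the exchange describes, that the "DMA address" is just a kernel virtual address and that the data starts `offset` bytes into the page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scatter/gather entry: the data lives
 * 'offset' bytes into the page that starts at 'page_base'. */
struct fake_sg {
	void *page_base;	/* plays the role of page_address(sg->page) */
	unsigned int offset;	/* byte offset of the data within the page */
	unsigned int length;	/* number of data bytes */
};

/* Behaviour flagged in review: returns the start of the page and
 * ignores the offset, so it points at the wrong bytes whenever
 * offset != 0. */
static uint64_t sg_addr_buggy(const struct fake_sg *sg)
{
	return (uint64_t)(uintptr_t)sg->page_base;
}

/* Behaviour after the fix: add the offset, but leave 0 untouched so
 * that an unmapped page is still reported as "no address". */
static uint64_t sg_addr_fixed(const struct fake_sg *sg)
{
	uint64_t addr = (uint64_t)(uintptr_t)sg->page_base;

	if (addr)
		addr += sg->offset;
	return addr;
}
```

For an entry whose data begins 128 bytes into a buffer, the first helper returns the buffer start while the second returns the buffer start plus 128, which is where the data actually lives.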
From ralphc at pathscale.com Thu Dec 7 10:47:33 2006 From: ralphc at pathscale.com (Ralph Campbell) Date: Thu, 07 Dec 2006 10:47:33 -0800 Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions Message-ID: <1165517253.14800.283.camel@brick.pathscale.com> This version of the patch fixes ipath_sg_dma_address() and updates the comments for ipath_dma.c as Or Gerlitz suggested. This patch implements the interposing DMA mapping functions to allow support for IOMMUs and remove the dependence on phys_to_virt() and bus_to_virt(). From: Ralph Campbell diff -r c76ed2f1387b drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/Makefile Wed Nov 29 13:54:36 2006 -0800 @@ -6,6 +6,7 @@ ib_ipath-y := \ ib_ipath-y := \ ipath_cq.o \ ipath_diag.o \ + ipath_dma.o \ ipath_driver.o \ ipath_eeprom.o \ ipath_file_ops.o \ diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Wed Nov 29 13:54:36 2006 -0800 @@ -134,7 +134,7 @@ int ipath_lkey_ok(struct ipath_qp *qp, s */ if (sge->lkey == 0) { isge->mr = NULL; - isge->vaddr = bus_to_virt(sge->addr); + isge->vaddr = (void *) sge->addr; isge->length = sge->length; isge->sge_length = sge->length; ret = 1; @@ -202,12 +202,12 @@ int ipath_rkey_ok(struct ipath_qp *qp, s int ret; /* - * We use RKEY == zero for physical addresses - * (see ipath_get_dma_mr). + * We use RKEY == zero for kernel virtual addresses + * (see ipath_get_dma_mr and ipath_dma.c). 
*/ if (rkey == 0) { sge->mr = NULL; - sge->vaddr = phys_to_virt(vaddr); + sge->vaddr = (void *) vaddr; sge->length = len; sge->sge_length = len; ss->sg_list = NULL; diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_mr.c --- a/drivers/infiniband/hw/ipath/ipath_mr.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_mr.c Wed Nov 29 13:54:37 2006 -0800 @@ -54,6 +54,8 @@ static inline struct ipath_fmr *to_ifmr( * @acc: access flags * * Returns the memory region on success, otherwise returns an errno. + * Note that all DMA addresses should be created via the + * struct ib_dma_mapping_ops functions (see ipath_dma.c). */ struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc) { @@ -149,8 +151,7 @@ struct ib_mr *ipath_reg_phys_mr(struct i m = 0; n = 0; for (i = 0; i < num_phys_buf; i++) { - mr->mr.map[m]->segs[n].vaddr = - phys_to_virt(buffer_list[i].addr); + mr->mr.map[m]->segs[n].vaddr = (void *) buffer_list[i].addr; mr->mr.map[m]->segs[n].length = buffer_list[i].size; mr->mr.length += buffer_list[i].size; n++; @@ -347,7 +348,7 @@ int ipath_map_phys_fmr(struct ib_fmr *ib n = 0; ps = 1 << fmr->page_shift; for (i = 0; i < list_len; i++) { - fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]); + fmr->mr.map[m]->segs[n].vaddr = (void *) page_list[i]; fmr->mr.map[m]->segs[n].length = ps; if (++n == IPATH_SEGSZ) { m++; diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Nov 29 13:54:37 2006 -0800 @@ -1599,6 +1599,7 @@ int ipath_register_ib_device(struct ipat dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; dev->mmap = ipath_mmap; + dev->dma_ops = &ipath_dma_mapping_ops; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s", init_utsname()->nodename); diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h 
Wed Nov 29 13:28:14 2006 +0800 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Wed Nov 29 13:54:37 2006 -0800 @@ -812,4 +812,6 @@ extern unsigned int ib_ipath_max_srq_wrs extern const u32 ib_ipath_rnr_table[]; +extern struct ib_dma_mapping_ops ipath_dma_mapping_ops; + #endif /* IPATH_VERBS_H */ diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Thu Dec 07 10:06:46 2006 -0800 @@ -0,0 +1,189 @@ +/* + * Copyright (c) 2006 QLogic, Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include "ipath_verbs.h" + +#define BAD_DMA_ADDRESS ((u64) 0) + +/* + * The following functions implement driver specific replacements + * for the ib_dma_*() functions. + * + * These functions return kernel virtual addresses instead of + * device bus addresses since the driver uses the CPU to copy + * data instead of using hardware DMA. + */ + +static int ipath_mapping_error(struct ib_device *dev, u64 dma_addr) +{ + return dma_addr == BAD_DMA_ADDRESS; +} + +static u64 ipath_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); + return (u64) cpu_addr; +} + +static void ipath_dma_unmap_single(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); +} + +static u64 ipath_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction) +{ + u64 addr; + + BUG_ON(!valid_dma_direction(direction)); + + if (offset + size > PAGE_SIZE) { + addr = BAD_DMA_ADDRESS; + goto done; + } + + addr = (u64) page_address(page); + if (addr) + addr += offset; + /* TODO: handle highmem pages */ + +done: + return addr; +} + +static void ipath_dma_unmap_page(struct ib_device *dev, + u64 addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); +} + +int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + u64 addr; + int i; + int ret = nents; + + BUG_ON(!valid_dma_direction(direction)); + + for (i = 0; i < nents; i++) { + addr = (u64) page_address(sg[i].page); + /* TODO: handle highmem pages */ + if (!addr) { + ret = 0; + break; + } + } + return ret; +} + +static void ipath_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + BUG_ON(!valid_dma_direction(direction)); +} + 
+static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg) +{ + u64 addr = (u64) page_address(sg->page); + + if (addr) + addr += sg->offset; + return addr; +} + +static unsigned int ipath_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg) +{ + return sg->length; +} + +static void ipath_sync_single_for_cpu(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ +} + +static void ipath_sync_single_for_device(struct ib_device *dev, + u64 addr, + size_t size, + enum dma_data_direction dir) +{ +} + +static void *ipath_dma_alloc_coherent(struct ib_device *dev, size_t size, + u64 *dma_handle, gfp_t flag) +{ + struct page *p; + void *addr = NULL; + + p = alloc_pages(flag, get_order(size)); + if (p) + addr = page_address(p); + if (dma_handle) + *dma_handle = (u64) addr; + return addr; +} + +static void ipath_dma_free_coherent(struct ib_device *dev, size_t size, + void *cpu_addr, dma_addr_t dma_handle) +{ + free_pages((unsigned long) cpu_addr, get_order(size)); +} + +struct ib_dma_mapping_ops ipath_dma_mapping_ops = { + ipath_mapping_error, + ipath_dma_map_single, + ipath_dma_unmap_single, + ipath_dma_map_page, + ipath_dma_unmap_page, + ipath_map_sg, + ipath_unmap_sg, + ipath_sg_dma_address, + ipath_sg_dma_len, + ipath_sync_single_for_cpu, + ipath_sync_single_for_device, + ipath_dma_alloc_coherent, + ipath_dma_free_coherent +}; From halr at voltaire.com Thu Dec 7 11:58:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2006 14:58:48 -0500 Subject: [openib-general] [PATCH] osm: Routing Tables are full of UNREACHABLE instead of real route In-Reply-To: <45782F7B.1010408@mellanox.co.il> References: <45782F7B.1010408@mellanox.co.il> Message-ID: <1165521425.25587.198999.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-12-07 at 10:12, Eitan Zahavi wrote: > Hi Hal, > > I resolved the mystery behind the osm.fdbs that is now full of > UNREACHABLE instead of correct out ports. 
>
> The problem is a consequence of the new code that does not use the
> switch LFT blocks for the intermediate LFT assignments:
> The idea of having incremental updates only relies on temporary buffer
> that the routing algorithm fills.
> Then it is sent to the wire only if there is a diff between the switch
> LFT tables (from the SMDB) and the temporary buffer.
>
> So the switch LFT tables are not being directly updated by the routing
> algorithm - but only by the GetResp obtained as
> reply to the setting. Until this stage of the description - everything
> looks right.
>
> But what is wrong is that the dump of LFT tables is invoked before the
> GetResp is obtained.
> So if only a single sweep is invoked the resulting osm.fdbs show the
> original state of the SMDB tables which is full of 0xFF = UNREACHABLE.
>
> The patch below is taking the easy way and should probably be revisited.
> Instead of having a separate algorithm step for dumping out the
> resulting GetResp data after all LFT responses were obtained it just
> copies the sent LFT blocks to the SMDB.

Any idea on why the LFT set failed ?

> I think we need to have at least this simple patch until we have the
> dump move to a new algorithm step.

Good find. Applied. Thanks.

We'll revisit a longer term solution to this issue.

-- Hal

> Thanks
> Eitan
>
> Signed-off-by: Eitan Zahavi

From swise at opengridcomputing.com Thu Dec 7 12:14:43 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 14:14:43 -0600
Subject: [openib-general] [ANNOUNCE] - Ammasso Library Git Repository
Message-ID: <1165522483.14449.39.camel@stevo-desktop>

The Ammasso RDMA library is now maintained via git at:

git://staging.openfabrics.org/~swise/libamso.git

Thanks,

Steve.

From swise at opengridcomputing.com Thu Dec 7 12:20:11 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 14:20:11 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <1165522811.14449.46.camel@stevo-desktop>

The Chelsio T3 RDMA Library is now maintained via git at:

git://staging.openfabrics.org/~swise/libcxgb3.git

I'm also maintaining a kernel git repository with the various needed patches for running the T3 device at:

git://staging.openfabrics.org/~swise/cxgb3.git

This repository is based on Linus's git tree as of 2.6.19. The cxgb3 branch should be checked out to get the latest T3 patches including the low level Ethernet driver. This repos will eventually go away (hopefully :) as the T3 drivers are pulled into 2.6.20.

Thanks,

Steve.

From cap at nsc.liu.se Thu Dec 7 12:20:48 2006
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Thu, 7 Dec 2006 21:20:48 +0100
Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem
In-Reply-To: <1165440197.2894.5.camel@julia.et.endace.com>
References: <1165440197.2894.5.camel@julia.et.endace.com>
Message-ID: <200612072120.52117.cap@nsc.liu.se>

On Wednesday 06 December 2006 22:23, vishal wrote:
> Hi,
>
> Was trying to install IBGOLD on Red Hat 4 (x86_64), and the
> following is the 'error' part from a log file. I couldn't find the
> -Xcompiler option in the gcc manual. Am I missing something ?

First, this list isn't really a good place for IBGD questions, you should probably contact Mellanox.

That said, you are probably missing some packages. Make sure you have at least gcc, libgcc and glibc-devel installed (building IBGD will require more but those are probably a start).

/Peter (who has built IBGD from 0.5.0 to 1.8.2 on EL4)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From robert.j.woodruff at intel.com Thu Dec 7 13:55:55 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 7 Dec 2006 13:55:55 -0800 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories Message-ID: Steve Wise wrote, >I'm also maintaining a kernel git repository with the various needed >patches for running the T3 device at: >git://staging.openfabrics.org/~swise/cxgb3.git >This repository is based on Linus's git tree as of 2.6.19. The cxgb3 >branch should be checked out to get the latest T3 patches including the >low level Ethernet driver. This repos will eventually go away >(hopefully :) as the T3 drivers are pulled into 2.6.20. It looks like this tree is not based on 2.6.19 for the drivers/infiniband/core but some other tree. When I do a git-diff of your tree against linux-2.6.19 there are diffs in the drivers/infiniband/core that should not be there. I think you need to rebase your tree on a stock linux 2.6.19 and then only add the cxgb3 code, or have a branch that only contains the cxgb3 code and another branch that might contain other newer infiniband/core code if you want to test with that. This way, someone can easily do a git-diff of cxgb3 with a stock linux-2.6.19 to generate a patch that only contains the Chelsio code. woody From swise at opengridcomputing.com Thu Dec 7 14:09:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 07 Dec 2006 16:09:37 -0600 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: References: Message-ID: <1165529377.14449.75.camel@stevo-desktop> On Thu, 2006-12-07 at 13:55 -0800, Woodruff, Robert J wrote: > Steve Wise wrote, > >I'm also maintaining a kernel git repository with the various needed > >patches for running the T3 device at: > > >git://staging.openfabrics.org/~swise/cxgb3.git > > >This repository is based on Linus's git tree as of 2.6.19. 
The cxgb3
> >branch should be checked out to get the latest T3 patches including the
> >low level Ethernet driver. This repos will eventually go away
> >(hopefully :) as the T3 drivers are pulled into 2.6.20.
>
> It looks like this tree is not based on 2.6.19 for the
> drivers/infiniband/core
> but some other tree. When I do a git-diff of your tree against
> linux-2.6.19
> there are diffs in the drivers/infiniband/core that should not be there.
>

It is based on 2.6.19. But it also has Sean's ucma patch series (the old 7 part patch series...I haven't updated to his latest 5-part patch set yet or tried to use his git tree). Plus it has an iwcm fix from Krishna Kumar that fixed bugs I've hit during QA testing.

> I think you need to rebase your tree on a stock linux 2.6.19 and then
> only add
> the cxgb3 code, or have a branch that only contains the cxgb3 code and
> another
> branch that might contain other newer infiniband/core code if you want
> to test with that.

Yea maybe. For now, you get everything I need to make cxgb3 run on 2.6.19. I'll think about the multiple branch approach.

> This way, someone can easily do a git-diff of cxgb3 with a stock
> linux-2.6.19 to
> generate a patch that only contains the Chelsio code.

I'm struggling with maintaining a patch series in-review on lkml and netdev, plus maintaining a consistent tree that I can QA on and not introduce bugs from other stuff going into 2.6.20. So I don't want to just base this tree on Roland's for-2.6.20, as an example. I really just want 2.6.19 + stuff needed to run chelsio's T3. Right now, that is the UCMA stuff + a few core fixes...

Roland, I welcome your thoughts too on how I should do this. I'm new to git. Also I'm using stgit to maintain the chelsio driver patch series, so I continually pop it and add fixes to each patch as I fix things, so the tree really is kind of in-flux...

Steve.

From rdreier at cisco.com Thu Dec 7 14:13:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 Dec 2006 14:13:15 -0800 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: <1165529377.14449.75.camel@stevo-desktop> (Steve Wise's message of "Thu, 07 Dec 2006 16:09:37 -0600") References: <1165529377.14449.75.camel@stevo-desktop> Message-ID: > Plus it has an iwcm fix from Krishna Kumar that > fixed bugs I've hit during QA testing. Do I have that patch (or is it in 2.6.20 already)? > Roland, I welcome your thoughts too on how I should do this. I'm new to > git. Also I'm using stgit to maintain the chelsio driver patch series, > so I continually pop it and add fixes to each patch as I fix things, so > the tree really is kind of in-flux... What you're doing sounds reasonable. If you want to create a "chelsio prerequisites" branch that might address Woody's concern -- then a git diff between the branches would show the chelsio changes only. And that would be really cheap to do -- just create a new branch pointing at the commit before the chelsio stuff in your stack. From swise at opengridcomputing.com Thu Dec 7 14:16:13 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 07 Dec 2006 16:16:13 -0600 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: References: <1165529377.14449.75.camel@stevo-desktop> Message-ID: <1165529773.14449.80.camel@stevo-desktop> On Thu, 2006-12-07 at 14:13 -0800, Roland Dreier wrote: > > Plus it has an iwcm fix from Krishna Kumar that > > fixed bugs I've hit during QA testing. > > Do I have that patch (or is it in 2.6.20 already)? > Yes. > > Roland, I welcome your thoughts too on how I should do this. I'm new to > > git. Also I'm using stgit to maintain the chelsio driver patch series, > > so I continually pop it and add fixes to each patch as I fix things, so > > the tree really is kind of in-flux... > > What you're doing sounds reasonable. 
If you want to create a "chelsio > prerequisites" branch that might address Woody's concern -- then a git > diff between the branches would show the chelsio changes only. And > that would be really cheap to do -- just create a new branch pointing > at the commit before the chelsio stuff in your stack. Lemme try this out and Woody: I'll get back to ya. Steve. From robert.j.woodruff at intel.com Thu Dec 7 14:21:09 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 7 Dec 2006 14:21:09 -0800 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories Message-ID: Steve wrote, >Yea maybe. For now, you get everything I need to make cxgb3 run on >2.6.19. I'll think about the multiple branch approach. The issue is this. I am working on putting together an OFA integration tree that integrates several components from several different developers. The same will be true when we start to integrate code into OFED 1.2. Most code will come from Linus's tree, but some code will need to come directly from the developer's git trees and we will need a way to generate a patch for only your code, as we will get things like the local_sa cache code directly from Sean's. So if you can make a branch that only contains the cxgb3 code, it makes generating a patch with only your code easier, and this will be needed both for my early OFA integration work and also for OFED 1.2. Once your code is upstream, life is easier as we will get it from linus, until then we'd like a way to patch the existing released kernel (2.6.19 in this case) with your code. make sense ? woody From swise at opengridcomputing.com Thu Dec 7 14:24:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 07 Dec 2006 16:24:10 -0600 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: References: Message-ID: <1165530250.14449.85.camel@stevo-desktop> On Thu, 2006-12-07 at 14:21 -0800, Woodruff, Robert J wrote: > Steve wrote, > >Yea maybe. 
For now, you get everything I need to make cxgb3 run on > >2.6.19. I'll think about the multiple branch approach. > > The issue is this. I am working on putting together an OFA integration > tree that integrates several components from several different > developers. > The same will be true when we start to integrate code into OFED 1.2. > Most code will come from Linus's tree, but some code will need to > come directly from the developer's git trees and we will need > a way to generate a patch for only your code, as we will get things like > the local_sa cache code directly from Sean's. > > So if you can make a branch that only contains the cxgb3 code, it makes > generating a patch with only your code easier, and this will be needed > both for my early OFA integration work and also for OFED 1.2. > Once your code is upstream, life is easier as we will get it from > linus, until then we'd like a way to patch the existing released kernel > (2.6.19 in this case) with your code. > > make sense ? I understand. From robert.j.woodruff at intel.com Thu Dec 7 14:24:05 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 7 Dec 2006 14:24:05 -0800 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories Message-ID: Steve wrote> >> >> What you're doing sounds reasonable. If you want to create a "chelsio >> prerequisites" branch that might address Woody's concern -- then a git >> diff between the branches would show the chelsio changes only. And >> that would be really cheap to do -- just create a new branch pointing >> at the commit before the chelsio stuff in your stack. >Lemme try this out and Woody: I'll get back to ya. >Steve. 
Thanks

woody

From mshefty at ichips.intel.com Thu Dec 7 14:24:32 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 07 Dec 2006 14:24:32 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165529377.14449.75.camel@stevo-desktop>
References: <1165529377.14449.75.camel@stevo-desktop>
Message-ID: <457894A0.6020002@ichips.intel.com>

> I'm struggling with maintaining a patch series in-review on lkml and
> netdev, plus maintaining a consistent tree that I can QA on and not
> introduce bugs from other stuff going into 2.6.20. So I don't want to
> just base this tree on Roland's for-2.6.20, as an example. I really
> just want 2.6.19 + stuff needed to run chelsio's T3. Right now, that is
> the UCMA stuff + a few core fixes...

I'm sure Roland can provide more input here, but what I did was start with 2.6.19. Then, for each feature set in SVN, I created a new git branch, reworked the SVN patches, and applied them to that branch. Where I had dependencies, I simply branched off one of my branches. For example, my multicast branch is off my rdma_ucm branch. My master branch is 2.6.19. My intent is to update my tree with each new Linux release.

As an aside, I created a test-apps branch to throw all my kernel test apps into. (I really didn't want to maintain a branch per test app, since these will never merge upstream.) I included krping in that tree, since I didn't see where you were maintaining it, and I didn't want to lose it.

> Roland, I welcome your thoughts too on how I should do this. I'm new to
> git. Also I'm using stgit to maintain the chelsio driver patch series,
> so I continually pop it and add fixes to each patch as I fix things, so
> the tree really is kind of in-flux...

I didn't think that you wanted to do this after you've published a tree. If someone clones your tree, then you use stgit to pop a patch, modify it, then recommit it, I'm not sure how a cloned tree reconciles the changes.

- Sean From caryang at cisco.com Thu Dec 7 14:27:31 2006 From: caryang at cisco.com (Carl Yang (caryang)) Date: Thu, 7 Dec 2006 14:27:31 -0800 Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support foroperation over IPoIB Message-ID: Or, Can you please forward me (or to the email alias) "an example bonding sysfs script which can be used to set bonding to work with patches 1-3?" Thanks, Carl -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Or Gerlitz Sent: Thursday, November 30, 2006 2:57 AM To: netdev at vger.kernel.org Cc: Roland Dreier (rdreier); Jay Vosburgh; openib-general at openib.org Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support foroperation over IPoIB This patch series is a second version (see below link to V1) of the suggested changes to the bonding driver such that it would be able to support non ARPHRD_ETHER netdevices for its High-Availability (active-backup) mode. The motivation is to enable the bonding driver on its HA mode to work with the IP over Infiniband (IPoIB) driver. With these patches I was able to enslave IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with fail-over and fail-back working fine. My working env was the net-2.6.20 git. More over, as IPoIB is also the IB ARP provider for the RDMA CM driver which is used by native IB ULPs whose addressing scheme is based on IP (eg iSER, SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for these ULPs. This holds as when the ULP is informed by the IB HW on the failure of the current IB connection, it just need to reconnect, where the bonding device will now issue the IB ARP over the active IPoIB slave. The first patch changes some of the bond netdevice attributes and functions to be that of the active slave for the case of the enslaved device not being of ARPHRD_ETHER type. 
Basically it overrides those settings made by ether_setup(), which are netdevice **type** dependent and hence might not be appropriate for devices of other types. It also enforces mutual exclusion on bonding slaves from dissimilar ether types, as was concluded in the v1 discussion. The IPoIB MAC address (see Documentation/infiniband/ipoib.txt) is made of a 3-byte IB QP (Queue Pair) number and the 16-byte IB port GID (Global ID) of the port this IPoIB device is bound to. The QP is a resource created by the IB HW and the GID is an identifier burned into the HCA (I have omitted here some details which are not important for the bonding RFC). Basically the IPoIB spec and implementation do not allow for setting the MAC address of an IPoIB device, and this work was made under this assumption. Hence, the second patch allows for enslaving netdevices which do not support the set_mac_address() function. In that case the bond mac address is the one of the active slave, where remote peers are notified of the mac address (neighbour) change by a Gratuitous ARP sent by bonding when fail-over occurs (this is already done by the bonding code). Normally, the bonding driver is UP before any enslavement takes place. Once a netdevice is UP, the network stack acts to have it join some multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup() has set the bonding device type to be ARPHRD_ETHER and the address len to be ETHER_ALEN, the net core code computes a wrong multicast link address. This is because ip_eth_mc_map() is called, whereas for mcast joins taking place **after** the enslavement another ip_xxx_mc_map() is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND). The third patch handles this problem by allowing to enslave devices when the bonding device is not up. In the discussion held at the previous post this seemed to be the cleanest way to go, where it is not expected to cause instabilities.
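To make the wrong-link-address problem above a bit more concrete, here is a minimal user-space sketch, assuming the standard IPv4-to-Ethernet multicast mapping (01:00:5e followed by the low 23 bits of the group address). The helper name eth_mc_map_sketch is made up for illustration; it is not the kernel's ip_eth_mc_map() and takes the group in host byte order for simplicity. It only shows that an Ethernet-style mapping yields a 6-byte address, which cannot be right for a bond whose active slave uses the longer IPoIB hardware address.

```c
#include <stdint.h>

/*
 * Hypothetical helper (illustration only, not kernel code):
 * map an IPv4 multicast group, given in host byte order, to a
 * 6-byte Ethernet multicast address.  Only the low 23 bits of
 * the group survive, so several groups share one link address.
 */
static void eth_mc_map_sketch(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;                   /* fixed multicast prefix 01:00:5e */
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;   /* top bit of this byte is always 0 */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

With this sketch the all-hosts group 224.0.0.1 (0xE0000001) comes out as 01:00:5e:00:00:01. A link address of that shape is what gets computed while the bond is still typed ARPHRD_ETHER, which is why joins done before enslaving an IPoIB device end up wrong.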
These patches are not enough for configuration of IPoIB bonding through tools (eg /sbin/ifenslave and /sbin/ifup) provided by packages such as sysconfig and initscripts, specifically since these tools sets the bonding device to be UP before enslaving anything. Once this patchset gets positive/feedback the next step would be to look how to enhance the tools/packages so it would be possible to bond/enslave with the modified code. As suggested by the bonding maintainer, this step can potentially involve converting ifenslave to be a script based on the bonding sysfs infrastructure rather on the somehow obsoleted Documentation/networking/ifenslave.c For the ease of potential testers, I will post an example bonding sysfs script which can be used to set bonding to work with patches 1-3 (let me know!) Or. changes from V1 (the links point to V1 0-3/3) http://marc.theaimsgroup.com/?l=linux-netdev&m=115926582209736&w=2 http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599515568&w=2 http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599430055&w=2 http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599415729&w=2 + enforce mutual exclusion on the slaves ether types don't attempt to + set the bond mtu when enslaving a non ARPHRD_ETHER device rather than + hack the bond device ether type through mod params allow enslavement when the bond device is not up _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Thu Dec 7 14:30:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 07 Dec 2006 16:30:44 -0600 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: <457894A0.6020002@ichips.intel.com> References: <1165529377.14449.75.camel@stevo-desktop> <457894A0.6020002@ichips.intel.com> Message-ID: 
<1165530644.14449.88.camel@stevo-desktop> On Thu, 2006-12-07 at 14:24 -0800, Sean Hefty wrote: > > I'm struggling with maintaining a patch series in-review on lklm and > > netdev, plus maintaining a consistent tree that I can QA on and not > > introduce bugs from other stuff going into 2.6.20. So I don't want to > > just base this tree on Roland's for-2.6.20, as an example. I really > > just want 2.6.19 + stuff needed to run chelsio's T3. Right now, that is > > the UCMA stuff + a few core fixes... > > I'm sure Roland can provide more input here, but what I did was start with > 2.6.19. Then, for each feature set in SVN, I created a new git branch, reworked > the SVN patches, and applied them to that branch. Where I had dependencies, I > simply branched off one of my branches. For example, my multicast branch is off > my rdma_ucm branch. > > My master branch is 2.6.19. My intent is to update my tree with each new Linux > release. > > As an aside, I created a test-apps branch to throw all my kernel test apps into. > (I really didn't want to maintain a branch per test app, since these will > never merge upstream.) I included krping in that tree, since i didn't see where > you were maintaining it, and I didn't want to lose it. > Thanks! I forgot about that stuff! > > Roland, I welcome your thoughts too on how I should do this. I'm new to > > git. Also I'm using stgit to maintain the chelsio driver patch series, > > so I continually pop it and add fixes to each patch as I fix things, so > > the tree really is kind of in-flux... > > I didn't think that you wanted to do this after you've published a tree. If > someone clones your tree, then you use stgit to pop a patch, modify it, then > recommit it, I'm not how a cloned tree reconciles the changes. > Well, the life of this T3 git tree will hopefully be short: we're trying hard to get the kernel bits of T3 into 2.6.20... You're right. Folks cannot back against this tree and do a pull to refresh. It'll get balled up. 
But Roland's tree is the same way. Steve. From halr at voltaire.com Thu Dec 7 14:32:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2006 17:32:29 -0500 Subject: [openib-general] [PATCH] Diags/saquery: Add support for querying ServiceRecords Message-ID: <1165530723.25587.203519.camel@hal.voltaire.com> Diags/saquery: Add support for querying ServiceRecords Signed-off-by: Hal Rosenstock diff --git a/diags/ChangeLog b/diags/ChangeLog index 186059c..318f4b9 100644 --- a/diags/ChangeLog +++ b/diags/ChangeLog @@ -1,3 +1,8 @@ +2006-12-07 Hal Rosenstock + + * src/saquery.c, man/saquery.8: Add support for + querying ServiceRecords + 2006-11-21 Hal Rosenstock * src/perfquery.c: Add support for PerfMgt ClassPortInfo: diff --git a/diags/man/saquery.8 b/diags/man/saquery.8 index 853effc..5bbc8a2 100644 --- a/diags/man/saquery.8 +++ b/diags/man/saquery.8 @@ -1,11 +1,11 @@ -.TH SAQUERY 8 "October 9, 2006" "OpenIB" "OpenIB Diagnostics" +.TH SAQUERY 8 "December 7, 2006" "OpenIB" "OpenIB Diagnostics" .SH NAME saquery \- query InfiniBand subnet administration attributes .SH SYNOPSIS .B saquery -[\-h] [\-d] [\-P] [\-N] [\-D] [\-L] i[\-l] [\-G] [\-C] [\-s] [\-g] [\-m] [--src-to-dst ] [] +[\-h] [\-d] [\-P] [\-N] [\-D] [\-S] [\-L] i[\-l] [\-G] [\-C] [\-s] [\-g] [\-m] [--src-to-dst ] [] .SH DESCRIPTION .PP @@ -24,6 +24,9 @@ get NodeRecord info \fB\-D\fR get NodeDescriptions of CAs only .TP +\fB\-S\fR +get ServiceRecord info +.TP \fB\-L\fR return the Lids of the name specified .TP diff --git a/diags/src/saquery.c b/diags/src/saquery.c index cc39b06..168df01 100644 --- a/diags/src/saquery.c +++ b/diags/src/saquery.c @@ -334,6 +334,104 @@ print_multicast_member_record(ib_member_ } static void +print_service_record(ib_service_record_t *p_sr) +{ + char buf_service_key[35]; + char buf_service_name[65]; + + sprintf(buf_service_key, + "0x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x", + p_sr->service_key[0], + p_sr->service_key[1], + p_sr->service_key[2], + 
p_sr->service_key[3], + p_sr->service_key[4], + p_sr->service_key[5], + p_sr->service_key[6], + p_sr->service_key[7], + p_sr->service_key[8], + p_sr->service_key[9], + p_sr->service_key[10], + p_sr->service_key[11], + p_sr->service_key[12], + p_sr->service_key[13], + p_sr->service_key[14], + p_sr->service_key[15]); + strncpy(buf_service_name, (char *)p_sr->service_name, 64); + buf_service_name[64] = '\0'; + + printf("ServiceRecord dump:\n" + "\t\t\t\tServiceID...............0x%016" PRIx64 "\n" + "\t\t\t\tServiceGID..............0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tServiceP_Key............0x%X\n" + "\t\t\t\tServiceLease............0x%X\n" + "\t\t\t\tServiceKey..............%s\n" + "\t\t\t\tServiceName.............%s\n" + "\t\t\t\tServiceData8.1..........0x%X\n" + "\t\t\t\tServiceData8.2..........0x%X\n" + "\t\t\t\tServiceData8.3..........0x%X\n" + "\t\t\t\tServiceData8.4..........0x%X\n" + "\t\t\t\tServiceData8.5..........0x%X\n" + "\t\t\t\tServiceData8.6..........0x%X\n" + "\t\t\t\tServiceData8.7..........0x%X\n" + "\t\t\t\tServiceData8.8..........0x%X\n" + "\t\t\t\tServiceData8.9..........0x%X\n" + "\t\t\t\tServiceData8.10.........0x%X\n" + "\t\t\t\tServiceData8.11.........0x%X\n" + "\t\t\t\tServiceData8.12.........0x%X\n" + "\t\t\t\tServiceData8.13.........0x%X\n" + "\t\t\t\tServiceData8.14.........0x%X\n" + "\t\t\t\tServiceData8.15.........0x%X\n" + "\t\t\t\tServiceData8.16.........0x%X\n" + "\t\t\t\tServiceData16.1.........0x%X\n" + "\t\t\t\tServiceData16.2.........0x%X\n" + "\t\t\t\tServiceData16.3.........0x%X\n" + "\t\t\t\tServiceData16.4.........0x%X\n" + "\t\t\t\tServiceData16.5.........0x%X\n" + "\t\t\t\tServiceData16.6.........0x%X\n" + "\t\t\t\tServiceData16.7.........0x%X\n" + "\t\t\t\tServiceData16.8.........0x%X\n" + "\t\t\t\tServiceData32.1.........0x%X\n" + "\t\t\t\tServiceData32.2.........0x%X\n" + "\t\t\t\tServiceData32.3.........0x%X\n" + "\t\t\t\tServiceData32.4.........0x%X\n" + "\t\t\t\tServiceData64.1.........0x%016" PRIx64 
"\n" + "\t\t\t\tServiceData64.2.........0x%016" PRIx64 "\n" + "", + cl_ntoh64( p_sr->service_id ), + cl_ntoh64( p_sr->service_gid.unicast.prefix ), + cl_ntoh64( p_sr->service_gid.unicast.interface_id ), + cl_ntoh16( p_sr->service_pkey ), + cl_ntoh32( p_sr->service_lease ), + buf_service_key, + buf_service_name, + p_sr->service_data8[0], p_sr->service_data8[1], + p_sr->service_data8[2], p_sr->service_data8[3], + p_sr->service_data8[4], p_sr->service_data8[5], + p_sr->service_data8[6], p_sr->service_data8[7], + p_sr->service_data8[8], p_sr->service_data8[9], + p_sr->service_data8[10], p_sr->service_data8[11], + p_sr->service_data8[12], p_sr->service_data8[13], + p_sr->service_data8[14], p_sr->service_data8[15], + cl_ntoh16(p_sr->service_data16[0]), + cl_ntoh16(p_sr->service_data16[1]), + cl_ntoh16(p_sr->service_data16[2]), + cl_ntoh16(p_sr->service_data16[3]), + cl_ntoh16(p_sr->service_data16[4]), + cl_ntoh16(p_sr->service_data16[5]), + cl_ntoh16(p_sr->service_data16[6]), + cl_ntoh16(p_sr->service_data16[7]), + cl_ntoh32(p_sr->service_data32[0]), + cl_ntoh32(p_sr->service_data32[1]), + cl_ntoh32(p_sr->service_data32[2]), + cl_ntoh32(p_sr->service_data32[3]), + cl_ntoh64(p_sr->service_data64[0]), + cl_ntoh64(p_sr->service_data64[1]) + ); +} + +static void return_mad(void) { /* @@ -645,6 +743,26 @@ print_multicast_group_records(osm_bind_h return (status); } +static ib_api_status_t +print_service_records(osm_bind_handle_t bind_handle) +{ + int i = 0; + ib_service_record_t *service_record = NULL; + ib_net16_t attr_offset = ib_get_attr_offset(sizeof(*service_record)); + ib_api_status_t status; + + status = get_all_records(bind_handle, IB_MAD_ATTR_SERVICE_RECORD, attr_offset, 0); + if (status != IB_SUCCESS) + return (status); + + for (i = 0; i < result.result_cnt; i++) { + service_record = osmv_get_query_svc_rec(result.p_result_madw, i); + print_service_record(service_record); + } + return_mad(); + return (status); +} + static osm_bind_handle_t get_bind_handle(void) { @@ 
-729,12 +847,13 @@ clean_up(void) static void usage(void) { - fprintf(stderr, "Usage: %s [-h -d -P -N -D -L -l -G -C -s -g -m --src-to-dst ] []\n", argv0); + fprintf(stderr, "Usage: %s [-h -d -P -N -D -S -L -l -G -C -s -g -m --src-to-dst ] []\n", argv0); fprintf(stderr, " Queries node records by default\n"); fprintf(stderr, " -d enable debugging\n"); fprintf(stderr, " -P get PathRecord info\n"); fprintf(stderr, " -N get NodeRecord info\n"); fprintf(stderr, " -D get NodeDescriptions of CAs only\n"); + fprintf(stderr, " -S get ServiceRecord info\n"); fprintf(stderr, " -L return the Lids of the name specified\n"); fprintf(stderr, " -l return the unique Lid of the name specified\n"); fprintf(stderr, " -G return the Guids of the name specified\n"); @@ -758,7 +877,7 @@ main(int argc, char **argv) ib_net16_t dst_lid; ib_api_status_t status; - static char const str_opts[] = "PNDLlGCsgmdh"; + static char const str_opts[] = "PNDLlGCSsgmdh"; static const struct option long_opts [] = { {"P", 0, 0, 'P'}, {"N", 0, 0, 'N'}, @@ -771,6 +890,7 @@ main(int argc, char **argv) {"m", 0, 0, 'm'}, {"d", 0, 0, 'd'}, {"C", 0, 0, 'C'}, + {"S", 0, 0, 'S'}, {"help", 0, 0, 'h'}, {"src-to-dst", 1, 0, 1}, { } @@ -806,6 +926,9 @@ main(int argc, char **argv) case 'C': query_type = IB_MAD_ATTR_CLASS_PORT_INFO; break; + case 'S': + query_type = IB_MAD_ATTR_SERVICE_RECORD; + break; case 'N': query_type = IB_MAD_ATTR_NODE_RECORD; break; @@ -871,6 +994,9 @@ main(int argc, char **argv) case IB_MAD_ATTR_MCMEMBER_RECORD: status = print_multicast_group_records(bind_handle, members); break; + case IB_MAD_ATTR_SERVICE_RECORD: + status = print_service_records(bind_handle); + break; default: fprintf(stderr, "Unknown query type %d\n", query_type); status = IB_UNKNOWN_ERROR; From rdreier at cisco.com Thu Dec 7 14:44:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 Dec 2006 14:44:59 -0800 Subject: [openib-general] version #defines for the kernel In-Reply-To: 
<200612071104.kB7B4MTv009628@robert.bartonsoftware.com> ( Eric Barton's message of "Thu, 7 Dec 2006 11:04:22 GMT") References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com> Message-ID: > ...but is this the right thing to do? It's the "USER" in > IB_USER_VERBS_ABI_VERSION that's making me nervous since this is kernel code. No, this is utterly wrong -- the userspace verbs ABI has nothing to do with the in-kernel API (which changes at any time with no notice). > Actually a single OFED version #define would most probably suit my purposes - > is that controversial? It might be sensible for OFED to supply that, if it's going to backport drivers to old kernels. But you should also cope with non-OFED (vanilla upstream) drivers, probably by testing LINUX_VERSION_CODE too I suppose. From halr at voltaire.com Thu Dec 7 14:48:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Dec 2006 17:48:08 -0500 Subject: [openib-general] OpenSM Issues of the last couple days Message-ID: <1165531651.25587.204056.camel@hal.voltaire.com> Hi Eitan, Just wanted to close the loop on the OpenSM issues of the last couple days. 1. When can you supply an OpenSM verbose log for the InformInfo subscribe problem you reported earlier today ? Failing that, I don't know how to reproduce this. 2. With the latest tree, do your simulation tests now work ? The osm.fdbs UNREACHABLE was only a problem with the file and not with the LFTs in the network. 3. In terms of file format changes, the lack of any file versioning makes it difficult to move these forward when the need arises. (The format change to osm.mcfdbs was unintentional (not by design)). 4. I encourage you to look at and comment on the OpenSM patches rather than waiting for them to be in the tree. Thanks for your help in finding the bugs sooner. 
-- Hal From rdreier at cisco.com Thu Dec 7 16:06:15 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 07 Dec 2006 16:06:15 -0800 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: <20061206072604.GC26787@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 6 Dec 2006 09:26:04 +0200") References: <20061206072604.GC26787@mellanox.co.il> Message-ID: > I know. Still, this only happens if you enable CM. Maybe it will help > to mention this in the comment in KConfig? Log a message as well? Logging a message might help a tiny bit. But the Kconfig help text is useless -- most naive users will be running distro kernels or OFED, which I assume will enable CM by default. > I have a notion that once this code is upstream we can work on > ways to teach kernel about net devices where MTU changes dynamically. > Or possibly, some tricks with icmp can make it work. I think it would be better to use ethtool or something similar to explicitly enable CM. At least until there's a way to make multicast work on an interface using CM. From sweitzen at cisco.com Thu Dec 7 16:35:44 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 7 Dec 2006 16:35:44 -0800 Subject: [openib-general] Multicast Group Routing Question Message-ID: What OS and kernel are you using? I just took a closer look at RHEL4 U4 2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1, where sending IP multicast traffic causes the data to go to all hosts. I do not see this problem when the sender is SLES10 i686 or RHEL4 U3. This looks to be related to http://openib.org/bugzilla/show_bug.cgi?id=266.
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell > Sent: Wednesday, December 06, 2006 9:53 AM > To: openib-general at openib.org > Subject: [openib-general] Multicast Group Routing Question > > Hello, > > I was testing our code and noticed that when I send data using > multicast over our ib0 interface, all of the infiniband > switches route > the data to each switch and each node instead of a node that has an > application listening to the interface like Ethernet. Is this > by design? > > Thanks in advance, > > Sean > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From bugzilla-daemon at openib.org Thu Dec 7 16:41:55 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 7 Dec 2006 16:41:55 -0800 (PST) Subject: [openib-general] [Bug 266] IPoIB multicast does not work with RHEL4 U4 Message-ID: <20061208004155.84F972283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=266 ------- Comment #4 from sweitzen at cisco.com 2006-12-07 16:41 ------- What OS and kernel are you using? I just took a closer look on RHEL4 U4 2.6.9-42.Elsmp x86_64, and I am seeting the same problem with OFED 1.1, where sending IP multicast traffic causes the data to go to all hosts. I do not see this problem when the sender is SLES10 i686 or RHEL4 U3. 
> -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell > Sent: Wednesday, December 06, 2006 9:53 AM > To: openib-general at openib.org > Subject: [openib-general] Multicast Group Routing Question > > Hello, > > I was testing our code and noticed that when I send data using > multicast over our ib0 interface, all of the infiniband > switches route > the data to each switch and each node instead of a node that has an > application listening to the interface like Ethernet. Is this > by design? > > Thanks in advance, > > Sean ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From robert.j.woodruff at intel.com Thu Dec 7 16:41:19 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 7 Dec 2006 16:41:19 -0800 Subject: [openib-general] [TRIVIAL] ipoib connected mode makefile bug Message-ID: I tried to build the ipoib connected mode support and had to modify the IPoIB Makefile with the following patch to make it build correctly. woody --- drivers/infiniband/ulp/ipoib/Makefile 2006-12-07 15:39:51.000000000 -0800 +++ drivers/infiniband/ulp/ipoib/Makefile.new 2006-12-07 16:35:08.000000000 -0800 @@ -6,5 +6,5 @@ ib_ipoib-y := ipoib_main.o \ ipoib_verbs.o \ ipoib_vlan.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o -ib_ipoib-$(INFINIBAND_IPOIB_CM) += ipoib_cm.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM) += ipoib_cm.o From yhkim93 at keti.re.kr Thu Dec 7 18:49:27 2006 From: yhkim93 at keti.re.kr (=?euc-kr?B?sei/tciv?=) Date: Fri, 8 Dec 2006 11:49:27 +0900 (KST) Subject: [openib-general] booting problem after cross compile to ppc in infiniband source of linux-2.6.19 Message-ID: <28996683.1165546167039.JavaMail.kebi@nuri> An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Fri Dec 8 03:04:04 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Fri, 8 Dec 2006 13:04:04 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061208110404.GA31845@mellanox.co.il> >> > You can't send UDP/multicast traffic at all between IPoIB >> CM and IPoIB >> > UD? >> >> With my experimental code, this currently works only if you >> manually limit the MTU >> for multicast/UD addresses. >> The simplest way to do this is to set up separate interfaces >> for CM and UD modes. > >Separate interfaces as in ib0 vs ib1? >Thus I can use IPoIB HA or IPoIB >CM but not both, which is not very useful. There are many ways to use both IPoIB HA and IPoIB CM at the same time. You can create a child interface and use that for IPoIB CM. Or you can force lower MTU for UD destinations in the routing table. >Speaking of IPoIB CM, will it work with the OFED IPoIB HA? Should work. -- MST From mst at mellanox.co.il Fri Dec 8 03:08:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 8 Dec 2006 13:08:01 +0200 Subject: [openib-general] [Bug 266] IPoIB multicast does not work with RHEL4 U4 In-Reply-To: <20061208004155.84F972283D4@openib.ca.sandia.gov> References: <20061208004155.84F972283D4@openib.ca.sandia.gov> Message-ID: <20061208110801.GB31845@mellanox.co.il> This is a bug in RHEL4 U4. The issue is documented in the OFED release notes; the solution is to stay with U3. Quoting r. bugzilla-daemon at openib.org : Subject: [Bug 266] IPoIB multicast does not work with RHEL4 U4 http://openib.org/bugzilla/show_bug.cgi?id=266 ------- Comment #4 from sweitzen at cisco.com 2006-12-07 16:41 ------- What OS and kernel are you using? I just took a closer look at RHEL4 U4 2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1, where sending IP multicast traffic causes the data to go to all hosts. I do not see this problem when the sender is SLES10 i686 or RHEL4 U3.
> -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell > Sent: Wednesday, December 06, 2006 9:53 AM > To: openib-general at openib.org > Subject: [openib-general] Multicast Group Routing Question > > Hello, > > I was testing our code and noticed that when I send data using > multicast over our ib0 interface, all of the infiniband > switches route > the data to each switch and each node instead of a node that has an > application listening to the interface like Ethernet. Is this > by design? > > Thanks in advance, > > Sean ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From mst at mellanox.co.il Fri Dec 8 03:09:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 8 Dec 2006 13:09:30 +0200 Subject: [openib-general] [PATCH] IPoIB CM Experimental support In-Reply-To: References: <20061206072604.GC26787@mellanox.co.il> Message-ID: <20061208110930.GC31845@mellanox.co.il> > > I know. Still, this only happens if you enable CM. Maybe it will help > > to mention this in the comment in KConfig? Log a message as well? > > Logging a message might help a tiny bit. But the Kconfig help text is > useless -- most naive users will be running distro kernels or OFED, > which I assume will enable CM by default. > > > I have a notion that once this code is upstream we can work on > > ways to teach kernel about net devices where MTU changes dynamically. > > Or possibly, some tricks with icmp can make it work. > > I think it would be better to use ethtool or something similar to > explicitly enable CM. 
At least until there's a way to make multicast > work on an interface using CM. Thanks for the suggestion, I'll look into that. -- MST From shubbell at dbresearch.net Fri Dec 8 05:04:30 2006 From: shubbell at dbresearch.net (Sean Hubbell) Date: Fri, 08 Dec 2006 07:04:30 -0600 Subject: [openib-general] [Bug 266] IPoIB multicast does not work with RHEL4 U4 In-Reply-To: <20061208004155.84F972283D4@openib.ca.sandia.gov> References: <20061208004155.84F972283D4@openib.ca.sandia.gov> Message-ID: <457962DE.2080407@dbresearch.net> Centos Linux neptune 2.6.9-42.0.3.plus.c4smp #1 SMP Fri Oct 6 11:42:04 CDT 2006 x86_64 GNU/Linux (Isn't Centos using a lot of the RH rpms?). Sean bugzilla-daemon at openib.org wrote: > http://openib.org/bugzilla/show_bug.cgi?id=266 > > > > > > ------- Comment #4 from sweitzen at cisco.com 2006-12-07 16:41 ------- > What OS and kernel are you using? I just took a closer look at RHEL4 U4 > 2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1, > where sending IP multicast traffic causes the data to go to all hosts. > I do not see this problem when the sender is SLES10 i686 or RHEL4 U3. > > > >> -----Original Message----- >> From: openib-general-bounces at openib.org >> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell >> Sent: Wednesday, December 06, 2006 9:53 AM >> To: openib-general at openib.org >> Subject: [openib-general] Multicast Group Routing Question >> >> Hello, >> >> I was testing our code and noticed that when I send data using >> multicast over our ib0 interface, all of the infiniband >> switches route >> the data to each switch and each node instead of a node that has an >> application listening to the interface like Ethernet. Is this >> by design? >> >> Thanks in advance, >> >> Sean >> > > > > > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee.
> _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From eitan at mellanox.co.il Fri Dec 8 08:42:13 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 08 Dec 2006 18:42:13 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <1165531651.25587.204056.camel@hal.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> Message-ID: <457995E5.40303@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > Just wanted to close the loop on the OpenSM issues of the last couple > days. > > 1. When can you supply an OpenSM verbose log for the InformInfo > subscribe problem you reported earlier today ? Failing that, I don't > know how to reproduce this. > Attached > 2. With the latest tree, do your simulation tests now work ? The > osm.fdbs UNREACHABLE was only a problem with the file and not with the > LFTs in the network. > Yes they do. > 3. In terms of file format changes, the lack of any file versioning > makes it difficult to move these forward when the need arises. (The > format change to osm.mcfdbs was unintentional (not by design)). > The issues until now were not cases where a file format change was required; the changes were unintentional. When we have a real need to change a file format I am sure we can agree on adding a version and changing all parsers. > 4. I encourage you to look at and comment on the OpenSM patches rather > than waiting for them to be in the tree. > I am sure you did not mean to, but now I have to admit my limited skills in catching bugs by reading patches :-( . Instead of relying on reading patches for bugs I use automatic regression. I wish we could agree on some regression that each developer will have to run before patches are committed to the trunk.
On my side I would love to have an automatic way to include all the patches posted (one at a time) run "dead or alive" check and provide feedback. Currently my automation is limited to testing the trunk. So I will always be complaining after the patches are committed. I think this is the way most other components testing works. What kind of regression suite do you and Sasha use? Can we agree on minimal pre-commit testing? Can we have a branch for that sake where all patches will first have to go into for 2 days? (it will allow for pre-trunk testing). > Thanks for your help in finding the bugs sooner. > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- A non-text attachment was scrubbed... Name: ibmgtsim.13801.tar.bz2 Type: application/octet-stream Size: 505618 bytes Desc: not available URL: From sweitzen at cisco.com Fri Dec 8 09:47:51 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Fri, 8 Dec 2006 09:47:51 -0800 Subject: [openib-general] [Bug 266] IPoIB multicast does not work with RHEL4 U4 Message-ID: The OFED 1.1 IPoIB release notes state "5. On RedHat EL 4 up4, ipoib multicast group membership does not work due to missing code in the kernel which was available in u3 and removed in u4.", which is a good hint, but I just want to clarify that U4 can only receive multicast from U4, and U4 sends multicast to all nodes. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: Michael S. 
Tsirkin [mailto:mst at mellanox.co.il] > Sent: Friday, December 08, 2006 3:08 AM > To: Scott Weitzenkamp (sweitzen) > Cc: openib-general at openib.org > Subject: Re: [Bug 266] IPoIB multicast does not work with RHEL4 U4 > > This is a bug in RHEL4 U4. > The issue is documented in OFED release notes, the solution is > is to stay with U3. > > Quoting r. bugzilla-daemon at openib.org : > Subject: [Bug 266] IPoIB multicast does not work with RHEL4 U4 > > http://openib.org/bugzilla/show_bug.cgi?id=266 > > > > > > ------- Comment #4 from sweitzen at cisco.com 2006-12-07 16:41 ------- > What OS and kernel are you using? I just took a closer look > on RHEL4 U4 > 2.6.9-42.Elsmp x86_64, and I am seeting the same problem with > OFED 1.1, > where sending IP multicast traffic causes the data to go to all hosts. > I do not see this problem when the sender is SLES10 i686 or RHEL4 U3. > > > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell > > Sent: Wednesday, December 06, 2006 9:53 AM > > To: openib-general at openib.org > > Subject: [openib-general] Multicast Group Routing Question > > > > Hello, > > > > I was testing our code and noticed that when I send data using > > multicast over our ib0 interface, all of the infiniband > > switches route > > the data to each switch and each node instead of a node that has an > > application listening to the interface like Ethernet. Is this > > by design? > > > > Thanks in advance, > > > > Sean > > > > > ------- You are receiving this mail because: ------- > You are the assignee for the bug, or are watching the assignee. 
> -- > MST > From or.gerlitz at gmail.com Fri Dec 8 10:45:00 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 8 Dec 2006 20:45:00 +0200 Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support foroperation over IPoIB In-Reply-To: References: Message-ID: <15ddcffd0612081045s569bd04at8489f35e32fe6bcc@mail.gmail.com> On 12/8/06, Carl Yang (caryang) wrote: > Can you please forward me (or to the email alias) "an example bonding > sysfs script which can be used to set bonding to work with patches 1-3?" Sure, I did that when I sent the patches; you can find it here: http://marc.theaimsgroup.com/?l=linux-netdev&m=116488445829045&w=2 Or. From halr at voltaire.com Fri Dec 8 11:23:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2006 14:23:36 -0500 Subject: [openib-general] OpenSM/osm_remote_sm.h: Eliminate unused is_opensm boolean Message-ID: <1165605794.25587.256398.camel@hal.voltaire.com> OpenSM/osm_remote_sm.h: Eliminate unused is_opensm boolean Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_remote_sm.h b/osm/include/opensm/osm_remote_sm.h index 68359c6..6e67b7c 100644 --- a/osm/include/opensm/osm_remote_sm.h +++ b/osm/include/opensm/osm_remote_sm.h @@ -97,7 +97,6 @@ typedef struct _osm_remote_sm cl_map_item_t map_item; const osm_port_t *p_port; ib_sm_info_t smi; - boolean_t is_opensm; } osm_remote_sm_t; /* * FIELDS @@ -109,10 +108,6 @@ typedef struct _osm_remote_sm * smi * The SMInfo attribute for this SM. * -* is_opensm -* TRUE if this SM is an OpenSM. -* FALSE otherwise. 
-* * SEE ALSO *********/ From eric at barton.org.uk Fri Dec 8 11:55:44 2006 From: eric at barton.org.uk (Eric Barton) Date: Fri, 8 Dec 2006 19:55:44 -0000 Subject: [openib-general] version #defines for the kernel In-Reply-To: Message-ID: <045401c71b02$d8d17a40$0281a8c0@ebpc> > > Actually a single OFED version #define would most probably > > suit my purposes - > > is that controversial? > > It might be sensible for OFED to supply that, if it's going to > backport drivers to old kernels. But you should also cope with > non-OFED (vanilla upstream) drivers, probably by testing > LINUX_VERSION_CODE too I suppose. How about an OpenFabrics API version #define? Living in hope... Cheers, Eric --------------------------------------------------- |Eric Barton Barton Software | |9 York Gardens Tel: +44 (117) 330 1575 | |Clifton Mobile: +44 (7909) 680 356 | |Bristol BS8 4LL Fax: call first | |United Kingdom E-Mail: eeb at bartonsoftware.com| --------------------------------------------------- From vu at mellanox.com Fri Dec 8 12:10:43 2006 From: vu at mellanox.com (Vu Pham) Date: Fri, 08 Dec 2006 12:10:43 -0800 Subject: [openib-general] nfsrdma server stop responding, Message-ID: <4579C6C3.5090207@mellanox.com> Hi James, I got these errors in server's /var/log/messages and then the server stop responding to login, I/O...; however, the server is still up, ipoib is still working Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] [] put_page+0x17/0x40 Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: 00010246 Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000003ffff Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8102274e92f8 Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034 R09: 0000000000000000 Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81020ef96800 Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000 R15: 
ffff8102053ee890 Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) GS:ffff81022066eb40(0000) knlGS:0000000000000000 Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000 CR4: 00000000000006e0 Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo ffff810219dde000, task ffff81020d87f0c0) Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 ffff81020ef96968 ffff81020ef96800 ffff81020ef96958 Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 ffffffff80424e05 0000000000000000 Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 ffffffff80239b90 ffff81020d87f0c0 Dec 8 06:38:21 ibd201 kernel: Call Trace: Dec 8 06:38:21 ibd201 kernel: [] :sunrpc:svc_rdma_put_context+0x37/0xd0 Dec 8 06:38:21 ibd201 kernel: [] :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 Dec 8 06:38:21 ibd201 kernel: [] schedule_timeout+0x95/0xb0 Dec 8 06:38:21 ibd201 kernel: [] process_timeout+0x0/0x10 Dec 8 06:38:21 ibd201 kernel: [] wait_for_completion_timeout+0xcd/0x150 Dec 8 06:38:21 ibd201 kernel: [] default_wake_function+0x0/0x10 Dec 8 06:38:21 ibd201 kernel: [] :ib_mthca:mthca_cmd_post+0x232/0x260 Dec 8 06:38:21 ibd201 kernel: [] default_wake_function+0x0/0x10 Dec 8 06:38:21 ibd201 kernel: [] __next_cpu+0x19/0x30 Dec 8 06:38:21 ibd201 kernel: [] find_busiest_group+0x24e/0x6d0 Dec 8 06:38:21 ibd201 kernel: [] thread_return+0x0/0xde Dec 8 06:38:21 ibd201 kernel: [] _spin_unlock_irqrestore+0x8/0x10 Dec 8 06:38:21 ibd201 kernel: [] try_to_del_timer_sync+0x51/0x60 Dec 8 06:38:21 ibd201 kernel: [] del_timer_sync+0xc/0x20 Dec 8 06:38:21 ibd201 kernel: [] schedule_timeout+0x95/0xb0 Dec 8 06:38:21 ibd201 kernel: [] :sunrpc:svc_recv+0x416/0x510 Dec 8 06:38:21 ibd201 kernel: [] default_wake_function+0x0/0x10 Dec 8 06:38:21 ibd201 kernel: [] default_wake_function+0x0/0x10 Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 Dec 8 06:38:21 ibd201 kernel: [] 
:nfsd:nfsd+0x111/0x380 Dec 8 06:38:21 ibd201 kernel: [] child_rip+0xa/0x12 Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 Dec 8 06:38:21 ibd201 kernel: [] child_rip+0x0/0x12 Dec 8 06:38:21 ibd201 kernel: Dec 8 06:38:21 ibd201 kernel: Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08 0f 94 c0 84 c0 74 Dec 8 06:38:21 ibd201 kernel: RIP [] put_page+0x17/0x40 Dec 8 06:38:21 ibd201 kernel: RSP -vu From rdreier at cisco.com Fri Dec 8 12:17:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 08 Dec 2006 12:17:54 -0800 Subject: [openib-general] version #defines for the kernel In-Reply-To: <045401c71b02$d8d17a40$0281a8c0@ebpc> (Eric Barton's message of "Fri, 8 Dec 2006 19:55:44 -0000") References: <045401c71b02$d8d17a40$0281a8c0@ebpc> Message-ID: > How about an OpenFabrics API version #define? No other kernel subsystem has one, so I don't think it's realistic to expect one for IB. - R. From halr at voltaire.com Fri Dec 8 13:05:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2006 16:05:46 -0500 Subject: [openib-general] [PATCH][TRIVIAL] osmtest/osmtest.c: Fix endian of capability mask output Message-ID: <1165611934.26559.214.camel@hal.voltaire.com> osmtest/osmtest.c: Fix endian of capability mask output Signed-off-by: Hal Rosenstock diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index b3f2bb4..6a571f5 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -3749,7 +3749,8 @@ osmtest_validate_port_data( IN osmtest_t "Field mismatch port LID 0x%X Num:0x%X\n" "\t\t\t\tExpected capability_mask 0x%X, received 0x%X\n", cl_ntoh16( p_rec->lid ), p_rec->port_num, - p_port->rec.port_info.capability_mask, p_rec->port_info.capability_mask ); + cl_ntoh32( p_port->rec.port_info.capability_mask ), + cl_ntoh32( p_rec->port_info.capability_mask ) ); status = IB_ERROR; goto Exit; } From sashak at voltaire.com Fri Dec 8 13:55:23 2006 From: sashak at 
voltaire.com (Sasha Khapyorsky) Date: Fri, 8 Dec 2006 23:55:23 +0200 Subject: [openib-general] [PATCH] osm: Routing Tables are full of UNREACHABLE instead of real route In-Reply-To: <45782F7B.1010408@mellanox.co.il> References: <45782F7B.1010408@mellanox.co.il> Message-ID: <20061208215523.GF9193@sashak.voltaire.com> Hi Eitan, On 17:12 Thu 07 Dec , Eitan Zahavi wrote: > Hi Hal, > > I resolved the mystery behind the osm.fdbs that is now full of > UNREACHABLE instead of correct out ports. > > The problem is a consequence of the new code that does not use the > switch LFT blocks for the intermediate LFT assignments: > The idea of having incremental updates only relies on temporary buffer > that the routing algorithm fills. > Then it is sent to the wire only if there is a diff between the switch > LFT tables (from the SMDB) and the temporary buffer. > > So the switch LFT tables are not being directly updated by the routing > algorithm - but only by the GetResp obtained as > reply to the setting. Until this stage of the description - everything > looks right. > > But what is wrong is that the dump of LFT tables is invoked before the > GetResp is obtained. > So if only a single sweep is invoked the resulting osm.fdbs show the > original state of the SMDB tables whicg is full of 0xFF = UNREACHABLE. Right. > > The patch below is taking the easy way and should be probably revisited. > Instead of having a separate algorithm step for dumping out the > resulting GetResp data after all LFT responses were obtained it just > copies the sent LFT blocks to the SMDB. Wouldn't it be better just to move all dumps to the end of the OpenSM heavy sweep? This should be simple, right? Sasha > > I think we need to have at least this simple patch until we have the > dump move to a new algorithm step. 
> > Thanks > Eitan > > Signed-off-by: Eitan Zahavi > ===================================================================== > > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > index 5a55da8..3a62c7f 100644 > --- a/osm/opensm/osm_ucast_mgr.c > +++ b/osm/opensm/osm_ucast_mgr.c > @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table( > "osm_ucast_mgr_set_fwd_table: ERR 3A05: " > "Sending linear fwd. tbl. block failed (%s)\n", > ib_get_err_str( status ) ); > - } > + } else { > + /* > + HACK: for now we will assume we succeeded to send > + and set the local DB based on it. This should allow > + us to immediatly dump out our routing > + */ > + osm_switch_set_ft_block( > + p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho); > + } > } > > OSM_LOG_EXIT( p_mgr->p_log ); > From sashak at voltaire.com Fri Dec 8 14:10:01 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 00:10:01 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457995E5.40303@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> Message-ID: <20061208221001.GG9193@sashak.voltaire.com> On 18:42 Fri 08 Dec , Eitan Zahavi wrote: > Hal Rosenstock wrote: > >Hi Eitan, > > > >Just wanted to close the loop on the OpenSM issues of the last couple > >days. > > > >1. When can you supply an OpenSM verbose log for the InformInfo > >subscribe problem you reported earlier today ? Failing that, I don't > >know how to reproduce this. > > > Attached > >2. With the latest tree, do your simulation tests now work ? The > >osm.fdbs UNREACHABLE was only a problem with the file and not with the > >LFTs in the network. > > > Yes they do. > >3. In terms of file format changes, the lack of any file versioning > >makes it difficult to move these forward when the need arises. (The > >format change to osm.mcfdbs was unintentional (not by design)). 
> > > The issues until now were not that a file format change was required but > were unintentional. > When we will have a real need to change file format I am sure we can > agree on adding version and change all parsers. > >4. I encourage you to look at and comment on the OpenSM patches rather > >than waiting for them to be in the tree. > > > I am sure you did not mean to, but now I have to admit my limited skills > in catching bugs by reading patches :-( . > Instead on relying on bug reading I use automatic regression. I wish we > could agree on some regression that > each developer will have to run before patches are committed to the trunk. > On my side I would love to have an automatic way to include all the > patches posted (one at a time) run "dead or alive" check > and provide feedback. Currently my automation is limited to testing the > trunk. So I will always be complaining after the patches are > committed. I think this is the way most other components testing works. > > What kind of regression suite do you and Sasha use? On my side it clearly depends on the kind of changes. In general I would call this "uni-testing". > Can we agree on minimal pre-commit testing? > Can we have a branch for that sake where all patches will first have to > go into for 2 days? (it will allow for pre-trunk testing). One more development branch? Will you test (or even look at) it? If so, I can publish the "fresh" tree. Sasha From venkatesh.babu at 3leafnetworks.com Fri Dec 8 14:12:03 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Fri, 08 Dec 2006 14:12:03 -0800 Subject: [openib-general] Unreliable OpemSM failover In-Reply-To: <1164674885.11808.760.camel@hal.voltaire.com> References: <1164117837.4381.48210.camel@hal.voltaire.com> <456B7CC8.5060806@3leafnetworks.com> <1164674885.11808.760.camel@hal.voltaire.com> Message-ID: <4579E333.4000901@3leafnetworks.com> I see the same problem with the OFED 1.1 stack as well, but less frequently. 
I had to try 120 failovers (by rebooting the highest-priority OpenSM server) before hitting this problem. In this state OpenSM doesn't write anything to the log files, doesn't assign LIDs to the other nodes, and doesn't respond to multicast join operations. Even if another OpenSM is started on another node with higher priority, it cannot become the master. The only way to recover is to kill the stuck OpenSM. VBabu Hal Rosenstock wrote: >I don't see any explicit changes to the SM state machine which would >affect this but as I have mentioned before there are many bug fixes in >OFED 1.1. I can't conclusively state whether this would fix the issue >you see but would be in a much better position to try to figure this >out. > >-- Hal > > > >> Hi >> >> I have topology of two switches and a bunch of nodes, with each >> node having 2port HCAs. Port1 of every node connects to switch1 and >> Port2 of every node connects to switch2. So Port1 and Port2 are in >> different subnets. So I am running one OpenSM (from OFED 1.0) for >> each port on one node designated as a server. To guard against that >> server going down I have another server node to run the OpenSM in >> "standby" mode for each port. I will adjust the priorities such that >> first server always has "master" OpenSM and second server has >> "standby" OpenSM. >> >> When the first server rebooted, "standby" OpenSM should takeover >> the mastership. It usually works fine but sometimes it is failing to >> takeover. In the following example OpenSM for Port1 failed to >> takeover, but OpenSM for Port2 took over and became "master". The >> OpenSM for Port1 seems be stuck in some weired state (strace shows >> that it is sleeping). It is no longer assigning LIDs to the rest of >> the nodes in the subnet and not responding to the broadcast joins. >> The log file shows nothing from past 4 days. I have the complete log >> files if needed. >> >> Is this a known problem and fixed in OFED 1.1 ? 
>> >> [root at vortex3l-72 158]# ibv_devinfo >> hca_id: mthca0 >> fw_ver: 5.1.400 >> node_guid: 0050:4501:4b1a:0000 >> sys_image_guid: 0050:4501:4b1a:0003 >> vendor_id: 0x02c9 >> vendor_part_id: 25218 >> hw_ver: 0xA0 >> board_id: ARM0020000001 >> phys_port_cnt: 2 >> port: 1 >> state: PORT_ACTIVE (4) >> max_mtu: 2048 (4) >> active_mtu: 2048 (4) >> sm_lid: 7 >> port_lid: 1 >> port_lmc: 0x00 >> >> port: 2 >> state: PORT_ACTIVE (4) >> max_mtu: 2048 (4) >> active_mtu: 2048 (4) >> sm_lid: 1 >> port_lid: 1 >> port_lmc: 0x00 >> >> [root at vortex3l-72 158]# ps -aux | grep open >> Warning: bad syntax, perhaps a bogus '-'? See >> /usr/share/doc/procps-3.2.3/FAQ >> root 7988 0.0 0.0 92784 1672 ? Sl Nov22 0:06 >> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f >> /var/log/opensm2.log >> root 7975 0.0 0.0 92784 1572 ? Sl Nov22 0:06 >> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f >> /var/log/opensm1.log >> root 7803 0.0 0.0 51096 668 pts/0 S+ 12:11 0:00 grep open >> [root at vortex3l-72 158]# strace -p7975 >> Process 7975 attached - interrupt to quit >> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, NULL) = 0 >> nanosleep({10, 0}, >> Process 7975 detached >> [root at vortex3l-72 158]# uptime >> 12:13:02 up 4 days, 17:05, 5 users, load average: 0.00, 0.00, 0.00 >> [root at vortex3l-72 158]# date >> Mon Nov 27 12:13:05 PST 2006 >> [root at vortex3l-72 158]# tail /var/log/opensm1.log >> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn >> 3673M >> >> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting >> Generic Notice type:3 num:66 from LID:0x0000 >> GID:0xfe80000000000000,0x0000000000000000 >> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting >> Generic 
Notice type:3 num:66 from LID:0x0000 >> GID:0xfe80000000000000,0x0000000000000000 >> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port >> 0x5045014b1a0001 >> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port >> 0x5045014b1a0001 >> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state >> >> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state >> >> [root at vortex3l-72 158]# tail /var/log/opensm2.log >> 00 00 00 00 00 00 00 00 00 00 00 00 >> 00 00 00 00 >> >> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting >> Generic Notice type:3 num:65 from LID:0x0001 >> GID:0xfe80000000000000,0x005045014b1a0002 >> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: >> Cannot find destination port with LID:0x0002 >> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: >> Cannot find destination port with LID:0x0003 >> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: >> Cannot find destination port with LID:0x0004 >> Nov 27 12:10:32 146382 [41401960] -> Removed port with >> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1 >> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: >> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to >> light sweep sampling list >> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path: >> Path = [0][2] >> From adit.262 at gmail.com Fri Dec 8 14:31:49 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Fri, 8 Dec 2006 17:31:49 -0500 Subject: [openib-general] Assigning IP addresses to IB interfaces Message-ID: Hi, I have installed the OpenIB gen2 driver, but the IB interfaces haven't been assigned any IP addresses. Is it possible to assign them IP addresses using ifconfig and then ping between the interfaces of two machines? 
Thanks, Regards, Adit -- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From halr at voltaire.com Fri Dec 8 14:33:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2006 17:33:18 -0500 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457995E5.40303@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> Message-ID: <1165617195.26559.4435.camel@hal.voltaire.com> On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi Eitan, > > > > Just wanted to close the loop on the OpenSM issues of the last couple > > days. > > > > 1. When can you supply an OpenSM verbose log for the InformInfo > > subscribe problem you reported earlier today ? Failing that, I don't > > know how to reproduce this. > > > Attached Hmmm.... osmtest seems to fail much earlier than OpenSM unless I am mistaken. OpenSM sees the final InformInfo unsubscribe (cleanup) and fails on that. I thought the osmtest side failed earlier. In a number of places in osm.log, I see: Dec 08 18:17:02 266690 [B2562BB0] -> __osmv_dispatch_rmpp_mad: [ Dec 08 18:17:02 266707 [B2562BB0] -> __osmv_dispatch_rmpp_snd: [ Dec 08 18:17:02 266723 [B2562BB0] -> Not supposed to receive DATA packets --> dropping the MAD Dec 08 18:17:02 266739 [B2562BB0] -> __osmv_dispatch_rmpp_snd: ] Dec 08 18:17:02 266755 [B2562BB0] -> __osmv_dispatch_rmpp_mad: ] Is that supposed to happen ? What does that mean ? Does that mess things up ? 
SA GetTable InformInfoRecord Dec 08 18:17:02 265333 [B6B69BB0] -> osm_infr_rcv_process_get_method: Query Subscriber GID:0x0000000000000000 : 0x0000000000000000(00) Enum:0x0(01) Dec 08 18:17:02 265370 [B6B69BB0] -> __osm_sa_inform_info_rec_by_comp_mask: [ Dec 08 18:17:02 265388 [B2562BB0] -> osmv_dispatch_mad: ] Dec 08 18:17:02 265406 [B6B69BB0] -> osm_infr_get_by_enum: [ Dec 08 18:17:02 265424 [B2562BB0] -> __osmv_ibms_receiver_callback: ] Dec 08 18:17:02 265443 [B6B69BB0] -> osm_infr_get_by_enum: ] Dec 08 18:17:02 265482 [B6B69BB0] -> __osm_sa_inform_info_rec_by_comp_mask: ] Dec 08 18:17:02 265499 [B6B69BB0] -> osm_infr_rcv_process_get_method: Returning 1 records SA Set InformInfo Dec 08 18:17:02 269386 [B756ABB0] -> osm_infr_rcv_process_set_method: UnSubscribe Request with QPN: 0x000001 Dec 08 18:17:02 269421 [B756ABB0] -> osm_infr_get_by_rec: [ Dec 08 18:17:02 269439 [B2562BB0] -> <-- Released lock 0x8d79c20 on bind handle 0x8d79c10 Dec 08 18:17:02 269457 [B756ABB0] -> __dump_all_informs: [ Dec 08 18:17:02 269476 [B2562BB0] -> osmv_dispatch_mad: ] Dec 08 18:17:02 269496 [B756ABB0] -> InformInfo dump: gid.....................0x0000000000000000 : 0x0000000000000000 lid_range_begin.........0x0 lid_range_end...........0x0 is_generic..............0x0 subscribe...............0x1 trap_type...............0x0 dev_id..................0x0 qpn.....................0x000001 resp_time_val...........0x0 vendor_id...............0x000000 Dec 08 18:17:02 269513 [B2562BB0] -> __osmv_ibms_receiver_callback: ] Dec 08 18:17:02 269532 [B756ABB0] -> __dump_all_informs: ] Dec 08 18:17:02 269566 [B756ABB0] -> osm_infr_get_by_rec: Looking for Inform Record Dec 08 18:17:02 269582 [B756ABB0] -> InformInfo dump: gid.....................0x0000000000000000 : 0x0000000000000000 lid_range_begin.........0x0 lid_range_end...........0x0 is_generic..............0x0 subscribe...............0x0 trap_type...............0x0 dev_id..................0x0 qpn.....................0x000001 
resp_time_val...........0x0 vendor_id...............0x000000 Dec 08 18:17:02 269625 [B756ABB0] -> osm_infr_get_by_rec: InformInfo list size 1 Dec 08 18:17:02 269650 [B756ABB0] -> __match_inf_rec: [ Dec 08 18:17:02 269673 [B756ABB0] -> __match_inf_rec: Differ by Address Dec 08 18:17:02 269698 [B756ABB0] -> __match_inf_rec: ] Dec 08 18:17:02 269724 [B756ABB0] -> osm_infr_get_by_rec: ] Dec 08 18:17:02 269751 [B756ABB0] -> osm_infr_rcv_process_set_method: ERR 4307: Failed to UnSubscribe to non existing inform object Dec 08 18:17:02 269914 [B756ABB0] -> SA MAD dump: base_ver................0x1 mgmt_class..............0x3 class_ver...............0x2 method..................0x81 (SubnAdmGetResp) status..................0x200 resv....................0x0 trans_id................0x360600000033 attr_id.................0x3 (InformInfo) It looks like the OpenSM side fails on the following: if ( memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, sizeof(p_infr_rec->report_addr)) ) { osm_log( p_log, OSM_LOG_DEBUG, "__match_inf_rec: " "Differ by Address\n" ); goto Exit; } Not sure why that is. Guess it needs to be debugged... > > 2. With the latest tree, do your simulation tests now work ? The > > osm.fdbs UNREACHABLE was only a problem with the file and not with the > > LFTs in the network. > > > Yes they do. Good. > > 3. In terms of file format changes, the lack of any file versioning > > makes it difficult to move these forward when the need arises. (The > > format change to osm.mcfdbs was unintentional (not by design)). > > > The issues until now were not that a file format change was required but > were unintentional. > When we will have a real need to change file format I am sure we can > agree on adding version and change all parsers. We will have a real need at some point. It is more likely the config files but there may be more info to add to other files as well. > > 4. 
I encourage you to look at and comment on the OpenSM patches rather > > than waiting for them to be in the tree. > > > I am sure you did not mean to, but now I have to admit my limited skills > in catching bugs by reading patches :-( . They are there not just to read but to try out as well. > Instead on relying on bug reading I use automatic regression. I wish we > could agree on some regression that > each developer will have to run before patches are committed to the trunk. > On my side I would love to have an automatic way to include all the > patches posted (one at a time) run "dead or alive" check > and provide feedback. Currently my automation is limited to testing the > trunk. So I will always be complaining after the patches are > committed. I think this is the way most other components testing works. You could try out the patches and do the same thing before they are committed. > What kind of regression suite do you and Sasha use? Haven't we been over this before ? I might ask the same of you and Yevgeny. There are similar occurrences. I use osmtest for most of my testing as well as a subnet on which I perform directed tests on the functionality being changed. Sasha does testing on both live and simulated subnets. > Can we agree on minimal pre-commit testing? I think we do a reasonable level of pre-commit testing and have been responsive to breakages not necessarily of our own making. > Can we have a branch for that sake where all patches will first have to > go into for 2 days? (it will allow for pre-trunk testing). That's why the patches go out first. The patches in question were out there for over a week. This seems like another level of overhead to me. Is there real gain here ? -- Hal > > Thanks for your help in finding the bugs sooner. 
> > > > -- Hal > > > From halr at voltaire.com Fri Dec 8 14:44:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 08 Dec 2006 17:44:40 -0500 Subject: [openib-general] Unreliable OpemSM failover In-Reply-To: <4579E333.4000901@3leafnetworks.com> References: <1164117837.4381.48210.camel@hal.voltaire.com> <456B7CC8.5060806@3leafnetworks.com> <1164674885.11808.760.camel@hal.voltaire.com> <4579E333.4000901@3leafnetworks.com> Message-ID: <1165617878.26559.4952.camel@hal.voltaire.com> On Fri, 2006-12-08 at 17:12, Venkatesh Babu wrote: > I have got the same problem with OFED 1.1 stack also, but the frequency > is less. I had to try 120 fail overs (by rebooting the highest priority > OpenSM server) before getting into this problem. If I understand you correctly, you reboot the master SM and the standby does not take over (become master). Is that correct ? Is this with 2 SMs or more ? > At this state OpenSM doesn't update anything to the log files; > doesn't assign the LIDs to the other nodes; doesn't respond > to the multi cast join operations. Even another OpenSM is > started on another node with higher priority it can > not become the master. The only way to recover from this is by killing > the stuck OpenSM. What SMLID do the nodes in the subnet point to ? Can you determine where it is stuck ? Sounds like it could be in some tight loop. Can you build with gdb and attach when this occurs to see ? -- Hal > VBabu > > Hal Rosenstock wrote: > > >I don't see any explicit changes to the SM state machine which would > >affect this but as I have mentioned before there are many bug fixes in > >OFED 1.1. 
I can't conclusively state whether this would fix the issue
> >you see but would be in a much better position to try to figure this
> >out.
> >
> >-- Hal
> >
> >> Hi
> >>
> >> I have topology of two switches and a bunch of nodes, with each
> >> node having 2port HCAs. Port1 of every node connects to switch1 and
> >> Port2 of every node connects to switch2. So Port1 and Port2 are in
> >> different subnets. So I am running one OpenSM (from OFED 1.0) for
> >> each port on one node designated as a server. To guard against that
> >> server going down I have another server node to run the OpenSM in
> >> "standby" mode for each port. I will adjust the priorities such that
> >> first server always has "master" OpenSM and second server has
> >> "standby" OpenSM.
> >>
> >> When the first server rebooted, "standby" OpenSM should takeover
> >> the mastership. It usually works fine but sometimes it is failing to
> >> takeover. In the following example OpenSM for Port1 failed to
> >> takeover, but OpenSM for Port2 took over and became "master". The
> >> OpenSM for Port1 seems to be stuck in some weird state (strace shows
> >> that it is sleeping). It is no longer assigning LIDs to the rest of
> >> the nodes in the subnet and not responding to the broadcast joins.
> >> The log file shows nothing from past 4 days. I have the complete log
> >> files if needed.
> >>
> >> Is this a known problem and fixed in OFED 1.1 ?
> >>
> >> [root at vortex3l-72 158]# ibv_devinfo
> >> hca_id: mthca0
> >>         fw_ver:                 5.1.400
> >>         node_guid:              0050:4501:4b1a:0000
> >>         sys_image_guid:         0050:4501:4b1a:0003
> >>         vendor_id:              0x02c9
> >>         vendor_part_id:         25218
> >>         hw_ver:                 0xA0
> >>         board_id:               ARM0020000001
> >>         phys_port_cnt:          2
> >>                 port:   1
> >>                         state:          PORT_ACTIVE (4)
> >>                         max_mtu:        2048 (4)
> >>                         active_mtu:     2048 (4)
> >>                         sm_lid:         7
> >>                         port_lid:       1
> >>                         port_lmc:       0x00
> >>
> >>                 port:   2
> >>                         state:          PORT_ACTIVE (4)
> >>                         max_mtu:        2048 (4)
> >>                         active_mtu:     2048 (4)
> >>                         sm_lid:         1
> >>                         port_lid:       1
> >>                         port_lmc:       0x00
> >>
> >> [root at vortex3l-72 158]# ps -aux | grep open
> >> Warning: bad syntax, perhaps a bogus '-'? See
> >> /usr/share/doc/procps-3.2.3/FAQ
> >> root 7988 0.0 0.0 92784 1672 ? Sl Nov22 0:06
> >> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f
> >> /var/log/opensm2.log
> >> root 7975 0.0 0.0 92784 1572 ? Sl Nov22 0:06
> >> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f
> >> /var/log/opensm1.log
> >> root 7803 0.0 0.0 51096 668 pts/0 S+ 12:11 0:00 grep open
> >> [root at vortex3l-72 158]# strace -p7975
> >> Process 7975 attached - interrupt to quit
> >> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0},
> >> Process 7975 detached
> >> [root at vortex3l-72 158]# uptime
> >> 12:13:02 up 4 days, 17:05, 5 users, load average: 0.00, 0.00, 0.00
> >> [root at vortex3l-72 158]# date
> >> Mon Nov 27 12:13:05 PST 2006
> >> [root at vortex3l-72 158]# tail /var/log/opensm1.log
> >> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn
> >> 3673M
> >>
> >> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting
> >> Generic Notice type:3 num:66 from LID:0x0000
> >> GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting
> >> Generic Notice type:3 num:66 from LID:0x0000
> >> GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port
> >> 0x5045014b1a0001
> >> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port
> >> 0x5045014b1a0001
> >> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
> >>
> >> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
> >>
> >> [root at vortex3l-72 158]# tail /var/log/opensm2.log
> >> 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00 00 00
> >>
> >> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting
> >> Generic Notice type:3 num:65 from LID:0x0001
> >> GID:0xfe80000000000000,0x005045014b1a0002
> >> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec:
> >> Cannot find destination port with LID:0x0002
> >> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec:
> >> Cannot find destination port with LID:0x0003
> >> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec:
> >> Cannot find destination port with LID:0x0004
> >> Nov 27 12:10:32 146382 [41401960] -> Removed port with
> >> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> >> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108:
> >> Unknown remote side for node 0x0002c9010d26bae0 port 11.
Adding to
> >> light sweep sampling list
> >> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
> >> Path = [0][2]
> >>

From greg.lindahl at qlogic.com Fri Dec 8 15:36:16 2006
From: greg.lindahl at qlogic.com (Greg Lindahl)
Date: Fri, 8 Dec 2006 15:36:16 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To:
References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>
Message-ID: <20061208233616.GA10646@greglaptop>

On Thu, Dec 07, 2006 at 02:44:59PM -0800, Roland Dreier wrote:

> But you should also cope with
> non-OFED (vanilla upstream) drivers, probably by testing
> LINUX_VERSION_CODE too I suppose.

Although RHEL4 shows how this can break down in the future... they
backport kernel stuff while leaving LINUX_VERSION_CODE set to 2.6.9.

-- greg

From venkatesh.babu at 3leafnetworks.com Fri Dec 8 15:44:38 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 15:44:38 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165617878.26559.4952.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
Message-ID: <4579F8E6.3040604@3leafnetworks.com>

I have 3 nodes and 2 IB switches. Port 1 of all nodes is connected to
switch 1 and Port 2 of all nodes is connected to switch 2. So each switch
creates its own subnet, and hence I have two instances of OpenSM for each
port. I have two OpenSMs running with priority 1 on node 1 and two
OpenSMs running with priority 13 on node 2. Node 3 doesn't have any
OpenSMs, just the OFED kernel modules. I reboot node 2 every
10 minutes. Since it has the highest priority, every time it boots up it
grabs the mastership from node 1. It works most of the time, except
when this problem occurs.

When this problem occurs, node 3 shows the old/stale SMLID information.
But if you reload the OFED drivers or reboot the node to get a new LID
assignment, it shows SMLID as 0. Even though node 1's SMLID and port LID
are the same, it has not completely asserted the mastership.
See the log messages below -

[root ~]# ibv_devinfo
...
                port:   1
                        state:          PORT_INIT (2)
                        max_mtu:        2048 (4)
                        active_mtu:     512 (2)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00

                port:   2
                        state:          PORT_INIT (2)
                        max_mtu:        2048 (4)
                        active_mtu:     512 (2)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00

The strace output is shown below -
[root~]# strace -p 7518
Process 7518 attached - interrupt to quit
restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x335d) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0

The GDB output is shown below -
[root ~]# gdb /usr/bin/opensm 7518
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/bin/opensm, process 7518
Reading symbols from /usr/lib/libibumad.so.1...done.
Loaded symbols for /usr/lib/libibumad.so.1
Reading symbols from /usr/lib/libopensm.so.1...done.
Loaded symbols for /usr/lib/libopensm.so.1
Reading symbols from /usr/lib/libosmcomp.so.1...done.
Loaded symbols for /usr/lib/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182896213152 (LWP 7518)]
[New Thread 1136679264 (LWP 7544)]
[New Thread 1126189408 (LWP 7543)]
[New Thread 1115699552 (LWP 7542)]
[New Thread 1105209696 (LWP 7541)]
[New Thread 1094719840 (LWP 7540)]
[New Thread 1084229984 (LWP 7534)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/lib/libosmvendor.so.1...done.
Loaded symbols for /usr/lib/libosmvendor.so.1
Reading symbols from /usr/lib/libibcommon.so.1...done.
Loaded symbols for /usr/lib/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1 0x00000031603bf368 in usleep () from /lib64/tls/libc.so.6
#2 0x000000316080df32 in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
#3 0x000000000040584e in main ()
(gdb) print osm_hup_flag
$1 = 0
(gdb)

Following is the log output. It enters the MASTER state, but it
never shows the "SUBNET UP" event. It gets stuck.
[root ~]# tail /var/log/opensm1.log
Dec 04 15:59:35 573040 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3726M

Dec 04 15:59:35 783462 [9576BCA0] -> osm_report_notice: Reporting
Generic Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Dec 04 15:59:35 783541 [9576BCA0] -> osm_report_notice: Reporting
Generic Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Dec 04 15:59:35 783589 [9576BCA0] -> osm_vendor_bind: Binding to port
0x5045014b1a0001
Dec 04 15:59:35 787924 [9576BCA0] -> osm_vendor_bind: Binding to port
0x5045014b1a0001
Dec 04 15:59:35 800404 [0000] -> Entering STANDBY state

Dec 04 15:59:36 053784 [0000] -> Entering MASTER state

Hal Rosenstock wrote:

>On Fri, 2006-12-08 at 17:12, Venkatesh Babu wrote:
>
>>I have got the same problem with OFED 1.1 stack also, but the frequency
>>is less. I had to try 120 fail overs (by rebooting the highest priority
>>OpenSM server) before getting into this problem.
>>
>
>If I understand you correctly, you reboot the master SM and the standby
>does not takeover (become master). Is that correct ?
>
>Is this with 2 SMs or more ?
>
>>At this state OpenSM doesn't update anything to the log files;
>>doesn't assign the LIDs to the other nodes; doesn't respond
>>to the multi cast join operations. Even another OpenSM is
>>started on another node with higher priority it can
>>not become the master. The only way to recover from this is by killing
>>the stuck OpenSM.
>>
>
>What SMLID do the nodes in the subnet point to ?
>
>Can you determine where is it stuck ? Sounds like it could be in some
>tight loop. Can you build with gdb and attach when this occurs to see ?
>
>-- Hal
>
>> VBabu

From halr at voltaire.com Fri Dec 8 15:57:17 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 18:57:17 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <4579F8E6.3040604@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
 <4579F8E6.3040604@3leafnetworks.com>
Message-ID: <1165622233.26559.8108.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 18:44, Venkatesh Babu wrote:
> I have 3 nodes and 2 IB switches. Port 1 of all nodes connected to
> switch 1 and Port2 of all nodes connected to switch 2. So each switch
> creates its own subnet and hence I have two instances of OpenSM for each
> port.

And the two switches are not connected to each other, right ?

> I have two OpenSMs running with priority 1 on node1 and two
> OpenSM's running with priority 13 on node 2.

Do you set a different subnet prefix (other than the default on one) ?
Not sure if this matters yet in OpenIB but it might.

> Node 3 doesn't have any
> OpenSM's but just a OFED kernel modules. I reboot the node 2 every
> 10minutes. Since it has the highest priority, every time it boots up it
> grabs the mastership from the node 1. It works most of the time, except
> when this problem occurs.

Now I understand the scenario.

> When this problem occurs, node 3 shows the old/stale SMLID information.
> But if you reload the ofed drivers or reboot the node to get the new LID
> assignment it shows SMLID as 0.

That's consistent with the SM not really taking over. Just wanted to be
sure.

> Even though Node 1's SMLID and port LID
> are same, it was not completely asserted the mastership.

OK.

> See the log messages below -
>
> [root ~]# ibv_devinfo
> ...
>                 port:   1
>                         state:          PORT_INIT (2)
>                         max_mtu:        2048 (4)
>                         active_mtu:     512 (2)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_INIT (2)
>                         max_mtu:        2048 (4)
>                         active_mtu:     512 (2)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
> The strace output is shown below -
> [root~]# strace -p 7518
> Process 7518 attached - interrupt to quit
> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x335d) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
>
> The GDB output is shown below -
> [root ~]# gdb /usr/bin/opensm 7518
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>
> Attaching to program: /usr/bin/opensm, process 7518
> Reading symbols from /usr/lib/libibumad.so.1...done.
> Loaded symbols for /usr/lib/libibumad.so.1
> Reading symbols from /usr/lib/libopensm.so.1...done.
> Loaded symbols for /usr/lib/libopensm.so.1
> Reading symbols from /usr/lib/libosmcomp.so.1...done.
> Loaded symbols for /usr/lib/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182896213152 (LWP 7518)]
> [New Thread 1136679264 (LWP 7544)]
> [New Thread 1126189408 (LWP 7543)]
> [New Thread 1115699552 (LWP 7542)]
> [New Thread 1105209696 (LWP 7541)]
> [New Thread 1094719840 (LWP 7540)]
> [New Thread 1084229984 (LWP 7534)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/lib/libosmvendor.so.1...done.
> Loaded symbols for /usr/lib/libosmvendor.so.1
> Reading symbols from /usr/lib/libibcommon.so.1...done.
> Loaded symbols for /usr/lib/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1 0x00000031603bf368 in usleep () from /lib64/tls/libc.so.6
> #2 0x000000316080df32 in cl_thread_suspend (pause_ms=10000) at
> cl_thread.c:125
> #3 0x000000000040584e in main ()
> (gdb) print osm_hup_flag
> $1 = 0
> (gdb)

That's the main thread. It's in the following loop:

  while( !osm_exit_flag ) {
    if (opt.console)
      osm_console(&osm);
    else
      cl_thread_suspend( 10000 );

    if (osm_hup_flag) {
      osm_hup_flag = 0;
      /* a HUP signal should only start a new heavy sweep */
      osm.subn.force_immediate_heavy_sweep = TRUE;
      osm_opensm_sweep( &osm );
    }
  }

What about the other threads ? What are they doing ?

> Following is the log output. It is entring to MASTER state. But it
> doesn't show"SUBNET UP" event. It gets stuck.

I wouldn't expect that given the problem you're hitting. The SUBNET UP
only occurs once the heavy sweep is completed. That's not happening.
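
To make the symptom concrete, here is a minimal sketch in C of the
relationship described above: the SM enters MASTER first, and SUBNET UP
can only follow a heavy sweep that runs to completion. This is a toy
model under stated assumptions, not OpenSM code; every name in it
(sm_model_t, sm_model_take_over, sm_model_heavy_sweep, sm_model_is_stuck)
is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model only: these names are hypothetical, not OpenSM symbols. */
typedef enum { SM_STANDBY, SM_MASTER } sm_state_t;

typedef struct {
    sm_state_t state;
    bool subnet_up; /* set only when a heavy sweep runs to completion */
} sm_model_t;

/* Start as a standby SM; nothing has been configured yet. */
void sm_model_init(sm_model_t *sm)
{
    assert(sm != NULL);
    sm->state = SM_STANDBY;
    sm->subnet_up = false;
}

/* Corresponds to the "Entering MASTER state" log line. */
void sm_model_take_over(sm_model_t *sm)
{
    sm->state = SM_MASTER;
    sm->subnet_up = false; /* the subnet still has to be (re)configured */
}

/* Only a master sweeps; SUBNET UP follows a sweep that completes. */
void sm_model_heavy_sweep(sm_model_t *sm, bool sweep_completes)
{
    if (sm->state == SM_MASTER && sweep_completes)
        sm->subnet_up = true;
}

/* The symptom in this thread: MASTER was logged, SUBNET UP never was. */
bool sm_model_is_stuck(const sm_model_t *sm)
{
    return sm->state == SM_MASTER && !sm->subnet_up;
}
```

In this model the log above corresponds to sm_model_take_over() followed
by a heavy sweep that never completes, which is exactly the state that
sm_model_is_stuck() reports as true. A main thread sleeping in
cl_thread_suspend() is consistent with that state, so the backtraces of
the other threads are the interesting ones.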

-- Hal

From venkatesh.babu at 3leafnetworks.com Fri Dec 8 16:30:01 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 16:30:01 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165622233.26559.8108.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
 <4579F8E6.3040604@3leafnetworks.com>
 <1165622233.26559.8108.camel@hal.voltaire.com>
Message-ID: <457A0389.7030103@3leafnetworks.com>

Hal Rosenstock wrote:

>And the two switches are not connected to each other, right ?
>
Yes, the switches are not connected.

>Do you set a different subnet prefix (other than the default on one) ?
>Not sure if this matters yet in OpenIB but it might.
>
I don't know how to set the subnet prefix. So it may be the default one.

>That's the main thread. It's in the following loop:
>
> while( !osm_exit_flag ) {
> if (opt.console)
> osm_console(&osm);
> else
> cl_thread_suspend( 10000 );
>
> if (osm_hup_flag) {
> osm_hup_flag = 0;
> /* a HUP signal should only start a new heavy sweep */
> osm.subn.force_immediate_heavy_sweep = TRUE;
> osm_opensm_sweep( &osm );
> }
>
>What about the other threads ? What are they doing ?
>
Yes. I got this. It was in this loop. I didn't realize there are
other OpenSM threads running. I need to find that out.

>I wouldn't expect that given the problem your hitting. The SUBNET UP
>only occurs once the heavy sweep is completed. That's not happening.
>
>-- Hal
>
Is the heavy sweep supposed to happen after the failover ?
Is there any documentation on the OpenSM architecture and design ?

VBabu

From halr at voltaire.com Fri Dec 8 16:48:13 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 19:48:13 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <457A0389.7030103@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
 <4579F8E6.3040604@3leafnetworks.com>
 <1165622233.26559.8108.camel@hal.voltaire.com>
 <457A0389.7030103@3leafnetworks.com>
Message-ID: <1165625283.26559.10270.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >And the two switches are not connected to each other, right ?
> >
> Yes, the switches are not connected.
>
> >Do you set a different subnet prefix (other than the default on one) ?
> >Not sure if this matters yet in OpenIB but it might.
> >
> I don't know how to set subnet prefix.

In opensm.opts file:

# Subnet prefix used on this subnet
subnet_prefix 0xfe80000000000000

(that's the default one)

> So it may be default one.
>
> >That's the main thread. It's in the following loop:
> >
> > while( !osm_exit_flag ) {
> > if (opt.console)
> > osm_console(&osm);
> > else
> > cl_thread_suspend( 10000 );
> >
> > if (osm_hup_flag) {
> > osm_hup_flag = 0;
> > /* a HUP signal should only start a new heavy sweep */
> > osm.subn.force_immediate_heavy_sweep = TRUE;
> > osm_opensm_sweep( &osm );
> > }
> >
> >What about the other threads ? What are they doing ?
> >
> Yes. I got this. It was in this loop. I didn't realize there are
> other OpenSM threads running. I need to find that out.

OK.

> >I wouldn't expect that given the problem you're hitting. The SUBNET UP
> >only occurs once the heavy sweep is completed. That's not happening.
> >
> >-- Hal
> >
> Is the heavy sweep supposed to happen after the failover ?

The standby, after determining that the master is non-responsive, will go
back to discovering, but in your configuration it will find no other SM and
will go to master. I think it got that far. Once it transitions to
master, it does a heavy sweep to configure the subnet. Something is
stopping that from completing. I'm not sure what is going wrong.

> Is there any documentation on the OpenSM architecture and design ?

Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
for what an SM is supposed to do.

-- Hal

> VBabu

From venkatesh.babu at 3leafnetworks.com Fri Dec 8 17:03:30 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 17:03:30 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165625283.26559.10270.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
 <4579F8E6.3040604@3leafnetworks.com>
 <1165622233.26559.8108.camel@hal.voltaire.com>
 <457A0389.7030103@3leafnetworks.com>
 <1165625283.26559.10270.camel@hal.voltaire.com>
Message-ID: <457A0B62.2060501@3leafnetworks.com>

Now I have hit another instance of the problem, and I have more
information.

Node1:
======
[root at vortex3l-71 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                 5.1.400
        node_guid:              0050:4501:4a5a:0000
        sys_image_guid:         0050:4501:4a5a:0003
        vendor_id:              0x02c9
        vendor_part_id:         25218
        hw_ver:                 0xA0
        board_id:               ARM0020000001
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       7
                        port_lmc:       0x00

                port:   2
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         4
                        port_lid:       4
                        port_lmc:       0x00

[root at vortex3l-71 ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root 6774 0.0 0.0 92844 1684 ? Sl Dec07 0:06 /usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f /var/log/opensm1.log
root 21537 0.0 0.4 64556 9276 ttyS0 S+ 16:48 0:00 gdb /usr/local/ofed/bin/opensm 6787
root 6787 0.0 0.0 92844 1728 ? Tl Dec07 0:05 /usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f /var/log/opensm2.log
root 22566 0.0 0.0 51072 692 pts/0 S+ 16:53 0:00 grep open
[root at vortex3l-71 ~]# tail /var/log/opensm2.log
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call)
Dec 07 11:29:14 625421 [0000] -> Exiting SM
[root at vortex3l-71 ~]#
[root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/local/ofed/bin/opensm, process 6787
Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182899544416 (LWP 6787)]
[New Thread 1157658976 (LWP 6797)]
[New Thread 1147169120 (LWP 6796)]
[New Thread 1136679264 (LWP 6795)]
[New Thread 1126189408 (LWP 6794)]
[New Thread 1115699552 (LWP 6793)]
[New Thread 1105209696 (LWP 6792)]
[New Thread 1094719840 (LWP 6791)]
[New Thread 1084229984 (LWP 6789)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1 0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
#2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
#3 0x0000000000405b71 in main ()
(gdb) info threads
  9 Thread 1084229984 (LWP 6789) 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  8 Thread 1094719840 (LWP 6791) 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  7 Thread 1105209696 (LWP 6792) 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  6 Thread 1115699552 (LWP 6793) 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  5 Thread 1126189408 (LWP 6794) 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  4 Thread 1136679264 (LWP 6795) 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  3 Thread 1147169120 (LWP 6796) 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 6797) 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
  1 Thread 182899544416 (LWP 6787) 0x0000003857f8ed65 in __nanosleep_nocancel ()
    from /lib64/tls/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1 0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
#2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
#3 0x0000000000405b71 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
#1 0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available.
) at src/umad.c:775
#2 0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not available.
) at src/umad.c:805
#3 0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50) at osm_vendor_ibumad.c:266
#4 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at cl_thread.c:61
#5 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#6 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#7 0x0000000000000000 in ??
() (gdb) thread 3 [Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798, wait_us=10000000, interruptible=1) at cl_event.c:181 #2 0x00000000004362dc in __osm_sm_sweeper () #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at cl_thread.c:61 #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 4 [Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x000000000044d771 in __osm_vl15_poller () #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at cl_thread.c:61 #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 5 [Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61 #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 6 [Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61 #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 7 [Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61 #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 8 [Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61 #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 9 [Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157 #2 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 #3 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 #4 0x0000000000000000 in ?? 
() (gdb) Node 2: ====== [root at localhost ~]# ibv_devinfo hca_id: mthca0 fw_ver: 5.1.400 node_guid: 0050:4501:4a9e:0000 sys_image_guid: 0050:4501:4a9e:0003 vendor_id: 0x02c9 vendor_part_id: 25218 hw_ver: 0xA0 board_id: ARM0020000001 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 2 port_lid: 2 port_lmc: 0x00 port: 2 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 4 port_lid: 2 port_lmc: 0x00 [root at localhost ~]# ps -aux | grep open Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ root 6854 0.0 0.0 92844 1648 ? Sl 16:12 0:00 /usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f /var/log/opensm1.log root 14005 0.0 0.4 64632 9312 ttyS0 S+ 16:46 0:00 gdb /var/log/opensm2.log 6867 root 6867 0.0 0.0 92844 1536 ? Tl 16:12 0:00 /usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f /var/log/opensm2.log root 16223 0.0 0.0 51060 680 pts/0 S+ 16:56 0:00 grep open [root at localhost ~]# tail /var/log/opensm2.log Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table: BFS through all port guids in the subnet ] Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request on MGID: 0xff12401bffff0000 : 0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002 Dec 07 05:15:07 677004 [0000] -> SUBNET UP Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002 Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xffffffffffff0000 : 
0x032e1480ffffffff from port 0x0050450148ba0002 Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002 Dec 07 11:29:03 817752 [0000] -> Exiting SM [root at localhost ~]# [root at localhost ~]# gdb /var/log/opensm2.log 6867 GNU gdb Red Hat Linux (6.3.0.0-1.63rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable format: File format not recognized Attaching to process 6867 Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols found)...done. Using host libthread_db library "/lib64/tls/libthread_db.so.1". Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done. Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1 Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done. Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1 Reading symbols from /lib64/tls/libpthread.so.0...done. [Thread debugging using libthread_db enabled] [New Thread 182899548512 (LWP 6867)] [New Thread 1157658976 (LWP 6884)] [New Thread 1147169120 (LWP 6883)] [New Thread 1136679264 (LWP 6882)] [New Thread 1126189408 (LWP 6881)] [New Thread 1115699552 (LWP 6880)] [New Thread 1105209696 (LWP 6879)] [New Thread 1094719840 (LWP 6878)] [New Thread 1084229984 (LWP 6869)] Loaded symbols for /lib64/tls/libpthread.so.0 Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done. 
Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2 Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done. Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1 Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done. Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1 Reading symbols from /lib64/tls/libc.so.6...done. Loaded symbols for /lib64/tls/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 (gdb) bt #0 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 #1 0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6 #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125 #3 0x0000000000405b71 in main () (gdb) info threads 9 Thread 1084229984 (LWP 6869) 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 8 Thread 1094719840 (LWP 6878) 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 7 Thread 1105209696 (LWP 6879) 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 6 Thread 1115699552 (LWP 6880) 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 5 Thread 1126189408 (LWP 6881) 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 4 Thread 1136679264 (LWP 6882) 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 3 Thread 1147169120 (LWP 6883) 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 2 Thread 1157658976 (LWP 6884) 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6 1 Thread 182899548512 (LWP 6867) 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 (gdb) thread 1 [Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0 0x00000032eec8ed65 in 
__nanosleep_nocancel () from /lib64/tls/libc.so.6 (gdb) bt #0 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 #1 0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6 #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125 #3 0x0000000000405b71 in main () (gdb) thread 2 [Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6 (gdb) bt #0 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6 #1 0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available. ) at src/umad.c:775 #2 0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not available. ) at src/umad.c:805 #3 0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50) at osm_vendor_ibumad.c:266 #4 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at cl_thread.c:61 #5 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 #6 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 #7 0x0000000000000000 in ?? () (gdb) thread 3 [Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798, wait_us=10000000, interruptible=1) at cl_event.c:181 #2 0x00000000004362dc in __osm_sm_sweeper () #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at cl_thread.c:61 #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 4 [Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x000000000044d771 in __osm_vl15_poller () #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at cl_thread.c:61 #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 5 [Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61 #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
() (gdb) thread 6 [Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61 #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? () (gdb) thread 7 [Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 (gdb) bt #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168 #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71 #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61 #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 #6 0x0000000000000000 in ?? 
()
(gdb) thread 8
[Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
#1  0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
#2  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#3  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb)

Node 3:
======

[root at devsunj ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                 5.1.400
        node_guid:              0002:c902:0020:ed58
        sys_image_guid:         0002:c902:0020:ed5b
        vendor_id:              0x02c9
        vendor_part_id:         25218
        hw_ver:                 0xA0
        board_id:               MT_0150000001
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         2
                        port_lid:       1
                        port_lmc:       0x00

                port:   2
                        state:          PORT_INIT (2)
                        max_mtu:        2048 (4)
                        active_mtu:     512 (2)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00

[root at devsunj ~]#

Hal Rosenstock wrote:

>On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
>
>>Hal Rosenstock wrote:
>>
>>>And the two switches are not connected to each other, right ?
>>>
>>    Yes, the switches are not connected.
>>
>>>Do you set a different subnet prefix (other than the default on one) ?
>>>Not sure if this matters yet in OpenIB but it might.
>>>
>>    I don't know how to set subnet prefix.
>>
>
>In opensm.opts file:
>
># Subnet prefix used on this subnet
>subnet_prefix 0xfe80000000000000
>
>(that's the default one)
>
>>    So it may be default one.
>>
>>>That's the main thread. It's in the following loop:
>>>
>>>       while( !osm_exit_flag ) {
>>>         if (opt.console)
>>>           osm_console(&osm);
>>>         else
>>>           cl_thread_suspend( 10000 );
>>>
>>>         if (osm_hup_flag) {
>>>           osm_hup_flag = 0;
>>>           /* a HUP signal should only start a new heavy sweep */
>>>           osm.subn.force_immediate_heavy_sweep = TRUE;
>>>           osm_opensm_sweep( &osm );
>>>         }
>>>
>>>What about the other threads ? What are they doing ?
>>>
>>    Yes. I got this. It was in this loop. I didn't realize there are
>>other OpenSM threads running. I need to find that out.
>>
>
>OK.
>
>>>I wouldn't expect that given the problem you're hitting. The SUBNET UP
>>>only occurs once the heavy sweep is completed. That's not happening.
>>>
>>>-- Hal
>>>
>>    Is the heavy sweep supposed to happen after the failover ?
>>
>
>The standby after determining that the master is non responsive will go
>back to discovering but in your configuration will find no other SM and
>will go to master. I think it got that far.
>
>Once it transitions to master, it does a heavy sweep to configure the
>subnet. Something is stopping that from completing. I'm not sure what is
>going wrong.
>
>>    Is there any documentation on the OpenSM architecture and design ?
>>
>
>Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
>for what an SM is supposed to do.
>
>-- Hal
>
>>    VBabu
>>
>

From halr at voltaire.com  Fri Dec  8 17:38:48 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 20:38:48 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <457A0B62.2060501@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
Message-ID: <1165628315.26559.12385.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 20:03, Venkatesh Babu wrote:
> Now I hit another instance of the problem. Now I have more information.

Was this the same scenario or something different ?

> Node1:
> ======
>
> [root at vortex3l-71 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                 5.1.400
>         node_guid:              0050:4501:4a5a:0000

So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
that right ?

>         sys_image_guid:         0050:4501:4a5a:0003
>         vendor_id:              0x02c9
>         vendor_part_id:         25218
>         hw_ver:                 0xA0
>         board_id:               ARM0020000001
>         phys_port_cnt:          2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         2
>                         port_lid:       7
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         4
>                         port_lid:       4
>                         port_lmc:       0x00
>
> [root at vortex3l-71 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.3/FAQ
> root      6774  0.0  0.0  92844  1684 ?        Sl   Dec07   0:06
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f
> /var/log/opensm1.log
> root     21537  0.0  0.4  64556  9276 ttyS0    S+   16:48   0:00 gdb
> /usr/local/ofed/bin/opensm 6787
> root      6787  0.0  0.0  92844  1728 ?        Tl   Dec07   0:05
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f
> /var/log/opensm2.log
> root     22566  0.0  0.0  51072   692 pts/0    S+   16:53   0:00 grep open
> [root at vortex3l-71 ~]# tail /var/log/opensm2.log
>
>         00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
>         00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
>         00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
> Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error
> on MAD sized umad (Interrupted system call)
> Dec 07 11:29:14 625421 [0000] -> Exiting SM

Does this correspond to when node 2 SM goes down, SM comes up, or
something else ?

Not sure why OpenSM decides to exit (due to this error which should be
recoverable). It then fails to exit (hangs) as the other threads are not
terminated.

Is osm_exit_flag set ? I presume it is but would like verification.

What are the thread_state values of the various threads ?

> [root at vortex3l-71 ~]#
> [root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>
> Attaching to program: /usr/local/ofed/bin/opensm, process 6787
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled] > [New Thread 182899544416 (LWP 6787)] > [New Thread 1157658976 (LWP 6797)] > [New Thread 1147169120 (LWP 6796)] > [New Thread 1136679264 (LWP 6795)] > [New Thread 1126189408 (LWP 6794)] > [New Thread 1115699552 (LWP 6793)] > [New Thread 1105209696 (LWP 6792)] > [New Thread 1094719840 (LWP 6791)] > [New Thread 1084229984 (LWP 6789)] > Loaded symbols for /lib64/tls/libpthread.so.0 > Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done. > Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2 > Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done. > Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1 > Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done. > Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1 > Reading symbols from /lib64/tls/libc.so.6...done. > Loaded symbols for /lib64/tls/libc.so.6 > Reading symbols from /lib64/ld-linux-x86-64.so.2...done. > Loaded symbols for /lib64/ld-linux-x86-64.so.2 > 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > #1 0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6 > #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at > cl_thread.c:125 > #3 0x0000000000405b71 in main () > (gdb) info threads > 9 Thread 1084229984 (LWP 6789) 0x0000003858c088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 8 Thread 1094719840 (LWP 6791) 0x0000003858c088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 7 Thread 1105209696 (LWP 6792) 0x0000003858c088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 6 Thread 1115699552 (LWP 6793) 0x0000003858c088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 5 Thread 1126189408 (LWP 6794) 0x0000003858c088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 
> 4 Thread 1136679264 (LWP 6795) 0x0000003858c088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 3 Thread 1147169120 (LWP 6796) 0x0000003858c08acf in > pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 2 Thread 1157658976 (LWP 6797) 0x0000003857fbcd22 in poll () > from /lib64/tls/libc.so.6 > 1 Thread 182899544416 (LWP 6787) 0x0000003857f8ed65 in > __nanosleep_nocancel > () from /lib64/tls/libc.so.6 > (gdb) thread 1 > [Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0 > 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > #1 0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6 > #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at > cl_thread.c:125 > #3 0x0000000000405b71 in main () > (gdb) thread 2 > [Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0 > 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6 > #1 0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available. > ) at src/umad.c:775 > #2 0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not > available. > ) at src/umad.c:805 > #3 0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50) > at osm_vendor_ibumad.c:266 > #4 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at > cl_thread.c:61 > #5 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #6 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #7 0x0000000000000000 in ?? 
() > (gdb) thread 3 > [Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0 > 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798, > wait_us=10000000, interruptible=1) at cl_event.c:181 > #2 0x00000000004362dc in __osm_sm_sweeper () > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at > cl_thread.c:61 > #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 4 > [Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0 > 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x000000000044d771 in __osm_vl15_poller () > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at > cl_thread.c:61 > #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 5 > [Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0 > 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at > cl_thread.c:61 > #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 6 > [Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0 > 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at > cl_thread.c:61 > #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 7 > [Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0 > 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at > cl_thread.c:61 > #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 8 > [Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0 > 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at > cl_thread.c:61 > #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 9 > [Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0 > 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157 > #2 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0 > #3 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6 > #4 0x0000000000000000 in ?? () > (gdb) > > > Node 2: > ====== Is this when node 2 comes back up and SM is restarted on both ports or is it after the SM is stopped on port 2 ? > [root at localhost ~]# ibv_devinfo > hca_id: mthca0 > fw_ver: 5.1.400 > node_guid: 0050:4501:4a9e:0000 > sys_image_guid: 0050:4501:4a9e:0003 > vendor_id: 0x02c9 > vendor_part_id: 25218 > hw_ver: 0xA0 > board_id: ARM0020000001 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 2 > port_lid: 2 > port_lmc: 0x00 > > port: 2 > state: PORT_INIT (2) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 4 This port still points at the SM on node 1, right ? > port_lid: 2 > port_lmc: 0x00 > > [root at localhost ~]# ps -aux | grep open > Warning: bad syntax, perhaps a bogus '-'? See > /usr/share/doc/procps-3.2.3/FAQ > root 6854 0.0 0.0 92844 1648 ? Sl 16:12 0:00 > /usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f > /var/log/opensm1.log > root 14005 0.0 0.4 64632 9312 ttyS0 S+ 16:46 0:00 gdb > /var/log/opensm2.log 6867 > root 6867 0.0 0.0 92844 1536 ? 
Tl 16:12 0:00 > /usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f > /var/log/opensm2.log > root 16223 0.0 0.0 51060 680 pts/0 S+ 16:56 0:00 grep open > [root at localhost ~]# tail /var/log/opensm2.log > Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table: > BFS through all port guids in the subnet ] > Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop > Tables configured on all switches > Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25: > Received an invalid delete request on MGID: 0xff12401bffff0000 : > 0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002 > Dec 07 05:15:07 677004 [0000] -> SUBNET UP > > Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: > method = SubnAdmSet, scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002 > Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: > method = SubnAdmSet, scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x0050450148ba0002 > Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: > method = SubnAdmSet, scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002 > Dec 07 11:29:03 817752 [0000] -> Exiting SM You stopped this SM, right ? > [root at localhost ~]# > [root at localhost ~]# gdb /var/log/opensm2.log 6867 Why gdb this node's SM ? I'm not following you. Should point at executable not log. > GNU gdb Red Hat Linux (6.3.0.0-1.63rh) > Copyright 2004 Free Software Foundation, Inc. 
> GDB is free software, covered by the GNU General Public License, and you are > welcome to change it and/or distribute copies of it under certain > conditions. > Type "show copying" to see the conditions. > There is absolutely no warranty for GDB. Type "show warranty" for details. > This GDB was configured as > "x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable > format: File format not recognized > > Attaching to process 6867 > Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols > found)...done. > Using host libthread_db library "/lib64/tls/libthread_db.so.1". > Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done. > Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1 > Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done. > Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1 > Reading symbols from /lib64/tls/libpthread.so.0...done. > [Thread debugging using libthread_db enabled] > [New Thread 182899548512 (LWP 6867)] > [New Thread 1157658976 (LWP 6884)] > [New Thread 1147169120 (LWP 6883)] > [New Thread 1136679264 (LWP 6882)] > [New Thread 1126189408 (LWP 6881)] > [New Thread 1115699552 (LWP 6880)] > [New Thread 1105209696 (LWP 6879)] > [New Thread 1094719840 (LWP 6878)] > [New Thread 1084229984 (LWP 6869)] > Loaded symbols for /lib64/tls/libpthread.so.0 > Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done. > Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2 > Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done. > Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1 > Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done. > Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1 > Reading symbols from /lib64/tls/libc.so.6...done. > Loaded symbols for /lib64/tls/libc.so.6 > Reading symbols from /lib64/ld-linux-x86-64.so.2...done. 
> Loaded symbols for /lib64/ld-linux-x86-64.so.2 > 0x00000032eec8ed65 in __nanosleep_nocancel () > from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > #1 0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6 > #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at > cl_thread.c:125 > #3 0x0000000000405b71 in main () > (gdb) info threads > 9 Thread 1084229984 (LWP 6869) 0x00000032ef908acf in > pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 8 Thread 1094719840 (LWP 6878) 0x00000032ef9088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 7 Thread 1105209696 (LWP 6879) 0x00000032ef9088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 6 Thread 1115699552 (LWP 6880) 0x00000032ef9088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 5 Thread 1126189408 (LWP 6881) 0x00000032ef9088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 4 Thread 1136679264 (LWP 6882) 0x00000032ef9088da in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 3 Thread 1147169120 (LWP 6883) 0x00000032ef908acf in > pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0 > 2 Thread 1157658976 (LWP 6884) 0x00000032eecbcd22 in poll () > from /lib64/tls/libc.so.6 > 1 Thread 182899548512 (LWP 6867) 0x00000032eec8ed65 in > __nanosleep_nocancel > () from /lib64/tls/libc.so.6 > (gdb) thread 1 > [Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0 > 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6 > #1 0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6 > #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at > cl_thread.c:125 > #3 0x0000000000405b71 in main () > (gdb) thread 2 > [Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0 > 
0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6 > (gdb) bt > #0 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6 > #1 0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available. > ) at src/umad.c:775 > #2 0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not > available. > ) at src/umad.c:805 > #3 0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50) > at osm_vendor_ibumad.c:266 > #4 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at > cl_thread.c:61 > #5 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #6 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #7 0x0000000000000000 in ?? () > (gdb) thread 3 > [Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0 > 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798, > wait_us=10000000, interruptible=1) at cl_event.c:181 > #2 0x00000000004362dc in __osm_sm_sweeper () > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at > cl_thread.c:61 > #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 4 > [Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0 > 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x000000000044d771 in __osm_vl15_poller () > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at > cl_thread.c:61 > #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 5 > [Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0 > 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at > cl_thread.c:61 > #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 6 > [Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0 > 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at > cl_thread.c:61 > #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 7 > [Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0 > 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at > cl_thread.c:61 > #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? 
() > (gdb) thread 8 > [Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0 > 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540, > wait_us=4294967295, interruptible=1) at cl_event.c:168 > #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468) > at cl_threadpool.c:71 > #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at > cl_thread.c:61 > #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #6 0x0000000000000000 in ?? () > (gdb) thread 9 > [Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0 > 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from > /lib64/tls/libpthread.so.0 > (gdb) bt > #0 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () > from /lib64/tls/libpthread.so.0 > #1 0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168 > #2 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0 > #3 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6 > #4 0x0000000000000000 in ?? 
() > (gdb) > > > Node 3: > ====== > > [root at devsunj ~]# ibv_devinfo > hca_id: mthca0 > fw_ver: 5.1.400 > node_guid: 0002:c902:0020:ed58 > sys_image_guid: 0002:c902:0020:ed5b > vendor_id: 0x02c9 > vendor_part_id: 25218 > hw_ver: 0xA0 > board_id: MT_0150000001 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 2048 (4) > active_mtu: 2048 (4) > sm_lid: 2 > port_lid: 1 > port_lmc: 0x00 > > port: 2 > state: PORT_INIT (2) > max_mtu: 2048 (4) > active_mtu: 512 (2) > sm_lid: 0 > port_lid: 0 > port_lmc: 0x00 > > [root at devsunj ~]# > > > > > Hal Rosenstock wrote: > > >On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote: > > > > > >>Hal Rosenstock wrote: > >> > >> > >> > >>>And the two switches are not connected to each other, right ? > >>> > >>> > >>> > >>> > >> Yes, the switches are not connected. > >> > >> > >> > >>>Do you set a different subnet prefix (other than the default on one) ? > >>>Not sure if this matters yet in OpenIB but it might. > >>> > >>> > >>> > >>> > >> I don't know how to set the subnet prefix. > >> > >> > > > >In opensm.opts file: > > > ># Subnet prefix used on this subnet > >subnet_prefix 0xfe80000000000000 > > > >(that's the default one) > > > > > > > >> So it may be the default one. > >> > >> > >> > >>>That's the main thread. It's in the following loop: > >>> > >>> while( !osm_exit_flag ) { > >>> if (opt.console) > >>> osm_console(&osm); > >>> else > >>> cl_thread_suspend( 10000 ); > >>> > >>> if (osm_hup_flag) { > >>> osm_hup_flag = 0; > >>> /* a HUP signal should only start a new heavy sweep */ > >>> osm.subn.force_immediate_heavy_sweep = TRUE; > >>> osm_opensm_sweep( &osm ); > >>> } > >>> > >>>What about the other threads ? What are they doing ? > >>> > >>> > >>> > >>> > >> Yes. I got this. It was in this loop. I didn't realize there are > >>other OpenSM threads running. I need to find that out. > >> > >> > > > >OK. > > > > > > > >>>I wouldn't expect that given the problem you're hitting. 
The SUBNET UP > >>>only occurs once the heavy sweep is completed. That's not happening. > >>> > >>>-- Hal > >>> > >>> > >>> > >>> > >> Is the heavy sweep supposed to happen after the failover ? > >> > >> > > > >The standby after determining that the master is non-responsive will go > >back to discovering but in your configuration will find no other SM and > >will go to master. I think it got that far. > > > >Once it transitions to master, it does a heavy sweep to configure the > >subnet. Something is stopping that from completing. I'm not sure what is > >going wrong. > > > > > > > >> Is there any documentation on the OpenSM architecture and design ? > >> > >> > > > >Just the code AFAIK. You can read the SM and SA sections of IBA volume 1 > >for what an SM is supposed to do. > > > >-- Hal > > > > > > > >> VBabu > >> > >> > > > > > > From venkatesh.babu at 3leafnetworks.com Fri Dec 8 18:25:20 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Fri, 08 Dec 2006 18:25:20 -0800 Subject: [openib-general] Unreliable OpemSM failover In-Reply-To: <1165628315.26559.12385.camel@hal.voltaire.com> References: <1164117837.4381.48210.camel@hal.voltaire.com> <456B7CC8.5060806@3leafnetworks.com> <1164674885.11808.760.camel@hal.voltaire.com> <4579E333.4000901@3leafnetworks.com> <1165617878.26559.4952.camel@hal.voltaire.com> <4579F8E6.3040604@3leafnetworks.com> <1165622233.26559.8108.camel@hal.voltaire.com> <457A0389.7030103@3leafnetworks.com> <1165625283.26559.10270.camel@hal.voltaire.com> <457A0B62.2060501@3leafnetworks.com> <1165628315.26559.12385.camel@hal.voltaire.com> Message-ID: <457A1E90.5040606@3leafnetworks.com> Hal Rosenstock wrote: >Was this the same scenario or something different ? > > I had killed the previous OpenSM instance. So I lost that information. It is the same OpenSM failover issue, reproduced using the exact same setup and scripts. It is another instance of the problem. >So your OUI is 0x005045 ? That appears to be registered to Rioworks. 
Is >that right ? > > > Yes, that is right. They are the OUI vendors for the IB HCAs. >Does this correspond to when node 2 SM goes down, SM comes up, or >something else ? > > I don't know the exact sequence when this message is displayed. All I can say is that it was the last message printed by the OpenSM. I am not rebooting node 1 or killing the OpenSM. It stays constant. I have a script to reboot node 2 every couple of minutes. It will stop rebooting if it finds one of these conditions - 1. SM1 on port1 is master but SM2 on port2 is not master 2. SM2 on port2 is master but SM1 on port1 is not master 3. Port1/2 is not ACTIVE 4. Port1/2's sm_lid/port lid is zero I am capturing all of this output at the end of the test, when the script terminated. >Not sure why OpenSM decides to exit (due to this error which should be >recoverable). It then fails to exit (hangs) as the other threads are not >terminated. > >Is osm_exit_flag set ? I presume it is but would like verification. >What are the thread_state values of the various threads ? > > Unfortunately someone powered off Node1 while I was debugging, so I cannot find this out. On Node2 : (gdb) p osm_exit_flag $1 = 0 How do I find out the thread_state value ? >>Node 2: >>====== >> >> > >Is this when node 2 comes back up and SM is restarted on both ports or >is it after the SM is stopped on port 2 ? > > > As I said earlier, this is the snapshot when the script stopped rebooting as I described above. >> port: 2 >> state: PORT_INIT (2) >> max_mtu: 2048 (4) >> active_mtu: 2048 (4) >> sm_lid: 4 >> >> > >This port still points at the SM on node 1, right ? > > Yes that is right. > > >> port_lid: 2 >> port_lmc: 0x00 >> >> >>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: >>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002 >>Dec 07 11:29:03 817752 [0000] -> Exiting SM >> >> > >You stopped this SM, right ? > > No I didn't stop the SM. 
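The four stop conditions of the reboot script listed earlier in this message can be sketched as a small check. This is only an illustration, not the actual script: the `PortStatus` record and its `sm_is_master` field are hypothetical, and how the real script learns whether each OpenSM instance is master is not stated in this thread. The other fields mirror what ibv_devinfo prints per port.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortStatus:
    """Hypothetical snapshot of one HCA port; fields mirror ibv_devinfo output."""
    state: str          # e.g. "PORT_ACTIVE" or "PORT_INIT"
    sm_lid: int
    port_lid: int
    sm_is_master: bool  # assumption: the script can tell if the SM on this port is master


def stop_reasons(port1: PortStatus, port2: PortStatus) -> List[str]:
    """Return every stop condition (1-4 from the mail) that currently holds."""
    reasons = []
    # Conditions 1 and 2: exactly one of the two SMs is master.
    if port1.sm_is_master and not port2.sm_is_master:
        reasons.append("SM1 on port1 is master but SM2 on port2 is not")
    if port2.sm_is_master and not port1.sm_is_master:
        reasons.append("SM2 on port2 is master but SM1 on port1 is not")
    # Conditions 3 and 4 are checked for each port in turn.
    for name, port in (("port1", port1), ("port2", port2)):
        if port.state != "PORT_ACTIVE":
            reasons.append(f"{name} is not ACTIVE")
        if port.sm_lid == 0 or port.port_lid == 0:
            reasons.append(f"{name} has a zero sm_lid or port_lid")
    return reasons
```

With the Node 2 snapshot quoted in this thread (port 2 in PORT_INIT with sm_lid 4 and port_lid 2), condition 3 alone would stop the script, whatever the SM states are.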
>>[root at localhost ~]# >>[root at localhost ~]# gdb /var/log/opensm2.log 6867 >> >> > >Why gdb this node's SM ? I'm not following you. > >Should point at executable not log. > > You are right. It is a cut and paste error. VBabu From halr at voltaire.com Sat Dec 9 04:12:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Dec 2006 07:12:39 -0500 Subject: [openib-general] Unreliable OpemSM failover In-Reply-To: <457A1E90.5040606@3leafnetworks.com> References: <1164117837.4381.48210.camel@hal.voltaire.com> <456B7CC8.5060806@3leafnetworks.com> <1164674885.11808.760.camel@hal.voltaire.com> <4579E333.4000901@3leafnetworks.com> <1165617878.26559.4952.camel@hal.voltaire.com> <4579F8E6.3040604@3leafnetworks.com> <1165622233.26559.8108.camel@hal.voltaire.com> <457A0389.7030103@3leafnetworks.com> <1165625283.26559.10270.camel@hal.voltaire.com> <457A0B62.2060501@3leafnetworks.com> <1165628315.26559.12385.camel@hal.voltaire.com> <457A1E90.5040606@3leafnetworks.com> Message-ID: <1165666352.26559.39788.camel@hal.voltaire.com> On Fri, 2006-12-08 at 21:25, Venkatesh Babu wrote: > Hal Rosenstock wrote: > > >Was this the same scenario or something different ? > > > > > I had killed the previous OpenSM instance. So I lost that information. > It is the same OpenSM failover issue, reproduced using the exact same setup and > scripts. It is another instance of the problem. > > >So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is > >that right ? > > > > > > > Yes, that is right. They are the OUI vendors for the IB HCAs. > > >Does this correspond to when node 2 SM goes down, SM comes up, or > >something else ? > > > > > I don't know the exact sequence when this message is displayed. All I > can say is that it was the last message printed by the OpenSM. I am not > rebooting node 1 or killing the OpenSM. It stays constant. > I have a script to reboot node 2 every couple of minutes. It will > stop rebooting if it finds one of these conditions - > 1. 
SM1 on port1 is master but SM2 on port2 is not master > 2. SM2 on port2 is master but SM1 on port1 is not master > 3. Port1/2 is not ACTIVE > 4. Port1/2's sm_lid/port lid is zero Understood. > I am capturing all of this output at the end of the test, when the > script terminated. > > >Not sure why OpenSM decides to exit (due to this error which should be > >recoverable). It then fails to exit (hangs) as the other threads are not > >terminated. > > > >Is osm_exit_flag set ? I presume it is but would like verification. > >What are the thread_state values of the various threads ? > > > > > Unfortunately someone powered off Node1 while I was debugging, so I > cannot find this out. > > On Node2 : > (gdb) p osm_exit_flag > $1 = 0 I was interested in the one on Node1 when it appeared to be trying to exit (which it shouldn't be but is) and the other threads don't seem to terminate. > How do I find out the thread_state value ? It's a variable in the SM structure (in the SM thread). > >>Node 2: > >>====== > >> > >> > > > >Is this when node 2 comes back up and SM is restarted on both ports or > >is it after the SM is stopped on port 2 ? > > > > > > > As I said earlier, this is the snapshot when the script stopped > rebooting as I described above. > > >> port: 2 > >> state: PORT_INIT (2) > >> max_mtu: 2048 (4) > >> active_mtu: 2048 (4) > >> sm_lid: 4 > >> > >> > > > >This port still points at the SM on node 1, right ? > > > > > Yes that is right. > > > > > > >> port_lid: 2 > >> port_lmc: 0x00 > >> > >> > >>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > >>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002 > >>Dec 07 11:29:03 817752 [0000] -> Exiting SM > >> > >> > > > >You stopped this SM, right ? > > > > > No I didn't stop the SM. > > >>[root at localhost ~]# > >>[root at localhost ~]# gdb /var/log/opensm2.log 6867 > >> > >> > > > >Why gdb this node's SM ? I'm not following you. > > > >Should point at executable not log. 
> > > > > You are right. It is a cut and paste error. One more thing: When you upgraded to OFED 1.2, did you build and install the management libraries (libibcommon, libibumad are important here and libibmad for diags) ? -- Hal > > VBabu From halr at voltaire.com Sat Dec 9 05:48:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Dec 2006 08:48:28 -0500 Subject: [openib-general] Unreliable OpemSM failover In-Reply-To: <1165666352.26559.39788.camel@hal.voltaire.com> References: <1164117837.4381.48210.camel@hal.voltaire.com> <456B7CC8.5060806@3leafnetworks.com> <1164674885.11808.760.camel@hal.voltaire.com> <4579E333.4000901@3leafnetworks.com> <1165617878.26559.4952.camel@hal.voltaire.com> <4579F8E6.3040604@3leafnetworks.com> <1165622233.26559.8108.camel@hal.voltaire.com> <457A0389.7030103@3leafnetworks.com> <1165625283.26559.10270.camel@hal.voltaire.com> <457A0B62.2060501@3leafnetworks.com> <1165628315.26559.12385.camel@hal.voltaire.com> <457A1E90.5040606@3leafnetworks.com> <1165666352.26559.39788.camel@hal.voltaire.com> Message-ID: <1165672098.26559.43885.camel@hal.voltaire.com> On Sat, 2006-12-09 at 07:12, Hal Rosenstock wrote: > One more thing: > > When you upgraded to OFED 1.2, did you build and install the management > libraries (libibcommon, libibumad are important here and libibmad for > diags) ? Does the problem always occur on the "second" subnet (port 2's subnet) or does it ever occur on port 1's subnet ? Can you totally not configure the "port 1" subnet on all machines (and OpenSM on the port 1's where that runs) and see if it is reproducible ? Thanks. 
-- Hal From eitan at mellanox.co.il Sat Dec 9 06:13:01 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 9 Dec 2006 16:13:01 +0200 Subject: [openib-general] [PATCH] osm: Routing Tables are full of UNREACHABLE instead of real route Message-ID: <6C2C79E72C305246B504CBA17B5500C976D272@mtlexch01.mtl.com> Hi Sasha, Your proposal for moving all "dump" files generation to end of sweep - just before "SUBNET UP" is reported - makes perfect sense to me. But it is a bit lower in priority than the rest of the stuff. Not sure if it is worth tackling right now. Eitan Eitan Zahavi Senior Engineering Director, Software Architect Mellanox Technologies LTD Tel:+972-4-9097208 Fax:+972-4-9593245 P.O. Box 586 Yokneam 20692 ISRAEL > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Friday, December 08, 2006 11:55 PM > To: Eitan Zahavi > Cc: Hal Rosenstock; Yevgeny Kliteynik; OPENIB GENERAL > Subject: Re: [PATCH] osm: Routing Tables are full of UNREACHABLE instead of > real route > > Hi Eitan, > > On 17:12 Thu 07 Dec , Eitan Zahavi wrote: > > Hi Hal, > > > > I resolved the mystery behind the osm.fdbs that is now full of > > UNREACHABLE instead of correct out ports. > > > > The problem is a consequence of the new code that does not use the > > switch LFT blocks for the intermediate LFT assignments: > > The idea of having incremental updates only relies on a temporary buffer > > that the routing algorithm fills. > > Then it is sent to the wire only if there is a diff between the switch > > LFT tables (from the SMDB) and the temporary buffer. > > > > So the switch LFT tables are not being directly updated by the routing > > algorithm - but only by the GetResp obtained as reply to the setting. > > Until this stage of the description - everything looks right. > > > > But what is wrong is that the dump of LFT tables is invoked before the > > GetResp is obtained. 
> > So if only a single sweep is invoked the resulting osm.fdbs show the > > original state of the SMDB tables which is full of 0xFF = UNREACHABLE. > > Right. > > > > > The patch below is taking the easy way and should probably be revisited. > > Instead of having a separate algorithm step for dumping out the > > resulting GetResp data after all LFT responses were obtained it just > > copies the sent LFT blocks to the SMDB. > > Would it not be better just to move all dumps to the end of the OpenSM heavy > sweep? This should be simple, right? > > Sasha > > > > > I think we need to have at least this simple patch until we have the > > dump moved to a new algorithm step. > > > > Thanks > > Eitan > > > > Signed-off-by: Eitan Zahavi > > > ================================================================ > ===== > > > > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > > index 5a55da8..3a62c7f 100644 > > --- a/osm/opensm/osm_ucast_mgr.c > > +++ b/osm/opensm/osm_ucast_mgr.c > > @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table( > > "osm_ucast_mgr_set_fwd_table: ERR 3A05: " > > "Sending linear fwd. tbl. block failed (%s)\n", > > ib_get_err_str( status ) ); > > - } > > + } else { > > + /* > > + HACK: for now we will assume we succeeded to send > > + and set the local DB based on it. 
This should allow > > + us to immediatly dump out our routing > > + */ > > + osm_switch_set_ft_block( > > + p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho); > > + } > > } > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > From eitan at mellanox.co.il Sat Dec 9 06:26:55 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 09 Dec 2006 16:26:55 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <20061208221001.GG9193@sashak.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> Message-ID: <457AC7AF.5090202@mellanox.co.il> Hi Sasha, Without another devel branch I will not be able to test patches before they make it into the trunk. I do not know how to automatically extract patches from the mails into a tree so that I can have an automatic patch check. I am not a great fan of a new branch either. So we need to agree that regression runs that result in bug reports after commit to the trunk are our mode of work. I do not have a big issue with this (but it is more work for Hal). Eitan Sasha Khapyorsky wrote: > On 18:42 Fri 08 Dec , Eitan Zahavi wrote: > >> Instead of relying on bug reading I use automatic regression. I wish we >> could agree on some regression that >> each developer will have to run before patches are committed to the trunk. >> On my side I would love to have an automatic way to include all the >> patches posted (one at a time) run "dead or alive" check >> and provide feedback. Currently my automation is limited to testing the >> trunk. So I will always be complaining after the patches are >> committed. I think this is the way most other components testing works. >> >> What kind of regression suite do you and Sasha use? >> > > On my side it clearly depends on the kind of changes. In general I would > call this "uni-testing". > > >> Can we agree on minimal pre-commit testing? 
>> Can we have a branch for that sake where all patches will first have to >> go into for 2 days? (it will allow for pre-trunk testing). >> > > One more development branch? Will you test (or even see) this? If so I > can publish the "fresh" tree. > > Sasha > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sat Dec 9 06:35:10 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 09 Dec 2006 16:35:10 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <1165617195.26559.4435.camel@hal.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <1165617195.26559.4435.camel@hal.voltaire.com> Message-ID: <457AC99E.8050402@mellanox.co.il> Hal Rosenstock wrote: > On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: > >> Hal Rosenstock wrote: >> >>> Hi Eitan, >>> >>> Just wanted to close the loop on the OpenSM issues of the last couple >>> days. >>> >>> 1. When can you supply an OpenSM verbose log for the InformInfo >>> subscribe problem you reported earlier today ? Failing that, I don't >>> know how to reproduce this. >>> >>> >> Attached >> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure. >>> 4. I encourage you to look at and comment on the OpenSM patches rather >>> than waiting for them to be in the tree. >>> >>> >> I am sure you did not mean to, but now I have to admit my limited skills >> in catching bugs by reading patches :-( . >> > > Not just read, but they are there to try out as well. > I will need an automatic flow for that sake. I can not keep up with the amount of patches manually. But I do not know how to automatically convert the mails into patches into a tree. 
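A minimal sketch of the missing step, pulling the diff out of one posted mail so that an automated check can try it against a tree, might look as follows. It rests on assumptions not stated in this thread: the mail is plain text (not multipart), the patch starts at the first "diff --git" line, and a "-- " line marks the signature. `extract_patch` is a hypothetical helper name. Tools that read mailboxes directly, such as `git am`, may already cover this, and patches that were line-wrapped by a mailer would still fail to apply.

```python
import email
from typing import Optional


def extract_patch(raw_message: str) -> Optional[str]:
    """Return the unified diff embedded in a plain-text mail, or None if absent."""
    msg = email.message_from_string(raw_message)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart mails are out of scope for this sketch.
        return None

    lines = body.splitlines()
    # Assumption: the patch begins at the first "diff --git" line.
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("diff --git ")),
        None,
    )
    if start is None:
        return None

    patch_lines = []
    for line in lines[start:]:
        if line == "-- ":  # conventional signature separator ends the patch
            break
        patch_lines.append(line)
    return "\n".join(patch_lines) + "\n"
```

The returned text could then be written to a file and handed to whatever applies patches in the regression flow, one mail at a time.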
> You could try out the patches and do the same thing before they are > committed. > > I have automation based on the committed tree that pull it (git trem) , compile and run regression. Actually this is how all other code is handled too. From sashak at voltaire.com Sat Dec 9 09:46:07 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 19:46:07 +0200 Subject: [openib-general] [PATCH] osm: Routing Tables are full of UNREACHABLE instead of real route In-Reply-To: <6C2C79E72C305246B504CBA17B5500C976D272@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C976D272@mtlexch01.mtl.com> Message-ID: <20061209174607.GK10000@sashak.voltaire.com> On 16:13 Sat 09 Dec , Eitan Zahavi wrote: > Hi Sasha, > > Your proposal for moving all "dump" files generation to end of sweep - > just before "SUBNET UP" is reported - makes perfect sense to me. > > But it is a bit lower in priority to the rest of the stuff. > Not sure if it worth tackling right now. Ok, I may do this. This should not be big deal. Sasha > > Eitan > > Eitan Zahavi > Senior Engineering Director, Software Architect > Mellanox Technologies LTD > Tel:+972-4-9097208 > Fax:+972-4-9593245 > P.O. Box 586 Yokneam 20692 ISRAEL > > > -----Original Message----- > > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > > Sent: Friday, December 08, 2006 11:55 PM > > To: Eitan Zahavi > > Cc: Hal Rosenstock; Yevgeny Kliteynik; OPENIB GENERAL > > Subject: Re: [PATCH] osm: Routing Tables are full of UNREACHABLE > instead of > > real route > > > > Hi Eitan, > > > > On 17:12 Thu 07 Dec , Eitan Zahavi wrote: > > > Hi Hal, > > > > > > I resolved the mystery behind the osm.fdbs that is now full of > > > UNREACHABLE instead of correct out ports. > > > > > > The problem is a consequence of the new code that does not use the > > > switch LFT blocks for the intermediate LFT assignments: > > > The idea of having incremental updates only relies on temporary > buffer > > > that the routing algorithm fills. 
> > > Then it is sent to the wire only if there is a diff between the > switch > > > LFT tables (from the SMDB) and the temporary buffer. > > > > > > So the switch LFT tables are not being directly updated by the > routing > > > algorithm - but only by the GetResp obtained as reply to the > setting. > > > Until this stage of the description - everything looks right. > > > > > > But what is wrong is that the dump of LFT tables is invoked before > the > > > GetResp is obtained. > > > So if only a single sweep is invoked the resulting osm.fdbs show the > > > original state of the SMDB tables whicg is full of 0xFF = > UNREACHABLE. > > > > Right. > > > > > > > > The patch below is taking the easy way and should be probably > revisited. > > > Instead of having a separate algorithm step for dumping out the > > > resulting GetResp data after all LFT responses were obtained it just > > > copies the sent LFT blocks to the SMDB. > > > > Would not this be better just to move all dumps at end of the OpenSM > heavy > > sweep. This should be simple, right? > > > > Sasha > > > > > > > > I think we need to have at least this simple patch until we have the > > > dump move to a new algorithm step. > > > > > > Thanks > > > Eitan > > > > > > Signed-off-by: Eitan Zahavi > > > > > ================================================================ > > ===== > > > > > > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > > > index 5a55da8..3a62c7f 100644 > > > --- a/osm/opensm/osm_ucast_mgr.c > > > +++ b/osm/opensm/osm_ucast_mgr.c > > > @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table( > > > "osm_ucast_mgr_set_fwd_table: ERR 3A05: " > > > "Sending linear fwd. tbl. block failed (%s)\n", > > > ib_get_err_str( status ) ); > > > - } > > > + } else { > > > + /* > > > + HACK: for now we will assume we succeeded to send > > > + and set the local DB based on it. 
This should allow > > > + us to immediatly dump out our routing > > > + */ > > > + osm_switch_set_ft_block( > > > + p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho); > > > + } > > > } > > > > > > OSM_LOG_EXIT( p_mgr->p_log ); > > > From sashak at voltaire.com Sat Dec 9 10:01:01 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 20:01:01 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457AC7AF.5090202@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> Message-ID: <20061209180101.GL10000@sashak.voltaire.com> Hi Eitan, On 16:26 Sat 09 Dec , Eitan Zahavi wrote: > > Without another devel branch I will not be able to test patches before > the make it into the trunk. > > I do not know how to make an automatic mail extraction into patches into > tree such that I can have automatic patch check. You can just pipe emails with patches to git-am (manually after review or automatically via procmail), so this will be committed in the local tree/branch as you want. > I am not a great fan of a new branch too. > > So we need to agree that regression runs resulting with bug reporting > post commit to the trunk is our mode of work. It is ok for me. At least as start point, if we will have automatic nightly regression tests for the trunk it is just fine. If this will work, and after collecting some experience we may think about "quarantine" branch/tree and the regression testing expansion. > I do not have a big issue with this (but it is more work for Hal). Hal, what do you say? Sasha > > Eitan > > Sasha Khapyorsky wrote: > >On 18:42 Fri 08 Dec , Eitan Zahavi wrote: > > > >>Instead on relying on bug reading I use automatic regression. I wish we > >>could agree on some regression that > >>each developer will have to run before patches are committed to the trunk. 
> >>On my side I would love to have an automatic way to include all the > >>patches posted (one at a time) run "dead or alive" check > >>and provide feedback. Currently my automation is limited to testing the > >>trunk. So I will always be complaining after the patches are > >>committed. I think this is the way most other components testing works. > >> > >>What kind of regression suite do you and Sasha use? > >> > > > >On my side it clearly depends from kind of changes. In general I would > >call this "uni-testing". > > > > > >>Can we agree on minimal pre-commit testing? > >>Can we have a branch for that sake where all patches will first have to > >>go into for 2 days? (it will allow for pre-trunk testing). > >> > > > >One more development branch? Will you test (or even see) this? If so I > >can publish the "fresh" tree. > > > >Sasha > > > >_______________________________________________ > >openib-general mailing list > >openib-general at openib.org > >http://openib.org/mailman/listinfo/openib-general > > > >To unsubscribe, please visit > >http://openib.org/mailman/listinfo/openib-general > > > From sashak at voltaire.com Sat Dec 9 10:03:44 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 20:03:44 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457AC99E.8050402@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <1165617195.26559.4435.camel@hal.voltaire.com> <457AC99E.8050402@mellanox.co.il> Message-ID: <20061209180344.GM10000@sashak.voltaire.com> On 16:35 Sat 09 Dec , Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: > > > >> Hal Rosenstock wrote: > >> > >>> Hi Eitan, > >>> > >>> Just wanted to close the loop on the OpenSM issues of the last couple > >>> days. > >>> > >>> 1. When can you supply an OpenSM verbose log for the InformInfo > >>> subscribe problem you reported earlier today ? 
Failing that, I don't > >>> know how to reproduce this. > >>> > >>> > >> Attached > >> > I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure. > > >>> 4. I encourage you to look at and comment on the OpenSM patches rather > >>> than waiting for them to be in the tree. > >>> > >>> > >> I am sure you did not mean to, but now I have to admit my limited skills > >> in catching bugs by reading patches :-( . > >> > > > > Not just read, but they are there to try out as well. > > > I will need an automatic flow for that sake. I can not keep up with the > amount of patches manually. > But I do not know how to automatically convert the mails into patches > into a tree. As stated in other post with git it is simple - git-am applies mbox just fine. Sasha > > You could try out the patches and do the same thing before they are > > committed. > > > > > I have automation based on the committed tree that pull it (git trem) , > compile and run regression. > Actually this is how all other code is handled too. > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Sat Dec 9 10:11:37 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 20:11:37 +0200 Subject: [openib-general] userspace git conversion status/cut over In-Reply-To: <20061206082242.GI26787@mellanox.co.il> References: <1164897683.11808.129709.camel@hal.voltaire.com> <456F0AE3.4060209@ichips.intel.com> <20061130191717.GJ18978@sashak.voltaire.com> <20061206082242.GI26787@mellanox.co.il> Message-ID: <20061209181137.GO10000@sashak.voltaire.com> On 10:22 Wed 06 Dec , Michael S. Tsirkin wrote: > > Other issue. There is /pub/scm/linux-2.6.18/.git tree, looks it was used > > for git installation testing or so. > > > > Does somebody use it? 
Could this be (re)moved? > > No one seemed to care, and 2.6.19 is out anyway :) > Let's kill it then. Ok, moved this out. Sasha From halr at voltaire.com Sat Dec 9 10:20:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Dec 2006 13:20:18 -0500 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <20061209180101.GL10000@sashak.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> Message-ID: <1165688413.26559.55471.camel@hal.voltaire.com> On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote: > Hi Eitan, > > On 16:26 Sat 09 Dec , Eitan Zahavi wrote: > > > > Without another devel branch I will not be able to test patches before > > the make it into the trunk. > > > > I do not know how to make an automatic mail extraction into patches into > > tree such that I can have automatic patch check. > > You can just pipe emails with patches to git-am (manually after review > or automatically via procmail), so this will be committed in the local > tree/branch as you want. > > > I am not a great fan of a new branch too. > > > > So we need to agree that regression runs resulting with bug reporting > > post commit to the trunk is our mode of work. > > It is ok for me. At least as start point, if we will have automatic > nightly regression tests for the trunk it is just fine. If this will > work, and after collecting some experience we may think about > "quarantine" branch/tree and the regression testing expansion. > > > I do not have a big issue with this (but it is more work for Hal). > > Hal, what do you say? What is the nightly regression and who will run it ? 
It seems to me that the patches could be automated or a manual procedure can be put in place so I am not keen on maintaining a pre-trunk branch but would if I am convinced it can't be done easily by the methods I mentioned, that the regression would be run nightly on a continuing basis, and that reports would be issued based on the runs (to interested parties). -- Hal > Sasha > > > > > Eitan > > > > Sasha Khapyorsky wrote: > > >On 18:42 Fri 08 Dec , Eitan Zahavi wrote: > > > > > >>Instead on relying on bug reading I use automatic regression. I wish we > > >>could agree on some regression that > > >>each developer will have to run before patches are committed to the trunk. > > >>On my side I would love to have an automatic way to include all the > > >>patches posted (one at a time) run "dead or alive" check > > >>and provide feedback. Currently my automation is limited to testing the > > >>trunk. So I will always be complaining after the patches are > > >>committed. I think this is the way most other components testing works. > > >> > > >>What kind of regression suite do you and Sasha use? > > >> > > > > > >On my side it clearly depends from kind of changes. In general I would > > >call this "uni-testing". > > > > > > > > >>Can we agree on minimal pre-commit testing? > > >>Can we have a branch for that sake where all patches will first have to > > >>go into for 2 days? (it will allow for pre-trunk testing). > > >> > > > > > >One more development branch? Will you test (or even see) this? If so I > > >can publish the "fresh" tree. 
> > > > > >Sasha > > > > > >_______________________________________________ > > >openib-general mailing list > > >openib-general at openib.org > > >http://openib.org/mailman/listinfo/openib-general > > > > > >To unsubscribe, please visit > > >http://openib.org/mailman/listinfo/openib-general > > > > > From sashak at voltaire.com Sat Dec 9 11:11:48 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 21:11:48 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <1165688413.26559.55471.camel@hal.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> <1165688413.26559.55471.camel@hal.voltaire.com> Message-ID: <20061209191148.GP10000@sashak.voltaire.com> On 13:20 Sat 09 Dec , Hal Rosenstock wrote: > On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote: > > Hi Eitan, > > > > On 16:26 Sat 09 Dec , Eitan Zahavi wrote: > > > > > > Without another devel branch I will not be able to test patches before > > > the make it into the trunk. > > > > > > I do not know how to make an automatic mail extraction into patches into > > > tree such that I can have automatic patch check. > > > > You can just pipe emails with patches to git-am (manually after review > > or automatically via procmail), so this will be committed in the local > > tree/branch as you want. > > > > > I am not a great fan of a new branch too. > > > > > > So we need to agree that regression runs resulting with bug reporting > > > post commit to the trunk is our mode of work. > > > > It is ok for me. At least as start point, if we will have automatic > > nightly regression tests for the trunk it is just fine. If this will > > work, and after collecting some experience we may think about > > "quarantine" branch/tree and the regression testing expansion. 
> > > > > I do not have a big issue with this (but it is more work for Hal). > > > > Hal, what do you say? > > What is the nightly regression and who will run it ? Good question. I guess Eitan has automated regression test suite which is able to pull _committed_ tree and run test series. Eitan, right? > > It seems to me that the patches could be automated or a manual procedure > can be put in place so I am not keen on maintaining a pre-trunk branch > but would if I am convinced it can't be done easily by the methods I > mentioned, that the regression would be run nightly on a continuing > basis, and that reports would be issued based on the runs (to interested > parties). Ok. I think we could start testing with trunk if we still have the issue with pre-trunk patches. Systematic regression report would be good thing. All this should be good start, and if I understand correctly this can be launched immediately. Then we can deal with pre-trunk stuff. Eitan, how is it hard for you to prepare procmail's rule which will automatically apply the patches from emails to the local pre-trunk tree? Or do you think it is insufficient? Sasha > > -- Hal > > > Sasha > > > > > > > > Eitan > > > > > > Sasha Khapyorsky wrote: > > > >On 18:42 Fri 08 Dec , Eitan Zahavi wrote: > > > > > > > >>Instead on relying on bug reading I use automatic regression. I wish we > > > >>could agree on some regression that > > > >>each developer will have to run before patches are committed to the trunk. > > > >>On my side I would love to have an automatic way to include all the > > > >>patches posted (one at a time) run "dead or alive" check > > > >>and provide feedback. Currently my automation is limited to testing the > > > >>trunk. So I will always be complaining after the patches are > > > >>committed. I think this is the way most other components testing works. > > > >> > > > >>What kind of regression suite do you and Sasha use? 
> > > >> > > > > > > > >On my side it clearly depends from kind of changes. In general I would > > > >call this "uni-testing". > > > > > > > > > > > >>Can we agree on minimal pre-commit testing? > > > >>Can we have a branch for that sake where all patches will first have to > > > >>go into for 2 days? (it will allow for pre-trunk testing). > > > >> > > > > > > > >One more development branch? Will you test (or even see) this? If so I > > > >can publish the "fresh" tree. > > > > > > > >Sasha > > > > > > > >_______________________________________________ > > > >openib-general mailing list > > > >openib-general at openib.org > > > >http://openib.org/mailman/listinfo/openib-general > > > > > > > >To unsubscribe, please visit > > > >http://openib.org/mailman/listinfo/openib-general > > > > > > > > From mst at mellanox.co.il Sat Dec 9 11:34:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 9 Dec 2006 21:34:43 +0200 Subject: [openib-general] version #defines for the kernel In-Reply-To: References: <045401c71b02$d8d17a40$0281a8c0@ebpc> Message-ID: <20061209193443.GB6891@mellanox.co.il> > > How about an OpenFabrics API version #define? > > No other kernel subsystem has one, so I don't think it's realistic to > expect one for IB. include/net/ieee80211.h has one. It does not seem to work too well though. 
-- MST From eitan at mellanox.co.il Sat Dec 9 11:36:44 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 09 Dec 2006 21:36:44 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <20061209191148.GP10000@sashak.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> <1165688413.26559.55471.camel@hal.voltaire.com> <20061209191148.GP10000@sashak.voltaire.com> Message-ID: <457B104C.3090802@mellanox.co.il> Sasha Khapyorsky wrote: > On 13:20 Sat 09 Dec , Hal Rosenstock wrote: > >> On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote: >> >>> Hi Eitan, >>> >>> On 16:26 Sat 09 Dec , Eitan Zahavi wrote: >>> >>>> Without another devel branch I will not be able to test patches before >>>> the make it into the trunk. >>>> >>>> I do not know how to make an automatic mail extraction into patches into >>>> tree such that I can have automatic patch check. >>>> >>> You can just pipe emails with patches to git-am (manually after review >>> or automatically via procmail), so this will be committed in the local >>> tree/branch as you want. >>> >>> >>>> I am not a great fan of a new branch too. >>>> >>>> So we need to agree that regression runs resulting with bug reporting >>>> post commit to the trunk is our mode of work. >>>> >>> It is ok for me. At least as start point, if we will have automatic >>> nightly regression tests for the trunk it is just fine. If this will >>> work, and after collecting some experience we may think about >>> "quarantine" branch/tree and the regression testing expansion. >>> >>> >>>> I do not have a big issue with this (but it is more work for Hal). >>>> >>> Hal, what do you say? >>> >> What is the nightly regression and who will run it ? >> > > Good question. 
I guess Eitan has automated regression test suite which > is able to pull _committed_ tree and run test series. Eitan, right? > Yes that is what we have. Both simulated fabrics as well as the ULPs regression which uses OpenSM from the trunk (running a set of tests on smaller fabrics). > >> It seems to me that the patches could be automated or a manual procedure >> can be put in place so I am not keen on maintaining a pre-trunk branch >> but would if I am convinced it can't be done easily by the methods I >> mentioned, that the regression would be run nightly on a continuing >> basis, and that reports would be issued based on the runs (to interested >> parties). >> > > Ok. > > I think we could start testing with trunk if we still have the issue > with pre-trunk patches. Systematic regression report would be good > thing. All this should be good start, and if I understand correctly this > can be launched immediately. Then we can deal with pre-trunk stuff. > > Eitan, how is it hard for you to prepare procmail's rule which will > automatically apply the patches from emails to the local pre-trunk > tree? Or do you think it is insufficient? > I am not sure I can do the procmail thing myself. I am not familiar with it and lack the time to learn. I can ask around. But I question why we need to define a different testing method from the rest of the OFA tree? > Sasha > > >> -- Hal >> >> >>> Sasha >>> >>> >>>> Eitan >>>> >>>> Sasha Khapyorsky wrote: >>>> >>>>> On 18:42 Fri 08 Dec , Eitan Zahavi wrote: >>>>> >>>>> >>>>>> Instead on relying on bug reading I use automatic regression. I wish we >>>>>> could agree on some regression that >>>>>> each developer will have to run before patches are committed to the trunk. >>>>>> On my side I would love to have an automatic way to include all the >>>>>> patches posted (one at a time) run "dead or alive" check >>>>>> and provide feedback. Currently my automation is limited to testing the >>>>>> trunk. 
So I will always be complaining after the patches are >>>>>> committed. I think this is the way most other components testing works. >>>>>> >>>>>> What kind of regression suite do you and Sasha use? >>>>>> >>>>>> >>>>> On my side it clearly depends from kind of changes. In general I would >>>>> call this "uni-testing". >>>>> >>>>> >>>>> >>>>>> Can we agree on minimal pre-commit testing? >>>>>> Can we have a branch for that sake where all patches will first have to >>>>>> go into for 2 days? (it will allow for pre-trunk testing). >>>>>> >>>>>> >>>>> One more development branch? Will you test (or even see) this? If so I >>>>> can publish the "fresh" tree. >>>>> >>>>> Sasha >>>>> >>>>> _______________________________________________ >>>>> openib-general mailing list >>>>> openib-general at openib.org >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> >>>>> To unsubscribe, please visit >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> >>>>> > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Sat Dec 9 12:08:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 9 Dec 2006 22:08:37 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <20061209191148.GP10000@sashak.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> <1165688413.26559.55471.camel@hal.voltaire.com> <20061209191148.GP10000@sashak.voltaire.com> Message-ID: <20061209200837.GF6891@mellanox.co.il> > Eitan, how is it hard for you to prepare procmail's rule which will > automatically apply the patches from emails to the local pre-trunk > tree? 
Or do you think it is insufficient? This sounds like a fragile process. It seems much easier to just have an unstable branch with untested patches. No? -- MST From eitan at mellanox.co.il Sat Dec 9 12:09:57 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 09 Dec 2006 22:09:57 +0200 Subject: [openib-general] [PATCH] osm: trivial osm_log missmatch on vendor mlx Message-ID: <457B1815.7000404@mellanox.co.il> Hi Hal This patch fixes some osm_log issues on the mlx vendor. Signed-off-by: Eitan Zahavi --- osm/libvendor/osm_vendor_mlx_dispatcher.c | 3 ++- osm/libvendor/osm_vendor_mlx_txn.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/osm/libvendor/osm_vendor_mlx_dispatcher.c b/osm/libvendor/osm_vendor_mlx_dispatcher.c index e8b47dd..7e3bd78 100644 --- a/osm/libvendor/osm_vendor_mlx_dispatcher.c +++ b/osm/libvendor/osm_vendor_mlx_dispatcher.c @@ -134,7 +134,8 @@ osmv_dispatch_mad(IN osm_bind_handle_t { osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG, - "The bind handle %p is being closed. The MAD will not be dispatched.\n"); + "The bind handle %p is being closed. 
" + "The MAD will not be dispatched.\n", p_bo); ret = IB_INTERRUPTED; goto dispatch_mad_done; diff --git a/osm/libvendor/osm_vendor_mlx_txn.c b/osm/libvendor/osm_vendor_mlx_txn.c index 1fd262f..234e33b 100644 --- a/osm/libvendor/osm_vendor_mlx_txn.c +++ b/osm/libvendor/osm_vendor_mlx_txn.c @@ -631,7 +631,7 @@ __osmv_txn_timeout_cb(IN uint64_t key, osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG, "__osmv_txn_timeout_cb: " - "Retry request timout in : %u [msec].\n", + "Retry request timout in : %lu [msec].\n", next_timeout_ms); } } -- 1.4.4.1.GIT From sashak at voltaire.com Sat Dec 9 13:07:24 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 23:07:24 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <20061209200837.GF6891@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> <1165688413.26559.55471.camel@hal.voltaire.com> <20061209191148.GP10000@sashak.voltaire.com> <20061209200837.GF6891@mellanox.co.il> Message-ID: <20061209210724.GQ10000@sashak.voltaire.com> On 22:08 Sat 09 Dec , Michael S. Tsirkin wrote: > > Eitan, how is it hard for you to prepare procmail's rule which will > > automatically apply the patches from emails to the local pre-trunk > > tree? Or do you think it is insufficient? > > This sounds like a fragile process. It seems much easier to just > have an unstable branch with untested patches. No? I think it is almost equivalent in the way how this should be generated. The difference is that "unstable" branch should be published and will require some maintenance. 
Sasha From sashak at voltaire.com Sat Dec 9 13:44:44 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 9 Dec 2006 23:44:44 +0200 Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch] In-Reply-To: <20061206132643.GR26787@mellanox.co.il> References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il> <20061206132643.GR26787@mellanox.co.il> Message-ID: <20061209214444.GS10000@sashak.voltaire.com> On 15:26 Wed 06 Dec , Michael S. Tsirkin wrote: > > > > Actually switches that do not have any MCG entry will not be included > > in the dump file. > > > > Signed-off-by: Eitan Zahavi > > > > --- osm/opensm/osm_mcast_mgr.c 2006-12-06 12:39:13.018015000 +0200 > > +++ osm/opensm/osm_mcast_mgr.c 2006-12-06 12:06:29.602097000 +0200 > > All, to make integrating patches easier, > please try to actually use git diff to generate patches, Or just 'git-format-patch', which generates mbox ready for email submission. Sasha > and put patches in following format: > > Subject: [PATCH anytext] short log > > From: <> <-------- optional author line if not same as person posting > Short explanation for commit log. 
> > Signed-off-by: <> > > --- > > arbitrary long explanation > > patch > > > -- > MST From halr at voltaire.com Sat Dec 9 14:05:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Dec 2006 17:05:06 -0500 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457B104C.3090802@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> <1165688413.26559.55471.camel@hal.voltaire.com> <20061209191148.GP10000@sashak.voltaire.com> <457B104C.3090802@mellanox.co.il> Message-ID: <1165701888.26559.65048.camel@hal.voltaire.com> On Sat, 2006-12-09 at 14:36, Eitan Zahavi wrote: > Sasha Khapyorsky wrote: > > On 13:20 Sat 09 Dec , Hal Rosenstock wrote: > > > >> On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote: > >> > >>> Hi Eitan, > >>> > >>> On 16:26 Sat 09 Dec , Eitan Zahavi wrote: > >>> > >>>> Without another devel branch I will not be able to test patches before > >>>> the make it into the trunk. > >>>> > >>>> I do not know how to make an automatic mail extraction into patches into > >>>> tree such that I can have automatic patch check. > >>>> > >>> You can just pipe emails with patches to git-am (manually after review > >>> or automatically via procmail), so this will be committed in the local > >>> tree/branch as you want. > >>> > >>> > >>>> I am not a great fan of a new branch too. > >>>> > >>>> So we need to agree that regression runs resulting with bug reporting > >>>> post commit to the trunk is our mode of work. > >>>> > >>> It is ok for me.
At least as start point, if we will have automatic > >>> nightly regression tests for the trunk it is just fine. If this will > >>> work, and after collecting some experience we may think about > >>> "quarantine" branch/tree and the regression testing expansion. > >>> > >>> > >>>> I do not have a big issue with this (but it is more work for Hal). > >>>> > >>> Hal, what do you say? > >>> > >> What is the nightly regression and who will run it ? > >> > > > > Good question. I guess Eitan has automated regression test suite which > > is able to pull _committed_ tree and run test series. Eitan, right? > > > Yes that is what we have. > Both simulated fabrics as well as the ULPs regression which uses OpenSM > from the trunk (running a set of tests on smaller fabrics). > > > >> It seems to me that the patches could be automated or a manual procedure > >> can be put in place so I am not keen on maintaining a pre-trunk branch > >> but would if I am convinced it can't be done easily by the methods I > >> mentioned, that the regression would be run nightly on a continuing > >> basis, and that reports would be issued based on the runs (to interested > >> parties). > >> > > > > Ok. > > > > I think we could start testing with trunk if we still have the issue > > with pre-trunk patches. Systematic regression report would be good > > thing. All this should be good start, and if I understand correctly this > > can be launched immediately. Then we can deal with pre-trunk stuff. > > > > Eitan, how is it hard for you to prepare procmail's rule which will > > automatically apply the patches from emails to the local pre-trunk > > tree? Or do you think it is insufficient? > > > I am not sure I can do the procmail thing myself. I am not familiar with > it and lack the time to learn. > I can ask around. But I question why we need to define a different > testing method from the rest of the OFA tree? The request for an extra branch for this is different from the rest of the OFA tree. 
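For the mail-to-patch step asked about above, here is a minimal sketch (an illustration under stated assumptions, not something posted on this thread). It only picks out messages that look like git patches from mbox text, so that they could then be handed to git-am as Sasha suggested. The function names and the sample mailbox are hypothetical, and the "From " splitting is deliberately naive.

```python
import email
import re

# Hypothetical two-message mailbox, using the same "From <addr> <date>"
# separator style as this archive.
SAMPLE_MBOX = (
    "From a at example.org Sat Dec 9 12:09:57 2006\n"
    "From: a at example.org\n"
    "Subject: [PATCH] fix log\n"
    "\n"
    "Fixes a log call.\n"
    "\n"
    "diff --git a/x.c b/x.c\n"
    "--- a/x.c\n"
    "+++ b/x.c\n"
    "\n"
    "From b at example.org Sat Dec 9 13:07:24 2006\n"
    "From: b at example.org\n"
    "Subject: Re: process\n"
    "\n"
    "Just discussion.\n"
)


def split_mbox(text):
    """Naively split mbox text on lines starting with 'From ' (no escaping handled)."""
    return [part for part in re.split(r"(?m)^From .*\n", text) if part.strip()]


def patch_subjects(text):
    """Return the subjects of messages that look like git patches."""
    subjects = []
    for raw in split_mbox(text):
        msg = email.message_from_string(raw)
        subject = msg.get("Subject", "")
        body = msg.get_payload()
        if not isinstance(body, str):
            continue  # multipart mail is skipped in this sketch
        # A patch mail is assumed to carry "[PATCH" in the subject
        # and a "diff --git" line in the body.
        if "[PATCH" in subject and "\ndiff --git " in "\n" + body:
            subjects.append(subject)
    return subjects


print(patch_subjects(SAMPLE_MBOX))
```

Each selected message could then be piped to git-am in a local pre-trunk tree. Handling a patch that fails to apply is left out of this sketch.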
-- Hal > > Sasha > > > > > >> -- Hal > >> > >> > >>> Sasha > >>> > >>> > >>>> Eitan > >>>> > >>>> Sasha Khapyorsky wrote: > >>>> > >>>>> On 18:42 Fri 08 Dec , Eitan Zahavi wrote: > >>>>> > >>>>> > >>>>>> Instead on relying on bug reading I use automatic regression. I wish we > >>>>>> could agree on some regression that > >>>>>> each developer will have to run before patches are committed to the trunk. > >>>>>> On my side I would love to have an automatic way to include all the > >>>>>> patches posted (one at a time) run "dead or alive" check > >>>>>> and provide feedback. Currently my automation is limited to testing the > >>>>>> trunk. So I will always be complaining after the patches are > >>>>>> committed. I think this is the way most other components testing works. > >>>>>> > >>>>>> What kind of regression suite do you and Sasha use? > >>>>>> > >>>>>> > >>>>> On my side it clearly depends from kind of changes. In general I would > >>>>> call this "uni-testing". > >>>>> > >>>>> > >>>>> > >>>>>> Can we agree on minimal pre-commit testing? > >>>>>> Can we have a branch for that sake where all patches will first have to > >>>>>> go into for 2 days? (it will allow for pre-trunk testing). > >>>>>> > >>>>>> > >>>>> One more development branch? Will you test (or even see) this? If so I > >>>>> can publish the "fresh" tree. 
> >>>>> > >>>>> Sasha > >>>>> > > > > From halr at voltaire.com Sat Dec 9 14:08:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 09 Dec 2006 17:08:54 -0500 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <20061209200837.GF6891@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <20061208221001.GG9193@sashak.voltaire.com> <457AC7AF.5090202@mellanox.co.il> <20061209180101.GL10000@sashak.voltaire.com> <1165688413.26559.55471.camel@hal.voltaire.com> <20061209191148.GP10000@sashak.voltaire.com> <20061209200837.GF6891@mellanox.co.il> Message-ID: <1165701912.26559.65050.camel@hal.voltaire.com> On Sat, 2006-12-09 at 15:08, Michael S. Tsirkin wrote: > > Eitan, how is it hard for you to prepare procmail's rule which will > > automatically apply the patches from emails to the local pre-trunk > > tree? Or do you think it is insufficient? > > This sounds like a fragile process. It seems much easier to just > have an unstable branch with untested patches. No? Untested is an overexaggeration. They are tested but not by Eitan's regression. -- Hal From mst at mellanox.co.il Sat Dec 9 22:43:46 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Sun, 10 Dec 2006 08:43:46 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <1165701912.26559.65050.camel@hal.voltaire.com> References: <1165701912.26559.65050.camel@hal.voltaire.com> Message-ID: <20061210064346.GC10403@mellanox.co.il> > > > Eitan, how is it hard for you to prepare procmail's rule which will > > > automatically apply the patches from emails to the local pre-trunk > > > tree? Or do you think it is insufficient? > > > > This sounds like a fragile process. It seems much easier to just > > have an unstable branch with untested patches. No? > > Untested is an overexaggeration. They are tested but not by Eitan's > regression. Sorry, I'm not trying to influence any policy decisions here; I'm coming at this purely from a git angle. *If* you want Eitan to test and Ack some patches, *and want to automate the testing part*, the simplest thing to do is to apply them on some git branch. -- MST From ogerlitz at voltaire.com Sat Dec 9 23:42:34 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 09:42:34 +0200 Subject: [openib-general] Assigning IP addresses to IB interfaces In-Reply-To: References: Message-ID: <457BBA6A.3020209@voltaire.com> Adit Ranadive wrote: > I have installed the OpenIB gen2 driver but the IB interfaces haven't > been assigned any IP addresses.. > Is it possible to assign them IP addresses using ifconfig and ping > between the interfaces of two machines? Yes. From ogerlitz at voltaire.com Sat Dec 9 23:50:37 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 09:50:37 +0200 Subject: [openib-general] version #defines for the kernel In-Reply-To: <045401c71b02$d8d17a40$0281a8c0@ebpc> References: <045401c71b02$d8d17a40$0281a8c0@ebpc> Message-ID: <457BBC4D.6050704@voltaire.com> Eric Barton wrote: >> > Actually a single OFED version #define would most probably >> > suit my purposes - >> > is that controversial?
>> >> It might be sensible for OFED to supply that, if it's going to >> backport drivers to old kernels. But you should also cope with >> non-OFED (vanilla upstream) drivers, probably by testing >> LINUX_VERSION_CODE too I suppose. > > How about an OpenFabrics API version #define? The IB drivers provided by OFED are based on the mainline kernel ones. Moreover, OFED's existence is temporary: over time, distros will pick up the IB code themselves from releases of the Linux kernel and of the user space packages (libraries). There is no point in adding a versioning system. Or. From ogerlitz at voltaire.com Sun Dec 10 00:49:51 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 10:49:51 +0200 Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support for operation over IPoIB In-Reply-To: References: Message-ID: <457BCA2F.1010709@voltaire.com> Or Gerlitz wrote: > This patch series is a second version (see below link to V1) of the suggested > changes to the bonding driver such that it would be able to support non ARPHRD_ETHER > netdevices for its High-Availability (active-backup) mode. > > The motivation is to enable the bonding driver on its HA mode to work with the > IP over Infiniband (IPoIB) driver. With these patches I was able to enslave > IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with > fail-over and fail-back working fine. My working env was the net-2.6.20 git. > > Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver which > is used by native IB ULPs whose addressing scheme is based on IP (eg iSER, SDP, > Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for > these ULPs. This holds as when the ULP is informed by the IB HW on the failure > of the current IB connection, it just needs to reconnect, where the bonding > device will now issue the IB ARP over the active IPoIB slave.
Given the importance of and great need for HA, I would really like to get feedback from people testing configurations with bonded IPoIB devices before moving forward with this. Or. From ogerlitz at voltaire.com Sun Dec 10 01:08:38 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 11:08:38 +0200 Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace support In-Reply-To: References: Message-ID: <457BCE96.3070708@voltaire.com> Roland Dreier wrote: > > + 5/5 is the CMA user space support. I only did a light review of it > > but my understanding is that Sean used the in kernel ib_ucm > > design/code as the base line for this driver so there should be no > > special issues here. > > OK, I'll have to take a close look at this. ucm has known-broken > object lifetime handling (probably oopsable from userspace) Where do we stand with this patch series? I understand the 2.6.20 RC1 feature window is about to be closed in a few days. Or. From ogerlitz at voltaire.com Sun Dec 10 01:21:17 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 11:21:17 +0200 Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <1165517253.14800.283.camel@brick.pathscale.com> References: <1165517253.14800.283.camel@brick.pathscale.com> Message-ID: <457BD18D.7000403@voltaire.com> Ralph Campbell wrote: > This version of the patch fixes ipath_sg_dma_address() and > updates the comments for ipath_dma.c as Or Gerlitz > suggested. > This patch implements the interposing DMA mapping functions to allow > support for IOMMUs and remove the dependence on phys_to_virt() and > bus_to_virt().
Ralph, The patch seems ready, modulo the resolution of whether you implement the addresses returned by the ipath ib_dma_map_xxx code as keys into a SW IOTLB (which means you return dma_addr_t and not u64, but assign it ipath semantics) or choose a different path to follow (i.e. assume the problem exists only under the 32-bit / high-mem config, which ipath does not support, do nothing, etc.) - whatever you settle on with Roland. Or. From ogerlitz at voltaire.com Sun Dec 10 02:19:01 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 12:19:01 +0200 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> Message-ID: <457BDF15.6090608@voltaire.com> Sean Hefty wrote: > Updates the librdmacm to work with ABI version 3, which is the proposed > kernel changes for inclusion in 2.6.20. Sean, rdma_leave_multicast does not return zero on success but rather 24, which is the length of the leave mcast msg. The patch is made over your patch; can you please queue this somewhere so it will not be forgotten? Or. > --- librdmacm/src/cma.c 2006-12-10 12:55:03.000000000 +0200 > +++ librdmacm-multicast/src/cma.c 2006-12-10 13:15:12.000000000 +0200 > @@ -1015,6 +1015,8 @@ int rdma_leave_multicast(struct rdma_cm_ > ret = write(id->channel->fd, msg, size); > if (ret != size) > ret = (ret > 0) ?
-ENODATA : ret; > + else > + ret = 0; > > pthread_mutex_lock(&id_priv->mut); > while (mc->events_completed < resp->events_reported) From ogerlitz at voltaire.com Sun Dec 10 03:20:26 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 10 Dec 2006 13:20:26 +0200 (IST) Subject: [openib-general] SLES10 /sbin/ip shows wrong multicast addresses for IPoIB devices Message-ID: I see now that /sbin/ip that comes with SLES10 iproute2-2.6.15-14.4 shows wrong multicast hardware (L2) addresses for IPoIB devices, where on SLES9 it works just fine (and the package is iproute2-2.4.7-866.8) With strac-ing it, i can see that utility uses /proc/net/dev_mcast as the source for the hw mcast addresses, where these are reported fine, but then the lower 32 bits are somehow chopped and replaced by zeros, see below. Or. sage:~ # /sbin/ip a s ib1 6: ib1: mtu 1500 qdisc pfifo_fast qlen 128 link/infiniband 00:00:04:05:fe:80:00:00:00:00:00:00:00:02:c9:02:00:20:13:f2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff inet 192.168.10.153/24 brd 192.168.10.255 scope global ib1 inet6 fe80::202:c902:20:13f2/64 scope link valid_lft forever preferred_lft forever sage:~ # /sbin/ip m s ib1 6: ib1 link 00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:00 link 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:00:00:00:00 link 00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:00:00:00:00 link 00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:00 inet 224.5.5.5 inet 224.0.0.1 inet6 ff02::1:ff20:13f2 inet6 ff02::1 sage:~ # strace /sbin/ip m s ib1 2>&1 | grep open open("/etc/ld.so.cache", O_RDONLY) = 3 open("/lib64/libresolv.so.2", O_RDONLY) = 3 open("/lib64/libc.so.6", O_RDONLY) = 3 open("/proc/net/dev_mcast", O_RDONLY) = 4 open("/proc/net/igmp", O_RDONLY) = 4 open("/proc/net/igmp6", O_RDONLY) = 4 sage:~ # cat /proc/net/dev_mcast | grep ib1 6 ib1 1 0 00ffffffff12401b000000000000000000050505 6 ib1 1 0 00ffffffff12601b0000000000000001ff2013f2 6 ib1 1 0 
00ffffffff12601b000000000000000000000001 6 ib1 1 0 00ffffffff12401b000000000000000000000001 From monis at voltaire.com Sun Dec 10 03:40:17 2006 From: monis at voltaire.com (Moni Shoua) Date: Sun, 10 Dec 2006 13:40:17 +0200 Subject: [openib-general] [PATCH v3] IB_mthca HCA profile module parameters Message-ID: <457BF221.8080701@voltaire.com> Hi, This patch was sent a while ago and I'd like to repost it now. thanks MoniS From: Leonid Arsh Adds module parameters that enable settting some of the HCA profile values Signed-off-by: Leonid Arsh Signed-off-by: Moni Shoua --- mthca_main.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 104 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 47ea021..deb0289 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -82,21 +82,110 @@ MODULE_PARM_DESC(tune_pci, "increase PCI struct mutex mthca_device_mutex; +#define MTHCA_DEFAULT_NUM_QP (1 << 16) +#define MTHCA_DEFAULT_RDB_PER_QP (1 << 2) +#define MTHCA_DEFAULT_NUM_CQ (1 << 16) +#define MTHCA_DEFAULT_NUM_MCG (1 << 13) +#define MTHCA_DEFAULT_NUM_MPT (1 << 17) +#define MTHCA_DEFAULT_NUM_MTT (1 << 20) +#define MTHCA_DEFAULT_NUM_UDAV (1 << 15) +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18) +#define MTHCA_DEFAULT_NUM_UARC_SIZE (1 << 18) + +static struct mthca_profile default_profile = { + .num_qp = MTHCA_DEFAULT_NUM_QP, + .rdb_per_qp = MTHCA_DEFAULT_RDB_PER_QP, + .num_cq = MTHCA_DEFAULT_NUM_CQ, + .num_mcg = MTHCA_DEFAULT_NUM_MCG, + .num_mpt = MTHCA_DEFAULT_NUM_MPT, + .num_mtt = MTHCA_DEFAULT_NUM_MTT, + .num_udav = MTHCA_DEFAULT_NUM_UDAV, /* Tavor only */ + .fmr_reserved_mtts = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */ + .uarc_size = MTHCA_DEFAULT_NUM_UARC_SIZE, /* Arbel only */ +}; + +module_param_named(num_qp, default_profile.num_qp, int, 0444); +MODULE_PARM_DESC(num_qp, "maximum number of available QPs per 
HCA"); + +module_param_named(rdb_per_qp, default_profile.rdb_per_qp, int, 0444); +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); + +module_param_named(num_cq, default_profile.num_cq, int, 0444); +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); + +module_param_named(num_mcg, default_profile.num_mcg, int, 0444); +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); + +module_param_named(num_mpt, default_profile.num_mpt, int, 0444); +MODULE_PARM_DESC(num_mpt, + "maximum number of memory protection table entries per HCA"); + +module_param_named(num_mtt, default_profile.num_mtt, int, 0444); +MODULE_PARM_DESC(num_mtt, + "maximum number of memory translation table segments per HCA"); +/* Tavor only */ +module_param_named(num_udav, default_profile.num_udav, int, 0444); +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA"); + +/* Tavor only */ +module_param_named(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, int, 0444); +MODULE_PARM_DESC(fmr_reserved_mtts, + "number of memory translation table segments reserved for FMR"); + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; -static struct mthca_profile default_profile = { - .num_qp = 1 << 16, - .rdb_per_qp = 4, - .num_cq = 1 << 16, - .num_mcg = 1 << 13, - .num_mpt = 1 << 17, - .num_mtt = 1 << 20, - .num_udav = 1 << 15, /* Tavor only */ - .fmr_reserved_mtts = 1 << 18, /* Tavor only */ - .uarc_size = 1 << 18, /* Arbel only */ -}; + +static int __devinit mthca_check_profile_value(int* pval, int pval_default){ + /* value must be positive and power of 2 */ + int old_pval = *pval; + + if (old_pval <= 0) + *pval = pval_default; + else + *pval = roundup_pow_of_two(old_pval); + + return old_pval-*pval; +} + +#define mthca_check_profile_and_warn(name, var, defval) \ + if (mthca_check_profile_value(&var, defval)) \ + mthca_warn(mdev, "invalid %s passed.
changed to %d.\n", #name, var); + +static int __devinit mthca_validate_profile(struct mthca_dev *mdev, + struct mthca_profile *profile) +{ + + mthca_check_profile_and_warn(num_qp, default_profile.num_qp, + MTHCA_DEFAULT_NUM_QP); + mthca_check_profile_and_warn(rdb_per_qp, default_profile.rdb_per_qp, + MTHCA_DEFAULT_RDB_PER_QP); + mthca_check_profile_and_warn(num_cq, default_profile.num_cq, + MTHCA_DEFAULT_NUM_CQ); + mthca_check_profile_and_warn(num_mcg, default_profile.num_mcg, + MTHCA_DEFAULT_NUM_MCG); + mthca_check_profile_and_warn(num_mpt, default_profile.num_mpt, + MTHCA_DEFAULT_NUM_MPT); + mthca_check_profile_and_warn(num_mtt, default_profile.num_mtt, + MTHCA_DEFAULT_NUM_MTT); + + if (!mthca_is_memfree(mdev)) { + mthca_check_profile_and_warn(num_udav, default_profile.num_udav, + MTHCA_DEFAULT_NUM_UDAV); + mthca_check_profile_and_warn(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, + MTHCA_DEFAULT_NUM_RESERVED_MTTS); + + if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) { + mthca_err(mdev, "Invalid fmr_reserved_mtts parameter " + "value (%d). Must be lower than num_mtt (%d)\n", + default_profile.fmr_reserved_mtts, + default_profile.num_mtt ); + return -EINVAL; + } + } + return 0; +} static int __devinit mthca_tune_pci(struct mthca_dev *mdev) { @@ -1084,6 +1173,10 @@ static int __mthca_init_one(struct pci_d if (err) goto err_cmd; + err = mthca_validate_profile(mdev, &default_profile); + if (err) + goto err_cmd; + err = mthca_init_hca(mdev); if (err) goto err_cmd; From mst at mellanox.co.il Sun Dec 10 05:41:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 10 Dec 2006 15:41:37 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061205161944.GD30209@mellanox.co.il> References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> Message-ID: <20061210134137.GL29174@mellanox.co.il> The following patch adds experimental support for IPoIB connected mode.
The idea is to increase performance by increasing the MTU from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD. With this code, I'm able to get 800MByte/sec or more with netperf without options on a Mellanox 4x back-to-back DDR system. Signed-off-by: Michael S. Tsirkin --- Changes from the previous revision: - Use scatter on RX side instead of allocating a linear 64K skb - User now must explicitly enable connected mode through sysfs for each interface (I looked at using ethtool, and didn't find an appropriate option for that). A warning is printed when it's enabled. - Print a warning about multicast breakage when MTU > 2044 - Move more code to within #ifdef CONFIG_INFINIBAND_IPOIB_CM to avoid affecting code footprint when disabled at compile time Please review, and consider for merging. I labeled CM support as experimental, and set it to disabled by default, although it's been very stable for me, mostly because there are still some things to be addressed before it's as usable as IPoIB UD. I am very interested in getting this code in shape for merging as early as possible, as opposed to maintaining it out of tree until it's fully mature, and I tried to split the CM code into a separate file to make this feasible. Let me know whether this was a good idea, or whether more needs to be done in this direction. Note that the connected mode support adds very little overhead when not activated at run time, and zero data-path overhead when not activated at compile time. Here's a short description of what the patch does: a. The code's here: git://staging.openfabrics.org/~mst/linux-2.6/.git ipoib_cm_branch This is based on 2.6.19, so ~>git diff v2.6.19..ipoib_cm_branch will show what I have done so far. b.
How to activate: Server: #modprobe ib_ipoib #echo connected > /sys/class/net/ib0/mode #/sbin/ifconfig ib0 mtu 65520 #./netperf-2.4.2/src/netserver Client: #modprobe ib_ipoib #echo connected > /sys/class/net/ib0/mode #/sbin/ifconfig ib0 mtu 65520 #./netperf-2.4.2/src/netperf -H 11.4.3.68 -f M TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. MBytes/sec 87380 16384 16384 10.01 891.21 c. TODO list 1. Use timer to clean up stale RX connections 2. (Optional) Send side S/G support 3. (Optional) Make CM use same CQ IPoIB uses for UD d. Limitations With MTU > 2044, UDP multicast and UDP connections to IPoIB UD mode currently don't work since we get packets that are too large to send over a UD QP. As a work around, one can now create separate interfaces for use with CM and UD mode. e. Some notes on code 1. SRQ is used for scalability to large cluster sizes 2. Only RC connections are used (UC does not support SRQ now) 3. Retry count is set to 0 since spec draft warns against retries 4. Each connection is used for data transfers in only 1 direction, so each connection is either active(TX) or passive (RX). 2 sides that want to communicate create 2 connections. 5. Each active (TX) connection has a separate CQ for send completions - this keeps the code simple without CQ resize and other tricks I'm looking at ways to limit the path mtu for these connections, to make it work. 
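For readers following the numbers: a minimal sketch, in plain userspace C, of the receive buffer arithmetic implied by the 65520 MTU used in the commands above and by the IPOIB_CM_MTU / IPOIB_ENCAP_LEN constants in the patch. The 4096-byte page size used in the note afterwards and all function names here are illustrative assumptions, not part of the patch.

```c
#include <assert.h>

/* Constants mirroring the patch; the SKETCH_ names are hypothetical. */
#define SKETCH_ENCAP_LEN 4u                 /* IPOIB_ENCAP_LEN: IPoIB header */
#define SKETCH_CM_MTU    (0x10000u - 0x10u) /* IPOIB_CM_MTU: 65520 */
#define SKETCH_UD_IP_MTU 2044u              /* UD limit quoted in the text */

/* Bytes needed to receive one maximum-size connected-mode packet. */
static inline unsigned cm_buf_size(void)
{
	return SKETCH_CM_MTU + SKETCH_ENCAP_LEN;
}

/* Size of the linear head buffer: the part that does not fill a page. */
static inline unsigned cm_head_size(unsigned page_size)
{
	return cm_buf_size() % page_size;
}

/* Scatter entries per receive: one head entry plus the full pages. */
static inline unsigned cm_rx_sg(unsigned page_size)
{
	return (cm_buf_size() + page_size - 1) / page_size;
}

/*
 * Total bytes covered by head + page fragments. The check assumes the
 * buffer size is not a multiple of the page size, which holds for any
 * power-of-two page size of 8 bytes or more.
 */
static inline unsigned cm_rx_total(unsigned page_size)
{
	unsigned total = cm_head_size(page_size) +
			 (cm_rx_sg(page_size) - 1) * page_size;

	assert(total == cm_buf_size());
	return total;
}

/* Nonzero if an IP packet of this length still fits a UD mode QP. */
static inline int fits_ud(unsigned ip_len)
{
	return ip_len <= SKETCH_UD_IP_MTU;
}
```

Assuming 4 KiB pages, this works out to a 4084-byte head buffer plus 15 full pages, i.e. 16 scatter entries per posted receive, while anything above 2044 bytes no longer fits a UD QP, which is the multicast limitation described in (d).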
diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig index c75322d..083c729 100644 --- a/drivers/infiniband/ulp/ipoib/Kconfig +++ b/drivers/infiniband/ulp/ipoib/Kconfig @@ -8,6 +8,20 @@ config INFINIBAND_IPOIB See Documentation/infiniband/ipoib.txt for more information +config INFINIBAND_IPOIB_CM + bool "IP-over-InfiniBand Connected Mode support" + depends on INFINIBAND_IPOIB && EXPERIMENTAL + default n + ---help--- + This option enables experimental support for IPoIB connected mode. + After enabling this option, you need to switch to connected mode through + /sys/class/net/ibXXX/mode to actually create connections, and then increase + the interface MTU with e.g. ifconfig ib0 mtu 65520. + + WARNING: Enabling connected mode currently breaks multicast and UD mode + connectivity from this interface unless you limit mtu + for these destinations to 2044. + config INFINIBAND_IPOIB_DEBUG bool "IP-over-InfiniBand debugging" if EMBEDDED depends on INFINIBAND_IPOIB diff --git a/drivers/infiniband/ulp/ipoib/Makefile b/drivers/infiniband/ulp/ipoib/Makefile index 8935e74..98ee38e 100644 --- a/drivers/infiniband/ulp/ipoib/Makefile +++ b/drivers/infiniband/ulp/ipoib/Makefile @@ -5,5 +5,6 @@ ib_ipoib-y := ipoib_main.o \ ipoib_multicast.o \ ipoib_verbs.o \ ipoib_vlan.o +ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM) += ipoib_cm.o ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG) += ipoib_fs.o diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 0b8a79d..e410d2b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -62,6 +62,10 @@ enum { IPOIB_ENCAP_LEN = 4, + IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */ + IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN, + IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE, + IPOIB_CM_RX_SG = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE, IPOIB_RX_RING_SIZE = 128, IPOIB_TX_RING_SIZE = 64, IPOIB_MAX_QUEUE_SIZE = 8192, @@ 
-81,6 +85,8 @@ enum { IPOIB_MCAST_RUN = 6, IPOIB_STOP_REAPER = 7, IPOIB_MCAST_STARTED = 8, + IPOIB_FLAG_NETIF_STOPPED = 9, + IPOIB_FLAG_ADMIN_CM = 10, IPOIB_MAX_BACKOFF_SECONDS = 16, @@ -113,6 +119,58 @@ struct ipoib_tx_buf { DECLARE_PCI_UNMAP_ADDR(mapping) }; +#ifdef CONFIG_INFINIBAND_IPOIB_CM +struct ib_cm_id; + +struct ipoib_cm_data { + __be32 qpn; /* High byte MUST be ignored on receive */ + __be32 mtu; +}; + +struct ipoib_cm_rx { + struct ib_cm_id *id; + struct ib_qp *qp; + struct list_head list; + struct net_device *dev; +}; + +struct ipoib_cm_tx { + struct ib_cm_id *id; + struct ib_cq *cq; + struct ib_qp *qp; + struct list_head list; + struct net_device *dev; + struct ipoib_neigh *neigh; + struct ipoib_path *path; + struct ipoib_tx_buf *tx_ring; + unsigned tx_head; + unsigned tx_tail; + unsigned long flags; + u32 mtu; + struct ib_wc ibwc[IPOIB_NUM_WC]; +}; + +struct ipoib_cm_rx_buf { + struct sk_buff *skb; + dma_addr_t mapping[IPOIB_CM_RX_SG]; +}; + +struct ipoib_cm_dev_priv { + struct ib_cq *cq; + struct ib_srq *srq; + struct ipoib_cm_rx_buf *srq_ring; + struct ib_cm_id *id; + struct list_head passive_ids; + struct work_struct start_task; + struct work_struct reap_task; + struct list_head start_list; + struct list_head reap_list; + struct ib_wc ibwc[IPOIB_NUM_WC]; + struct ib_sge rx_sge[IPOIB_CM_RX_SG]; + struct ib_recv_wr rx_wr; +}; + +#endif /* * Device private locking: tx_lock protects members used in TX fast * path (and we use LLTX so upper layers don't do extra locking). 
@@ -179,6 +237,10 @@ struct ipoib_dev_priv { struct list_head child_intfs; struct list_head list; +#ifdef CONFIG_INFINIBAND_IPOIB_CM + struct ipoib_cm_dev_priv cm; +#endif + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG struct list_head fs_list; struct dentry *mcg_dentry; @@ -212,6 +274,9 @@ struct ipoib_path { struct ipoib_neigh { struct ipoib_ah *ah; +#ifdef CONFIG_INFINIBAND_IPOIB_CM + struct ipoib_cm_tx *cm; +#endif union ib_gid dgid; struct sk_buff_head queue; @@ -315,6 +380,131 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); void ipoib_pkey_poll(void *dev); int ipoib_pkey_dev_delay_open(struct net_device *dev); +#ifdef CONFIG_INFINIBAND_IPOIB_CM + +#define IPOIB_FLAGS_RC 0x80 +#define IPOIB_FLAGS_UC 0x40 + +#define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC | IPOIB_FLAGS_UC)) + +static inline int ipoib_cm_admin_enabled(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + return IPOIB_CM_SUPPORTED(dev->dev_addr) && + test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); +} + +static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + return IPOIB_CM_SUPPORTED(n->ha) && + test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); +} + +static inline int ipoib_cm_up(struct ipoib_neigh *neigh) + +{ + return test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags); +} + +static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh) +{ + return neigh->cm; +} + +static inline void ipoib_cm_set(struct ipoib_neigh *neigh, struct ipoib_cm_tx *tx) +{ + neigh->cm = tx; +} + +void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx); +int ipoib_cm_dev_open(struct net_device *dev); +void ipoib_cm_dev_stop(struct net_device *dev); +int ipoib_cm_dev_init(struct net_device *dev); +int ipoib_cm_add_mode_attr(struct net_device *dev); +void ipoib_cm_dev_cleanup(struct net_device *dev); +struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, 
struct ipoib_path *path, + struct ipoib_neigh *neigh); +void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx); +#else + +struct ipoib_cm_tx; + +static inline int ipoib_cm_admin_enabled(struct net_device *dev) +{ + return 0; +} +static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n) + +{ + return 0; +} + +static inline int ipoib_cm_up(struct ipoib_neigh *neigh) + +{ + return 0; +} + +static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh) +{ + return NULL; +} + +static inline void ipoib_cm_set(struct ipoib_neigh *neigh, struct ipoib_cm_tx *tx) +{ +} + +static inline +void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) +{ + return; +} + +static inline +int ipoib_cm_dev_open(struct net_device *dev) +{ + return 0; +} + +static inline +void ipoib_cm_dev_stop(struct net_device *dev) +{ + return; +} + +static inline +int ipoib_cm_dev_init(struct net_device *dev) +{ + return 0; +} + +static inline +void ipoib_cm_dev_cleanup(struct net_device *dev) +{ + return; +} + +static inline +struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path, + struct ipoib_neigh *neigh) +{ + return NULL; +} + +static inline +void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) +{ + return; +} + +static inline +int ipoib_cm_add_mode_attr(struct net_device *dev) +{ + return 0; +} +#endif + #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG void ipoib_create_debug_files(struct net_device *dev); void ipoib_delete_debug_files(struct net_device *dev); @@ -392,4 +582,6 @@ extern int ipoib_debug_level; #define IPOIB_GID_ARG(gid) IPOIB_GID_RAW_ARG((gid).raw) +#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) + #endif /* _IPOIB_H */ diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c new file mode 100644 index 0000000..52dcc10 --- /dev/null +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -0,0 +1,1153 @@ +/* + * Copyright (c) 2006 Mellanox Technologies. 
All rights reserved + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +#include +#include + +#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA +static int data_debug_level; + +module_param_named(cm_data_debug_level, data_debug_level, int, 0644); +MODULE_PARM_DESC(cm_data_debug_level, + "Enable data path debug tracing for connected mode if > 0"); +#endif + +#include "ipoib.h" + +#define IPOIB_CM_IETF_ID 0x1000000000000000ULL + +#define IPOIB_OP_SRQ (1ul << 30) + +struct ipoib_cm_id { + struct ib_cm_id *id; + int flags; + u32 remote_qpn; + u32 remote_mtu; +}; + +int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, + dma_addr_t mapping[IPOIB_CM_RX_SG]) +{ + int i; + + dma_unmap_single(priv->ca->dma_device, mapping[0], + IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE); + + for (i = 0; i < IPOIB_CM_RX_SG - 1; ++i) { + dma_unmap_single(priv->ca->dma_device, mapping[i + 1], + PAGE_SIZE, DMA_FROM_DEVICE); + } +} + +static int ipoib_cm_post_receive(struct net_device *dev, int id) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_recv_wr *bad_wr; + int i, ret; + + priv->cm.rx_wr.wr_id = id | IPOIB_OP_SRQ; + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i]; + + ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr); + if (unlikely(ret)) { + ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret); + ipoib_cm_dma_unmap_rx(priv, priv->cm.srq_ring[id].mapping); + dev_kfree_skb_any(priv->cm.srq_ring[id].skb); + priv->cm.srq_ring[id].skb = NULL; + } + + return ret; +} + +static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, + dma_addr_t mapping[IPOIB_CM_RX_SG]) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; + int i; + + skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12); + if (unlikely(!skb)) + return -ENOMEM; + + /* + * IPoIB adds a 4 byte header. So we need 12 more bytes to align the + * IP header to a multiple of 16. 
+ */ + skb_reserve(skb, 12); + + mapping[0] = dma_map_single(priv->ca->dma_device, skb->data, IPOIB_CM_HEAD_SIZE, + DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(mapping[0]))) { + dev_kfree_skb_any(skb); + return -EIO; + } + + for (i = 0; i < IPOIB_CM_RX_SG - 1; i++) { + struct page *page = alloc_page(GFP_ATOMIC); + + if (!page) + goto partial_error; + skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE); + + mapping[i + 1] = dma_map_page(priv->ca->dma_device, + skb_shinfo(skb)->frags[i].page, + 0, PAGE_SIZE, DMA_FROM_DEVICE); + if (unlikely(dma_mapping_error(mapping[i + 1]))) + goto partial_error; + } + + priv->cm.srq_ring[id].skb = skb; + return 0; + +partial_error: + + dma_unmap_single(priv->ca->dma_device, + mapping[0], + IPOIB_CM_HEAD_SIZE, + DMA_FROM_DEVICE); + + for (; i > 0; --i) { + dma_unmap_single(priv->ca->dma_device, + mapping[i], + PAGE_SIZE, + DMA_FROM_DEVICE); + } + kfree_skb(skb); + return -ENOMEM; +} + +static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr attr = { + .send_cq = priv->cm.cq, /* does not matter, we never send anything */ + .recv_cq = priv->cm.cq, + .srq = priv->cm.srq, + .cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */ + .cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */ + .sq_sig_type = IB_SIGNAL_ALL_WR, + .qp_type = IB_QPT_RC, + }; + return ib_create_qp(priv->pd, &attr); +} + +static int ipoib_cm_modify_rx_rts(struct net_device *dev, + struct ib_cm_id *cm_id, struct ib_qp *qp) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + + qp_attr.qp_state = IB_QPS_INIT; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for INIT: %d\n", ret); + return ret; + } + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to INIT: %d\n", ret); + return ret; + } + 
qp_attr.qp_state = IB_QPS_RTR; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret); + return ret; + } + qp_attr.rq_psn = 0 /* FIXME */; + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); + return ret; + } + return 0; +} + +static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id, + struct ib_qp *qp, struct ib_cm_req_event_param *req) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_data data = {}; + struct ib_cm_rep_param rep = {}; + + data.qpn = cpu_to_be32(priv->qp->qp_num); + data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); + + rep.private_data = &data; + rep.private_data_len = sizeof data; + rep.flow_control = 0; + rep.rnr_retry_count = req->rnr_retry_count; + rep.target_ack_delay = 20; /* FIXME */ + rep.srq = 1; + rep.qp_num = qp->qp_num; + rep.starting_psn = 0 /* FIXME */; + return ib_send_cm_rep(cm_id, &rep); +} + +static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct net_device *dev = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_rx *p; + unsigned long flags; + int ret; + + ipoib_dbg(priv, "REQ arrived\n"); + p = kzalloc(sizeof *p, GFP_KERNEL); + if (!p) + return -ENOMEM; + p->dev = dev; + p->id = cm_id; + p->qp = ipoib_cm_create_rx_qp(dev); + if (IS_ERR(p->qp)) { + ret = PTR_ERR(p->qp); + goto err_qp; + } + + ret = ipoib_cm_modify_rx_rts(dev, cm_id, p->qp); + if (ret) + goto err_modify; + + ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd); + if (ret) { + ipoib_warn(priv, "failed to send REP: %d\n", ret); + goto err_rep; + } + + cm_id->context = p; + spin_lock_irqsave(&priv->lock, flags); + list_add(&p->list, &priv->cm.passive_ids); + spin_unlock_irqrestore(&priv->lock, flags); + return 0; + +err_rep: +err_modify: + ib_destroy_qp(p->qp); +err_qp: + kfree(p); + return ret; 
+} + +int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct ipoib_cm_rx *p; + struct ipoib_dev_priv *priv; + unsigned long flags; + int ret; + + switch (event->event) { + case IB_CM_REQ_RECEIVED: + return ipoib_cm_req_handler(cm_id, event); + case IB_CM_DREQ_RECEIVED: + p = cm_id->context; + ib_send_cm_drep(cm_id, NULL, 0); + /* Fall through */ + case IB_CM_REJ_RECEIVED: + p = cm_id->context; + priv = netdev_priv(p->dev); + spin_lock_irqsave(&priv->lock, flags); + if (list_empty(&p->list)) + ret = 0; /* Connection is going away already. */ + else { + list_del(&p->list); + ret = -ECONNRESET; + } + spin_unlock_irqrestore(&priv->lock, flags); + if (ret) { + ib_destroy_qp(p->qp); + kfree(p); + return ret; + } + return 0; + default: + return 0; + } +} +/* Adjust length of skb with fragments to match received data */ +static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space, + unsigned int length) +{ + int i, num_frags; + unsigned int size; + + /* put header into skb */ + size = min(length, hdr_space); + skb->tail += size; + skb->len += size; + length -= size; + + num_frags = skb_shinfo(skb)->nr_frags; + for (i = 0; i < num_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + + if (length == 0) { + /* don't need this page */ + __free_page(frag->page); + --skb_shinfo(skb)->nr_frags; + } else { + size = min(length, (unsigned) PAGE_SIZE); + + frag->size = size; + skb->data_len += size; + skb->truesize += size; + skb->len += size; + length -= size; + } + } +} + +static void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id & ~IPOIB_OP_SRQ; + struct sk_buff *skb; + dma_addr_t mapping[IPOIB_CM_RX_SG]; + + ipoib_dbg_data(priv, "cm recv completion: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (unlikely(wr_id >= ipoib_recvq_size)) { + ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n", 
+ wr_id, ipoib_recvq_size); + return; + } + + skb = priv->cm.srq_ring[wr_id].skb; + + if (unlikely(wc->status != IB_WC_SUCCESS)) { + ipoib_dbg(priv, "cm recv error " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + ++priv->stats.rx_dropped; + goto repost; + } + + if (unlikely(ipoib_cm_alloc_rx_skb(dev, wr_id, mapping))) { + /* + * If we can't allocate a new RX buffer, dump + * this packet and reuse the old buffer. + */ + ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id); + ++priv->stats.rx_dropped; + goto repost; + } + + ipoib_cm_dma_unmap_rx(priv, priv->cm.srq_ring[wr_id].mapping); + memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, sizeof mapping); + + ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", + wc->byte_len, wc->slid); + + skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len); + + skb->protocol = ((struct ipoib_header *) skb->data)->proto; + skb->mac.raw = skb->data; + skb_pull(skb, IPOIB_ENCAP_LEN); + + dev->last_rx = jiffies; + ++priv->stats.rx_packets; + priv->stats.rx_bytes += skb->len; + + skb->dev = dev; + /* XXX get correct PACKET_ type here */ + skb->pkt_type = PACKET_HOST; + netif_rx_ni(skb); + +repost: + if (unlikely(ipoib_cm_post_receive(dev, wr_id))) + ipoib_warn(priv, "ipoib_cm_post_receive failed " + "for buf %d\n", wr_id); +} + +void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr) +{ + struct net_device *dev = (struct net_device *) dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->cm.ibwc); + for (i = 0; i < n; ++i) + ipoib_cm_handle_rx_wc(dev, priv->cm.ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +static inline int post_send(struct ipoib_dev_priv *priv, + struct ipoib_cm_tx *tx, + unsigned int wr_id, + dma_addr_t addr, int len) +{ + struct ib_send_wr *bad_wr; + + priv->tx_sge.addr = addr; + priv->tx_sge.length = len; + + priv->tx_wr.wr_id = wr_id; + + return 
ib_post_send(tx->qp, &priv->tx_wr, &bad_wr); +} + +void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_tx_buf *tx_req; + dma_addr_t addr; + + if (unlikely(skb->len > tx->mtu)) { + ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n", + skb->len, tx->mtu); + ++priv->stats.tx_dropped; + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + + ipoib_dbg_data(priv, "sending packet %p, head %d length=%d connection=%p\n", + skb, tx->tx_head, skb->len, tx); + + /* + * We put the skb into the tx_ring _before_ we call post_send() + * because it's entirely possible that the completion handler will + * run before we execute anything after the post_send(). That + * means we have to make sure everything is properly recorded and + * our state is consistent before we call post_send(). + */ + tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)]; + tx_req->skb = skb; + addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, + DMA_TO_DEVICE); + if (unlikely(dma_mapping_error(addr))) { + ++priv->stats.tx_errors; + dev_kfree_skb_any(skb); + return; + } + pci_unmap_addr_set(tx_req, mapping, addr); + + if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), + addr, skb->len))) { + ipoib_warn(priv, "post_send failed\n"); + ++priv->stats.tx_errors; + dma_unmap_single(priv->ca->dma_device, addr, skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(skb); + } else { + dev->trans_start = jiffies; + ++tx->tx_head; + + if (tx->tx_head - tx->tx_tail == ipoib_sendq_size) { + ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); + netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags); + } + } +} + +static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx, + struct ib_wc *wc) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + unsigned int wr_id = wc->wr_id; + struct ipoib_tx_buf *tx_req; + 
unsigned long flags; + + ipoib_dbg_data(priv, "cm send completion: id %d, op %d, status: %d\n", + wr_id, wc->opcode, wc->status); + + if (unlikely(wr_id >= ipoib_sendq_size)) { + ipoib_warn(priv, "cm send completion event with wrid %d (> %d)\n", + wr_id, ipoib_sendq_size); + return; + } + + tx_req = &tx->tx_ring[wr_id]; + + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + + /* FIXME: is this right? Shouldn't we only increment on success? */ + ++priv->stats.tx_packets; + priv->stats.tx_bytes += tx_req->skb->len; + + dev_kfree_skb_any(tx_req->skb); + + spin_lock_irqsave(&priv->tx_lock, flags); + ++tx->tx_tail; + if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags) && + tx->tx_head - tx->tx_tail <= ipoib_sendq_size >> 1) { + netif_wake_queue(dev); + } + + if (wc->status != IB_WC_SUCCESS && + wc->status != IB_WC_WR_FLUSH_ERR) { + struct ipoib_neigh *neigh; + + ipoib_dbg(priv, "failed cm send event " + "(status=%d, wrid=%d vend_err %x)\n", + wc->status, wr_id, wc->vendor_err); + + spin_lock(&priv->lock); + neigh = tx->neigh; + + if (neigh) { + neigh->cm = NULL; + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + + tx->neigh = NULL; + } + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { + list_move(&tx->list, &priv->cm.reap_list); + queue_work(ipoib_workqueue, &priv->cm.reap_task); + } + + clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags); + + spin_unlock(&priv->lock); + } + + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) +{ + struct ipoib_cm_tx *tx = tx_ptr; + int n, i; + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + do { + n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc); + for (i = 0; i < n; ++i) + ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i); + } while (n == IPOIB_NUM_WC); +} + +int ipoib_cm_dev_open(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int 
ret; + + if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) + return 0; + + priv->cm.cq = ib_create_cq(priv->ca, ipoib_cm_rx_completion, NULL, dev, + ipoib_recvq_size + 1); + if (IS_ERR(priv->cm.cq)) { + printk(KERN_WARNING "%s: failed to create CQ\n", priv->ca->name); + return PTR_ERR(priv->cm.cq); + } + + ib_req_notify_cq(priv->cm.cq, IB_CQ_NEXT_COMP); + + priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev); + if (IS_ERR(priv->cm.id)) { + printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name); + ib_destroy_cq(priv->cm.cq); + return PTR_ERR(priv->cm.id); + } + + ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num), + 0, NULL); + if (ret) { + printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name, + IPOIB_CM_IETF_ID | priv->qp->qp_num); + ib_destroy_cm_id(priv->cm.id); + ib_destroy_cq(priv->cm.cq); + return ret; + } + return 0; +} + +void ipoib_cm_dev_stop(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_rx *p; + unsigned long flags; + + if (!IPOIB_CM_SUPPORTED(dev->dev_addr)) + return; + + ib_destroy_cm_id(priv->cm.id); + spin_lock_irqsave(&priv->lock, flags); + while (!list_empty(&priv->cm.passive_ids)) { + p = list_entry(priv->cm.passive_ids.next, typeof(*p), list); + list_del_init(&p->list); + spin_unlock_irqrestore(&priv->lock, flags); + ib_destroy_cm_id(p->id); + ib_destroy_qp(p->qp); + kfree(p); + spin_lock_irqsave(&priv->lock, flags); + } + spin_unlock_irqrestore(&priv->lock, flags); + ib_destroy_cq(priv->cm.cq); +} + +static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct ipoib_cm_tx *p = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + struct ipoib_cm_data *data = event->private_data; + struct sk_buff_head skqueue; + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + struct sk_buff *skb; + unsigned long flags; + + p->mtu = be32_to_cpu(data->mtu); + + if (p->mtu < priv->dev->mtu + 
IPOIB_ENCAP_LEN) { + ipoib_warn(priv, "Rejecting connection: mtu %d < device mtu %d + 4\n", + p->mtu, priv->dev->mtu); + return -EINVAL; + } + + qp_attr.qp_state = IB_QPS_RTR; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret); + return ret; + } + + qp_attr.rq_psn = 0 /* FIXME */; + ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret); + return ret; + } + + qp_attr.qp_state = IB_QPS_RTS; + ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret); + return ret; + } + ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret); + return ret; + } + + skb_queue_head_init(&skqueue); + + spin_lock_irqsave(&priv->lock, flags); + set_bit(IPOIB_FLAG_OPER_UP, &p->flags); + if (p->neigh) + while ((skb = __skb_dequeue(&p->neigh->queue))) + __skb_queue_tail(&skqueue, skb); + spin_unlock_irqrestore(&priv->lock, flags); + + while ((skb = __skb_dequeue(&skqueue))) { + skb->dev = p->dev; + if (dev_queue_xmit(skb)) + ipoib_warn(priv, "dev_queue_xmit failed " + "to requeue packet\n"); + } + + ret = ib_send_cm_rtu(cm_id, NULL, 0); + if (ret) { + ipoib_warn(priv, "failed to send RTU: %d\n", ret); + return ret; + } + return 0; +} + +static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_init_attr attr = {}; + attr.recv_cq = priv->cm.cq; + attr.srq = priv->cm.srq; + attr.cap.max_send_wr = ipoib_sendq_size; + attr.cap.max_send_sge = 1; + attr.sq_sig_type = IB_SIGNAL_ALL_WR; + attr.qp_type = IB_QPT_RC; + attr.send_cq = cq; + return ib_create_qp(priv->pd, &attr); +} + +static int ipoib_cm_send_req(struct net_device *dev, + struct ib_cm_id *id, struct ib_qp *qp, + u32 qpn, + struct ib_sa_path_rec 
*pathrec) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_data data = {}; + struct ib_cm_req_param req = {}; + + data.qpn = cpu_to_be32(priv->qp->qp_num); + data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE); + + req.primary_path = pathrec; + req.alternate_path = NULL; + req.service_id = cpu_to_be64(IPOIB_CM_IETF_ID | qpn); + req.qp_num = qp->qp_num; + req.qp_type = qp->qp_type; + req.private_data = &data; + req.private_data_len = sizeof data; + req.flow_control = 0; + + req.starting_psn = 0; /* FIXME */ + + /* + * Pick some arbitrary defaults here; we could make these + * module parameters if anyone cared about setting them. + */ + req.responder_resources = 4; + req.remote_cm_response_timeout = 20; + req.local_cm_response_timeout = 20; + req.retry_count = 0; /* RFC draft warns against retries */ + req.rnr_retry_count = 0; /* RFC draft warns against retries */ + req.max_cm_retries = 15; + req.srq = 15; + return ib_send_cm_req(id, &req); +} + +static int ipoib_cm_modify_tx_init(struct net_device *dev, + struct ib_cm_id *cm_id, struct ib_qp *qp) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_qp_attr qp_attr; + int qp_attr_mask, ret; + ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index); + if (ret) { + ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret); + return ret; + } + + qp_attr.qp_state = IB_QPS_INIT; + qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE; + qp_attr.port_num = priv->port; + qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT; + + ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask); + if (ret) { + ipoib_warn(priv, "failed to modify tx QP to INIT: %d\n", ret); + return ret; + } + return 0; +} + +int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, struct ib_sa_path_rec *pathrec) +{ + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + int ret; + + ipoib_dbg(priv, "Request connection %p for gid " IPOIB_GID_FMT " qpn 0x%x\n", + p, 
IPOIB_GID_ARG(pathrec->dgid), qpn); + + p->tx_ring = kzalloc(ipoib_sendq_size * sizeof *p->tx_ring, + GFP_KERNEL); + if (!p->tx_ring) { + ipoib_warn(priv, "failed to allocate tx ring\n"); + ret = -ENOMEM; + goto err_tx; + } + + p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p, + ipoib_sendq_size + 1); + if (IS_ERR(p->cq)) { + ret = PTR_ERR(p->cq); + ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret); + goto err_cq; + } + + ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP); + if (ret) { + ipoib_warn(priv, "failed to request completion notification: %d\n", ret); + goto err_req_notify; + } + + p->qp = ipoib_cm_create_tx_qp(p->dev, p->cq); + if (IS_ERR(p->qp)) { + ret = PTR_ERR(p->qp); + ipoib_warn(priv, "failed to allocate tx qp: %d\n", ret); + goto err_qp; + } + + p->id = ib_create_cm_id(priv->ca, ipoib_cm_tx_handler, p); + if (IS_ERR(p->id)) { + ret = PTR_ERR(p->id); + ipoib_warn(priv, "failed to create tx cm id: %d\n", ret); + goto err_id; + } + + ret = ipoib_cm_modify_tx_init(p->dev, p->id, p->qp); + if (ret) { + ipoib_warn(priv, "failed to modify tx qp to INIT: %d\n", ret); + goto err_modify; + } + + ret = ipoib_cm_send_req(p->dev, p->id, p->qp, qpn, pathrec); + if (ret) { + ipoib_warn(priv, "failed to send cm req: %d\n", ret); + goto err_send_cm; + } + return 0; + +err_send_cm: +err_modify: + ib_destroy_cm_id(p->id); +err_id: + p->id = NULL; + ib_destroy_qp(p->qp); +err_req_notify: +err_qp: + p->qp = NULL; + ib_destroy_cq(p->cq); +err_cq: + p->cq = NULL; +err_tx: + return ret; +} + +void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) +{ + struct ipoib_dev_priv *priv = netdev_priv(p->dev); + struct ipoib_tx_buf *tx_req; + + ipoib_dbg(priv, "Destroy active connection %p. 
head 0x%x tail 0x%x\n", + p, p->tx_head, p->tx_tail); + + if (p->id) + ib_destroy_cm_id(p->id); + + if (p->qp) + ib_destroy_qp(p->qp); + + if (p->cq) + ib_destroy_cq(p->cq); + + if (p->tx_ring) { + while ((int) p->tx_tail - (int) p->tx_head < 0) { + tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)]; + dma_unmap_single(priv->ca->dma_device, + pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, + DMA_TO_DEVICE); + dev_kfree_skb_any(tx_req->skb); + ++p->tx_tail; + } + + kfree(p->tx_ring); + } + + kfree(p); +} + +int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct ipoib_cm_tx *tx = cm_id->context; + struct ipoib_dev_priv *priv = netdev_priv(tx->dev); + struct ipoib_neigh *neigh; + unsigned long flags; + int ret; + + switch (event->event) { + case IB_CM_DREQ_RECEIVED: + ipoib_dbg(priv, "DREQ received.\n"); + ib_send_cm_drep(cm_id, NULL, 0); + break; + case IB_CM_REP_RECEIVED: + ipoib_dbg(priv, "REP received.\n"); + ret = ipoib_cm_rep_handler(cm_id, event); + if (ret) + ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED, + NULL, 0, NULL, 0); + break; + case IB_CM_REQ_ERROR: + case IB_CM_REJ_RECEIVED: + case IB_CM_TIMEWAIT_EXIT: + ipoib_dbg(priv, "CM error %d.\n", event->event); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + neigh = tx->neigh; + + if (neigh) { + neigh->cm = NULL; + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + + tx->neigh = NULL; + } + + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { + list_move(&tx->list, &priv->cm.reap_list); + queue_work(ipoib_workqueue, &priv->cm.reap_task); + } + + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); + break; + default: + break; + } + + return 0; +} + +struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path, + struct ipoib_neigh *neigh) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_tx *tx; + + tx = 
kzalloc(sizeof *tx, GFP_ATOMIC); + if (!tx) + return NULL; + + neigh->cm = tx; + tx->neigh = neigh; + tx->path = path; + tx->dev = dev; + list_add(&tx->list, &priv->cm.start_list); + set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags); + queue_work(ipoib_workqueue, &priv->cm.start_task); + return tx; +} + +void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx) +{ + struct ipoib_dev_priv *priv = netdev_priv(tx->dev); + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) { + list_move(&tx->list, &priv->cm.reap_list); + queue_work(ipoib_workqueue, &priv->cm.reap_task); + ipoib_dbg(priv, "Reap connection for gid " IPOIB_GID_FMT "\n", + IPOIB_GID_ARG(tx->neigh->dgid)); + tx->neigh = NULL; + } +} + +void ipoib_cm_tx_start(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_neigh *neigh; + struct ipoib_cm_tx *p; + unsigned long flags; + int ret; + + struct ib_sa_path_rec pathrec; + u32 qpn; + + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + while (!list_empty(&priv->cm.start_list)) { + p = list_entry(priv->cm.start_list.next, typeof(*p), list); + list_del_init(&p->list); + neigh = p->neigh; + qpn = IPOIB_QPN(neigh->neighbour->ha); + memcpy(&pathrec, &p->path->pathrec, sizeof pathrec); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); + ret = ipoib_cm_tx_init(p, qpn, &pathrec); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + if (ret) { + neigh = p->neigh; + if (neigh) { + neigh->cm = NULL; + list_del(&neigh->list); + if (neigh->ah) + ipoib_put_ah(neigh->ah); + ipoib_neigh_free(neigh); + } + list_del(&p->list); + kfree(p); + } + } + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +void ipoib_cm_tx_reap(void *dev_ptr) +{ + struct net_device *dev = dev_ptr; + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ipoib_cm_tx *p; + unsigned long flags; + + spin_lock_irqsave(&priv->tx_lock, flags); + 
spin_lock(&priv->lock); + while (!list_empty(&priv->cm.reap_list)) { + p = list_entry(priv->cm.reap_list.next, typeof(*p), list); + list_del(&p->list); + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); + ipoib_cm_tx_destroy(p); + spin_lock_irqsave(&priv->tx_lock, flags); + spin_lock(&priv->lock); + } + spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->tx_lock, flags); +} + +static ssize_t show_mode(struct class_device *cdev, char *buf) +{ + struct net_device *dev = container_of(cdev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags)) + return sprintf(buf, "connected\n"); + else + return sprintf(buf, "datagram\n"); +} + +static ssize_t set_mode(struct class_device *cdev, + const char *buf, size_t count) +{ + struct net_device *dev = container_of(cdev, struct net_device, class_dev); + struct ipoib_dev_priv *priv = netdev_priv(dev); + + /* flush paths if we switch modes so that connections are restarted */ + if (IPOIB_CM_SUPPORTED(dev->dev_addr) && !strcmp(buf, "connected\n")) { + set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); + ipoib_warn(priv, "enabling connected mode breaks multicast!\n"); + ipoib_flush_paths(dev); + return count; + } + + if (!strcmp(buf, "datagram\n")) { + clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); + ipoib_flush_paths(dev); + return count; + } + + return -EINVAL; +} + +static CLASS_DEVICE_ATTR(mode, S_IWUGO | S_IRUGO, show_mode, set_mode); + +int ipoib_cm_add_mode_attr(struct net_device *dev) +{ + return class_device_create_file(&dev->class_dev, &class_device_attr_mode); +} + +int ipoib_cm_dev_init(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_srq_init_attr srq_init_attr = { + .attr = { + .max_wr = ipoib_recvq_size, + .max_sge = IPOIB_CM_RX_SG + } + }; + int ret, i; + + INIT_LIST_HEAD(&priv->cm.passive_ids); + INIT_LIST_HEAD(&priv->cm.reap_list); + INIT_LIST_HEAD(&priv->cm.start_list); + 
INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start, dev); + INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap, dev); + + priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr); + if (IS_ERR(priv->cm.srq)) { + ret = PTR_ERR(priv->cm.srq); + priv->cm.srq = NULL; + return ret; + } + + priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring, + GFP_KERNEL); + if (!priv->cm.srq_ring) { + printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n", + priv->ca->name, ipoib_recvq_size); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + + for (i = 0; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].lkey = priv->mr->lkey; + + priv->cm.rx_sge[0].length = IPOIB_CM_HEAD_SIZE; + for (i = 1; i < IPOIB_CM_RX_SG; ++i) + priv->cm.rx_sge[i].length = PAGE_SIZE; + priv->cm.rx_wr.next = NULL; + priv->cm.rx_wr.sg_list = priv->cm.rx_sge; + priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG; + + for (i = 0; i < ipoib_recvq_size; ++i) { + if (ipoib_cm_alloc_rx_skb(dev, i, priv->cm.srq_ring[i].mapping)) { + ipoib_warn(priv, "failed to allocate receive buffer %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -ENOMEM; + } + if (ipoib_cm_post_receive(dev, i)) { + ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i); + ipoib_cm_dev_cleanup(dev); + return -EIO; + } + } + + priv->dev->dev_addr[0] = IPOIB_FLAGS_RC; + return 0; +} + +void ipoib_cm_dev_cleanup(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + int i, ret; + + if (!priv->cm.srq) + return; + + ipoib_dbg(priv, "Cleanup ipoib connected mode.\n"); + + ret = ib_destroy_srq(priv->cm.srq); + if (ret) + ipoib_warn(priv, "ib_destroy_srq failed: %d\n", ret); + + priv->cm.srq = NULL; + if (!priv->cm.srq_ring) + return; + for (i = 0; i < ipoib_recvq_size; ++i) + if (priv->cm.srq_ring[i].skb) { + ipoib_cm_dma_unmap_rx(priv, priv->cm.srq_ring[i].mapping); + dev_kfree_skb_any(priv->cm.srq_ring[i].skb); + priv->cm.srq_ring[i].skb = NULL; + } + kfree(priv->cm.srq_ring); + priv->cm.srq_ring = NULL; +} 
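A note on the TX ring bookkeeping used by ipoib_cm_send() and ipoib_cm_handle_tx_wc() above: head and tail are free-running unsigned counters, the slot is the counter masked by (ipoib_sendq_size - 1), and the ring is full when head - tail equals the ring size. The following userspace sketch only illustrates that arithmetic. All sketch_* names and the ring size of 8 are hypothetical and not part of the patch, and unlike the patch's test_and_clear_bit() && ... expression, the sketch clears the stopped flag only when it would actually wake the queue.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_SENDQ_SIZE 8u	/* hypothetical; must be a power of two */

struct sketch_tx_ring {
	unsigned int head;	/* bumped after each successful post */
	unsigned int tail;	/* bumped on each send completion */
	bool stopped;		/* stands in for IPOIB_FLAG_NETIF_STOPPED */
};

/* Slot the next post would use: the head counter masked into the ring. */
static unsigned int sketch_slot(const struct sketch_tx_ring *r)
{
	return r->head & (SKETCH_SENDQ_SIZE - 1);
}

/* Record one successful post; mark the queue stopped once the ring is full. */
static void sketch_post(struct sketch_tx_ring *r)
{
	++r->head;
	if (r->head - r->tail == SKETCH_SENDQ_SIZE)
		r->stopped = true;
}

/*
 * Record one completion. Returns true if the queue would be woken, i.e. it
 * was stopped and at most half of the ring is still outstanding.
 */
static bool sketch_complete(struct sketch_tx_ring *r)
{
	++r->tail;
	if (r->stopped && r->head - r->tail <= SKETCH_SENDQ_SIZE >> 1) {
		r->stopped = false;
		return true;
	}
	return false;
}
```

Because the counters are unsigned, head - tail stays correct when head wraps past zero, which is why the ring size has to be a power of two for the mask to keep selecting consecutive slots.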
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 8bf5e9e..2372cfc 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -273,10 +273,10 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(&priv->tx_lock, flags); ++priv->tx_tail; - if (netif_queue_stopped(dev) && - test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) && - priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) + if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags) && + priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) { netif_wake_queue(dev); + } spin_unlock_irqrestore(&priv->tx_lock, flags); if (wc->status != IB_WC_SUCCESS && @@ -378,6 +378,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb, if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); + set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); } } } @@ -429,6 +430,13 @@ int ipoib_ib_dev_open(struct net_device *dev) return -1; } + ret = ipoib_cm_dev_open(dev); + if (ret) { + ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret); + ipoib_ib_dev_stop(dev); + return -1; + } + clear_bit(IPOIB_STOP_REAPER, &priv->flags); queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ); @@ -514,6 +522,8 @@ int ipoib_ib_dev_stop(struct net_device *dev) clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); + ipoib_cm_dev_stop(dev); + /* * Move our QP to the error state and then reinitialize in * when all work requests have completed or have been flushed. 
@@ -603,6 +613,8 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return -ENODEV; } + ipoib_cm_dev_init(dev); + if (dev->flags & IFF_UP) { if (ipoib_ib_dev_open(dev)) { ipoib_transport_dev_cleanup(dev); @@ -659,6 +671,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) ipoib_mcast_stop_thread(dev, 1); ipoib_mcast_dev_flush(dev); + ipoib_cm_dev_cleanup(dev); ipoib_transport_dev_cleanup(dev); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 85522da..5319ac1 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -49,8 +49,6 @@ #include -#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) - MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); @@ -145,6 +143,8 @@ static int ipoib_stop(struct net_device *dev) netif_stop_queue(dev); + clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags); + /* * Now flush workqueue to make sure a scheduled task doesn't * bring our internal state back up. 
@@ -178,8 +178,17 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+	/* dev->mtu > 2K ==> connected mode */
+	if (ipoib_cm_admin_enabled(dev) && new_mtu <= IPOIB_CM_MTU) {
+		if (new_mtu > priv->mcast_mtu)
+			ipoib_warn(priv, "mtu > %d breaks multicast!\n", priv->mcast_mtu);
+		dev->mtu = new_mtu;
+		return 0;
+	}
+
+	if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) {
 		return -EINVAL;
+	}
 
 	priv->admin_mtu = new_mtu;
 
@@ -414,6 +423,20 @@ static void path_rec_completion(int status,
 			memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
 			       sizeof(union ib_gid));
 
+			if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+				if (!ipoib_cm_get(neigh))
+					ipoib_cm_set(neigh, ipoib_cm_create_tx(dev,
+									       path,
+									       neigh));
+				if (!ipoib_cm_get(neigh)) {
+					list_del(&neigh->list);
+					if (neigh->ah)
+						ipoib_put_ah(neigh->ah);
+					ipoib_neigh_free(neigh);
+					continue;
+				}
+			}
+
 			while ((skb = __skb_dequeue(&neigh->queue)))
 				__skb_queue_tail(&skqueue, skb);
 		}
@@ -522,7 +545,25 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
 		memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
 		       sizeof(union ib_gid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+		if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+			if (!ipoib_cm_get(neigh))
+				ipoib_cm_set(neigh, ipoib_cm_create_tx(dev, path, neigh));
+			if (!ipoib_cm_get(neigh)) {
+				list_del(&neigh->list);
+				if (neigh->ah)
+					ipoib_put_ah(neigh->ah);
+				ipoib_neigh_free(neigh);
+				goto err_drop;
+			}
+			if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
+				__skb_queue_tail(&neigh->queue, skb);
+			else {
+				ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
+					   skb_queue_len(&neigh->queue));
+				goto err_drop;
+			}
+		} else
+			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
 	} else {
 		neigh->ah = NULL;
 		__skb_queue_tail(&neigh->queue, skb);
@@ -539,6 +580,7 @@ err_list:
 
 err_path:
 	ipoib_neigh_free(neigh);
+err_drop:
 	++priv->stats.tx_dropped;
 	dev_kfree_skb_any(skb);
 
@@ -641,7 +683,12 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		neigh = *to_ipoib_neigh(skb->dst->neighbour);
 
-		if (likely(neigh->ah)) {
+		if (ipoib_cm_get(neigh)) {
+			if (ipoib_cm_up(neigh)) {
+				ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
+				goto out;
+			}
+		} else if (neigh->ah) {
 			if (unlikely(memcmp(&neigh->dgid.raw,
 					    skb->dst->neighbour->ha + 4,
 					    sizeof(union ib_gid)))) {
@@ -805,6 +852,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 
 	neigh->neighbour = neighbour;
 	*to_ipoib_neigh(neighbour) = neigh;
+	ipoib_cm_set(neigh, NULL);
 
 	return neigh;
 }
@@ -812,6 +860,8 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 void ipoib_neigh_free(struct ipoib_neigh *neigh)
 {
 	*to_ipoib_neigh(neigh->neighbour) = NULL;
+	if (ipoib_cm_get(neigh))
+		ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
 	kfree(neigh);
 }
 
@@ -1075,6 +1125,8 @@ static struct net_device *ipoib_add_port(const char *format,
 
 	ipoib_create_debug_files(priv->dev);
 
+	if (ipoib_cm_add_mode_attr(priv->dev))
+		goto sysfs_failed;
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
 	if (class_device_create_file(&priv->dev->class_dev,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 3faa182..ea387b3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -594,7 +594,9 @@ void ipoib_mcast_join_task(void *dev_ptr)
 
 	priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) -
 		IPOIB_ENCAP_LEN;
-	dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
+
+	if (!ipoib_cm_admin_enabled(dev))
+		dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
 
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index f887780..d9fd82d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -115,6 +115,8 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 
 	ipoib_create_debug_files(priv->dev);
 
+	if (ipoib_cm_add_mode_attr(priv->dev))
+		goto sysfs_failed;
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
 
-- 
MST

From halr at voltaire.com Sun Dec 10 08:10:42 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Dec 2006 11:10:42 -0500
Subject: [openib-general] [PATCH] osm: trivial osm_log missmatch on vendor mlx
In-Reply-To: <457B1815.7000404@mellanox.co.il>
References: <457B1815.7000404@mellanox.co.il>
Message-ID: <1165767007.26559.111479.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 15:09, Eitan Zahavi wrote:
> Hi Hal
>
> This patch fixes some osm_log issues on the mlx vendor.
>
> Signed-off-by: Eitan Zahavi
>
> ---
>  osm/libvendor/osm_vendor_mlx_dispatcher.c |    3 ++-
>  osm/libvendor/osm_vendor_mlx_txn.c        |    2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/osm/libvendor/osm_vendor_mlx_dispatcher.c
> b/osm/libvendor/osm_vendor_mlx_dispatcher.c
> index e8b47dd..7e3bd78 100644
> --- a/osm/libvendor/osm_vendor_mlx_dispatcher.c
> +++ b/osm/libvendor/osm_vendor_mlx_dispatcher.c
> @@ -134,7 +134,8 @@ osmv_dispatch_mad(IN osm_bind_handle_t
>  {
>
>  osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG,
> - "The bind handle %p is being closed. The MAD will not be
> dispatched.\n");

This line is wrapped.

> + "The bind handle %p is being closed. "
> + "The MAD will not be dispatched.\n", p_bo);
>
>  ret = IB_INTERRUPTED;
>  goto dispatch_mad_done;
> diff --git a/osm/libvendor/osm_vendor_mlx_txn.c
> b/osm/libvendor/osm_vendor_mlx_txn.c
> index 1fd262f..234e33b 100644
> --- a/osm/libvendor/osm_vendor_mlx_txn.c
> +++ b/osm/libvendor/osm_vendor_mlx_txn.c
> @@ -631,7 +631,7 @@ __osmv_txn_timeout_cb(IN uint64_t key,
>
>  osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG,
>  "__osmv_txn_timeout_cb: "
> - "Retry request timout in : %u [msec].\n",
> + "Retry request timout in : %lu [msec].\n",
>  next_timeout_ms);
>  }
>  }
> --
> 1.4.4.1.GIT

Thanks. Applied with osm_vendor_mlx_dispatcher.c done by hand.

-- Hal

From halr at voltaire.com Sun Dec 10 08:20:55 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Dec 2006 11:20:55 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210064346.GC10403@mellanox.co.il>
References: <1165701912.26559.65050.camel@hal.voltaire.com> <20061210064346.GC10403@mellanox.co.il>
Message-ID: <1165767642.26559.111898.camel@hal.voltaire.com>

On Sun, 2006-12-10 at 01:43, Michael S. Tsirkin wrote:
> > > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > > automatically apply the patches from emails to the local pre-trunk
> > > > tree? Or do you think it is insufficient?
> > >
> > > This sounds like a fragile process. It seems much easier to just
> > > have an unstable branch with untested patches. No?
> >
> > Untested is an overexaggeration. They are tested but not by Eitan's
> > regression.
>
> Sorry, I'm not trying to influence any policy decisions here,
> I'm coming purely from git angle. *If* you want Eitan to test and Ack some
> patches, *and want to automate the testing part*, the simplest thing to do is to
> apply them on some git branch.

Couldn't he also back off the head on the "trunk" if that doesn't work
too ? That (which version) could be taken as input to the automatic
regression with less overhead than another branch to have to track or
figuring out how to apply patches automagically.

-- Hal

> --
> MST
>
>

From rdreier at cisco.com Sun Dec 10 09:34:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 09:34:18 -0800
Subject: [openib-general] version #defines for the kernel
References: <045401c71b02$d8d17a40$0281a8c0@ebpc> <20061209193443.GB6891@mellanox.co.il>
Message-ID:

 > include/net/ieee80211.h has one. It does not seem to work too well though.

Do you mean

#define IEEE80211_VERSION "git-1.1.13"

The only thing it seems useful for is printing out -- you certainly
can't compare a string like that in any sane way using the C
preprocessor.

 - R.

From rdreier at cisco.com Sun Dec 10 09:39:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 09:39:18 -0800
Subject: [openib-general] version #defines for the kernel
References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com> <20061208233616.GA10646@greglaptop>
Message-ID:

 > > But you should also cope with
 > > non-OFED (vanilla upstream) drivers, probably by testing
 > > LINUX_VERSION_CODE too I suppose.
 >
 > Although RHEL4 shows how this can break down in the future... they
 > backport kernel stuff while leaving LINUX_VERSION_CODE set to 2.6.9.

I don't think there's any sane way to handle that. Since a backport
might only pick part of the new interface and stick with an old API
elsewhere, you can't have a single IB version number. And I don't
want an ever-growing mass of "#define HAVE_FEATURE_BLAH" metastasizing
in the IB headers...

From mst at mellanox.co.il Sun Dec 10 10:29:54 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 20:29:54 +0200
Subject: [openib-general] version #defines for the kernel
In-Reply-To:
References:
Message-ID: <20061210182954.GB1708@mellanox.co.il>

> > include/net/ieee80211.h has one. It does not seem to work too well though.
>
> Do you mean
>
> #define IEEE80211_VERSION "git-1.1.13"

Yes.

> The only thing it seems useful for is printing out -- you certainly
> can't compare a string like that in any sane way using the C
> preprocessor.

Right. Intel's out of tree drivers which I looked at at some point
try to run scripts to parse this, and fail miserably.

-- 
MST

From mst at mellanox.co.il Sun Dec 10 11:54:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 21:54:05 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165767642.26559.111898.camel@hal.voltaire.com>
References: <1165701912.26559.65050.camel@hal.voltaire.com> <20061210064346.GC10403@mellanox.co.il> <1165767642.26559.111898.camel@hal.voltaire.com>
Message-ID: <20061210195405.GE1708@mellanox.co.il>

> > > > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > > > automatically apply the patches from emails to the local pre-trunk
> > > > > tree? Or do you think it is insufficient?
> > > >
> > > > This sounds like a fragile process. It seems much easier to just
> > > > have an unstable branch with untested patches. No?
> > >
> > > Untested is an overexaggeration. They are tested but not by Eitan's
> > > regression.
> >
> > Sorry, I'm not trying to influence any policy decisions here, I'm coming
> > purely from git angle. *If* you want Eitan to test and Ack some patches,
> > *and want to automate the testing part*, the simplest thing to do is
> > to apply them on some git branch.
>
> Couldn't he also back off the head on the "trunk" if that doesn't work
> too ? That (which version) could be taken as input to the automatic
> regression with less overhead than another branch to have to track or
> figuring out how to apply patches automagically.

No, this is backwards - rewinding history in trunk branch will break git pull for anyone
who tries to base his work on that, so that's not a good idea.
Or you get a lot of little
"feature X"
"unbreak feature X"
...
"fix feature X"
commits that just make the history log messy and unreadable.

Guys, don't be so scared of branches, they don't really have
any significant overhead in git: branch (and tag) are basically
just symbolic names for commit.

There's no "maintenance" associated with it that I know of. Try it.
This is how e.g. git itself seems to be developed: there's a main branch for next release,
next branch for less stable stuff and "pu" branch for experimental stuff,
and there's a bugfix branch for last stable release.

-- 
MST

From sashak at voltaire.com Sun Dec 10 12:52:03 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 10 Dec 2006 22:52:03 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210195405.GE1708@mellanox.co.il>
References: <1165701912.26559.65050.camel@hal.voltaire.com> <20061210064346.GC10403@mellanox.co.il> <1165767642.26559.111898.camel@hal.voltaire.com> <20061210195405.GE1708@mellanox.co.il>
Message-ID: <20061210205203.GA21155@sashak.voltaire.com>

On 21:54 Sun 10 Dec , Michael S. Tsirkin wrote:
> > > > > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > > > > automatically apply the patches from emails to the local pre-trunk
> > > > > > tree? Or do you think it is insufficient?
> > > > >
> > > > > This sounds like a fragile process. It seems much easier to just
> > > > > have an unstable branch with untested patches. No?
> > > >
> > > > Untested is an overexaggeration. They are tested but not by Eitan's
> > > > regression.
> > >
> > > Sorry, I'm not trying to influence any policy decisions here, I'm coming
> > > purely from git angle. *If* you want Eitan to test and Ack some patches,
> > > *and want to automate the testing part*, the simplest thing to do is
> > > to apply them on some git branch.
> >
> > Couldn't he also back off the head on the "trunk" if that doesn't work
> > too ? That (which version) could be taken as input to the automatic
> > regression with less overhead than another branch to have to track or
> > figuring out how to apply patches automagically.
>
> No, this is backwards - rewinding history in trunk branch will break git pull for anyone
> who tries to base his work on that, so that's not a good idea.

I think Hal was talking about rewinding a local tree; there is nothing
wrong with that. In general non-linear history changes in public
repositories are not something "impossible", basically this should work,
but may require additional merging efforts from pullers. I also think
that it is better to not do it, at least not now.

> Or you get a lot of little
> "feature X"
> "unbreak feature X"
> ...
> "fix feature X"
> commits that just make the history log messy and unreadable.
>
> Guys, don't be so scared of branches, they don't really have
> any significant overhead in git: branch (and tag) are basically
> just symbolic names for commit.

Right, branch in git is cheap, and if one needs branch in his tree he can
just create this branch in his tree, it is not necessary to ask origin's
tree maintainer to create this branch for him.

Sasha

>
> There's no "maintenance" associated with it that I know of. Try it.
> This is how e.g. git itself seems to be developed: there's a main branch for next release,
> next branch for less stable stuff and "pu" branch for experimental stuff,
> and there's a bugfix branch for last stable release.
>
> --
> MST
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From mst at mellanox.co.il Sun Dec 10 13:05:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:05:43 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210205203.GA21155@sashak.voltaire.com>
References: <20061210205203.GA21155@sashak.voltaire.com>
Message-ID: <20061210210543.GB9205@mellanox.co.il>

> > Guys, don't be so scared of branches, they don't really have
> > any significant overhead in git: branch (and tag) are basically
> > just symbolic names for commit.
>
> Right, branch in git is cheap, and if one needs branch in his tree he can
> just create this branch in his tree, it is not necessary to ask origin's
> tree maintainer to create this branch for him.

I agree.
If, on the other hand, the tree maintainer wants someone else to test a set of
patches automatically, the simplest way is for *said maintainer* to create a
known branch or tag with that patch set, and have test scripts pick that up.

It's really simple - if you want people to help, make it easy on them.

-- 
MST

From adit.262 at gmail.com Sun Dec 10 13:26:18 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Sun, 10 Dec 2006 16:26:18 -0500
Subject: [openib-general] Assigning IP addresses to IB interfaces
In-Reply-To: <457BBA6A.3020209@voltaire.com>
References: <457BBA6A.3020209@voltaire.com>
Message-ID:

I tried assigning IP addresses to IB interfaces -
ifconfig ib1 10.0.0.1
ifconfig ib1 10.0.0.2 on the other machine

Did a "ping 10.0.0.2 -I ib1" from the first - it says destination host
unreachable.
Is there anything specific to be done for being able to ping between the 2
interfaces?

Thanks,
Adit

On 12/10/06, Or Gerlitz wrote:
> Adit Ranadive wrote:
> > I have installed the OpenIB gen2 driver but the IB interfaces haven't
> > been assigned any IP addresses..
> > Is it possible to assign them ip addresses using ifconfig and ping
> > between the interfaces of two machines?
>
> yes
>
>

-- 
Adit Ranadive
Freshman, Georgia Institute of Technology,
Atlanta, GA

From mst at mellanox.co.il Sun Dec 10 13:39:20 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:39:20 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061129214302.GF18978@sashak.voltaire.com>
References: <1164829955.28427.69.camel@stevo-desktop> <20061129203916.GL16763@sashak.voltaire.com> <1164835084.28427.83.camel@stevo-desktop> <20061129214302.GF18978@sashak.voltaire.com>
Message-ID: <20061210213920.GF9205@mellanox.co.il>

> Sean, you can do
>
> 	chmod 755 hooks/post-update
>
> This hook runs git-update-server-info after each push.

It seems we really want this as default.
Sasha, could you please
chmod 755 /usr/share/git-core/templates/hooks/pre-commit
so that this will be the default for all new users?

-- 
MST

From mst at mellanox.co.il Sun Dec 10 13:40:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:40:35 +0200
Subject: [openib-general] Assigning IP addresses to IB interfaces
In-Reply-To:
References: <457BBA6A.3020209@voltaire.com>
Message-ID: <20061210214035.GG9205@mellanox.co.il>

Any chance that SM isn't running on the fabric?
Did the ports come up?

Quoting r. Adit Ranadive :
Subject: Re: Assigning IP addresses to IB interfaces

I tried assigning IP addresses to IB interfaces -
ifconfig ib1 10.0.0.1
ifconfig ib1 10.0.0.2 on the other machine

Did a "ping 10.0.0.2 -I ib1" from the first - it says destination host
unreachable.
Is there anything specific to be done for being able to ping between the 2
interfaces?

Thanks,
Adit

On 12/10/06, Or Gerlitz wrote:
> Adit Ranadive wrote:
> > I have installed the OpenIB gen2 driver but the IB interfaces haven't
> > been assigned any IP addresses..
> > Is it possible to assign them ip addresses using ifconfig and ping
> > between the interfaces of two machines?
>
> yes
>
>

-- 
Adit Ranadive
Freshman, Georgia Institute of Technology,
Atlanta, GA

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- 
MST

From sashak at voltaire.com Sun Dec 10 13:50:33 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 10 Dec 2006 23:50:33 +0200
Subject: [openib-general] userspace git trees
Message-ID: <20061210215033.GC21155@sashak.voltaire.com>

Hi,

Recently I found this OFA 'Userspace Git Trees' downloading howto:

https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories

and thought that we could make it simpler for end-user to choose the
"right" git tree just by adding one more series of symbolic links under
/pub/scm. These links will point to the maintainer's "official" trees, and
we will have only one such link per project.

So typical downloading howto for end-users will look like:

  git clone git://staging.openfabrics.org/dapl
  git clone git://staging.openfabrics.org/ibutils
  git clone git://staging.openfabrics.org/imgen
  ...

instead of

  git clone git://staging.openfabrics.org/~ardavis/dapl
  git clone git://staging.openfabrics.org/~eitan/ibutils
  git clone git://staging.openfabrics.org/~mst/imgen
  ...

as it is now.

To illustrate this I've added already couple of such symbolic links under
/pub/scm and it is visible now via gitweb:

http://staging.openfabrics.org/git

Comments, objections?

(I did this just to show how this looks and probably missed some projects.
And of course I will remove those links if this idea will be rejected.)

Sasha

From mst at mellanox.co.il Sun Dec 10 13:59:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:59:57 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210215033.GC21155@sashak.voltaire.com>
References: <20061210215033.GC21155@sashak.voltaire.com>
Message-ID: <20061210215956.GI9205@mellanox.co.il>

> Recently I found this OFA 'Userspace Git Trees' downloading howto:
>
> https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
>
> and thought that we could make it simpler for end-user to choose the
> "right" git tree just by adding one more series of symbolic links under
> /pub/scm. These links will point to the maintainer's "official" trees, and
> we will have only one such link per project.
>
> So typical downloading howto for end-users will look like:
>
>   git clone git://staging.openfabrics.org/dapl
>   git clone git://staging.openfabrics.org/ibutils
>   git clone git://staging.openfabrics.org/imgen
>   ...
>
> instead of
>
>   git clone git://staging.openfabrics.org/~ardavis/dapl
>   git clone git://staging.openfabrics.org/~eitan/ibutils
>   git clone git://staging.openfabrics.org/~mst/imgen
>   ...
>
> as it is now.

NACK, please remove this.

These soft links are messy, and the fact that one needs root
just to add a tree shows just how the approach is broken.

If you have some temporary tree, just mention this in description,
and gitweb will show this.

And won't the problem basically go away if you move ~sashak temporary trees out of
~/scm? It seems we don't have a lot of duplicates besides that.

But in the long run, no development git tree is or should be the *official* one -
otherwise we get back to the mess we had with svn, with people pushing for
inclusion in the "official" tree just to get visibility. The result? The
"official" tree then becomes also the least stable.

What we need is official *releases*. Not official development trees.
And end users should either stick to releases or know what they are doing
and select the tree they actually *want*.

-- 
MST

From swise at opengridcomputing.com Sun Dec 10 14:04:33 2006
From: swise at opengridcomputing.com (Steve WIse)
Date: Sun, 10 Dec 2006 16:04:33 -0600
Subject: [openib-general] [PATCH] - ucma updates for miscdev changes
Message-ID: <1165788273.25243.8.camel@linux-q667.site>

Sean,

As part of merging up to linus's tree as of 12/8/2006, I had to change
ucma.c to support changes in the miscdevice stuff. Below is a patch for
this.

In addition to this change, I had to merge your ucma patches to get them
to apply. Nothing functional changed, I don't think, but some of the
changes in your tree are already in linus's tree, so those patches were
ignored. And one didn't apply cleanly and I had to fix it manually.

You can see these changes including the patch below as a single patch in
git://staging.openfabrics.org/~swise/cxgb3.git commit number:
d1ac2e74680d61a5e87165e1c6b4cec44533f2bd.

Signed-off-by: Steve Wise

-----

--- rdma-dev/drivers/infiniband/core/ucma.c 2006-12-08 11:03:31.000000000 -0600
+++ cxgb3.git/drivers/infiniband/core/ucma.c 2006-12-09 09:41:03.000000000 -0600
@@ -836,11 +836,12 @@ static struct miscdevice ucma_misc = {
 	.fops = &ucma_fops,
 };
 
-static ssize_t show_abi_version(struct class_device *class_dev, char *buf)
+static ssize_t show_abi_version(struct device *class_dev,
+				struct device_attribute *attr, char *buf)
 {
 	return sprintf(buf, "%d\n", RDMA_USER_CM_ABI_VERSION);
 }
-static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
+static DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
 
 static int __init ucma_init(void)
 {
@@ -850,8 +851,7 @@ static int __init ucma_init(void)
 	if (ret)
 		return ret;
 
-	ret = class_device_create_file(ucma_misc.class,
-				       &class_device_attr_abi_version);
+	ret = device_create_file(ucma_misc.this_device, &dev_attr_abi_version);
 	if (ret) {
 		printk(KERN_ERR "rdma_ucm: couldn't create abi_version attr\n");
 		goto err;
@@ -864,8 +864,7 @@ err:
 
 static void __exit ucma_cleanup(void)
 {
-	class_device_remove_file(ucma_misc.class,
-				 &class_device_attr_abi_version);
+	device_remove_file(ucma_misc.this_device, &dev_attr_abi_version);
 	misc_deregister(&ucma_misc);
 	idr_destroy(&ctx_idr);
 }

From sashak at voltaire.com Sun Dec 10 14:18:05 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 00:18:05 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061210213920.GF9205@mellanox.co.il>
References: <1164829955.28427.69.camel@stevo-desktop> <20061129203916.GL16763@sashak.voltaire.com> <1164835084.28427.83.camel@stevo-desktop> <20061129214302.GF18978@sashak.voltaire.com> <20061210213920.GF9205@mellanox.co.il>
Message-ID: <20061210221805.GD21155@sashak.voltaire.com>

On 23:39 Sun 10 Dec , Michael S. Tsirkin wrote:
> > Sean, you can do
> >
> > 	chmod 755 hooks/post-update
> >
> > This hook runs git-update-server-info after each push.
>
> It seems we really want this as default.
> Sasha, could you please
> chmod 755 /usr/share/git-core/templates/hooks/pre-commit
> so that this will be the default for all new users?

Would prefer to not do this. All hooks are "off" is reasonable default
IMO and this should be tree maintainer's decision to enable specific
hook or not.

If somebody needs help with setup, we can help, or we could write sort of
'howto' if there are common problems. But I think we cannot take "ownership"
there.

Sasha

From sashak at voltaire.com Sun Dec 10 14:33:29 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 00:33:29 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210210543.GB9205@mellanox.co.il>
References: <20061210205203.GA21155@sashak.voltaire.com> <20061210210543.GB9205@mellanox.co.il>
Message-ID: <20061210223329.GE21155@sashak.voltaire.com>

On 23:05 Sun 10 Dec , Michael S. Tsirkin wrote:
> > > Guys, don't be so scared of branches, they don't really have
> > > any significant overhead in git: branch (and tag) are basically
> > > just symbolic names for commit.
> >
> > Right, branch in git is cheap, and if one needs branch in his tree he can
> > just create this branch in his tree, it is not necessary to ask origin's
> > tree maintainer to create this branch for him.
>
> I agree.
> If, on the other hand, the tree maintainer wants someone else to test a set of
> patches automatically, the simplest way is for *said maintainer* to create a
> known branch or tag with that patch set, and have test scripts pick that up.
>
> It's really simple - if you want people to help, make it easy on them.

I agree with last sentence, but it is not "git angle" :)

Sasha

From swise at opengridcomputing.com Sun Dec 10 14:32:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:32:44 -0600
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
Message-ID: <20061210223244.27166.36192.stgit@dell3.ogc.int>

Roland, I believe all comments so far have been incorporated.

Version 3 changes:

- BugFix: Don't use mutex inside of the mmap function.
- BugFix: Move QP to TERMINATE when TERMINATE AE is processed
- Support the new work queue design
- Merged up to linus's tree as of 12/8/2006
- Misc nits

Version 2 changes:

- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-----

The following series implements the Chelsio T3 iWARP/RDMA Driver to
be considered for inclusion in 2.6.20. It depends on the Chelsio T3
Ethernet driver which is also under review now for 2.6.20. See

http://www.mail-archive.com/netdev at vger.kernel.org/msg27801.html

for the latest posting of the T3 Ethernet driver.

This patch series is against Linus's tree as of 12/8/2006 and can also be
pulled from:

http://www.opengridcomputing.com/downloads/iw_cxgb3_patches_v3.tar.bz2

The Chelsio T3 Ethernet driver patch can be pulled from:

http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.

From swise at opengridcomputing.com Sun Dec 10 14:33:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:33:15 -0600
Subject: [openib-general] [PATCH v3 01/13] Linux RDMA Core Changes
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223314.27166.28952.stgit@dell3.ogc.int>

Support provider-specific data in ib_uverbs_cmd_req_notify_cq().
The Chelsio iwarp provider library needs to pass information to the
kernel verb for re-arming the CQ.

Signed-off-by: Steve Wise
---

 drivers/infiniband/core/uverbs_cmd.c      |    9 +++++++--
 drivers/infiniband/hw/amso1100/c2.h       |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c    |    3 ++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    3 ++-
 drivers/infiniband/hw/ehca/ehca_reqs.c    |    3 ++-
 drivers/infiniband/hw/ipath/ipath_cq.c    |    4 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.h |    3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c    |    6 ++++--
 drivers/infiniband/hw/mthca/mthca_dev.h   |    4 ++--
 include/rdma/ib_verbs.h                   |    5 +++--
 10 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..5dd1de9 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 				int out_len)
 {
 	struct ib_uverbs_req_notify_cq cmd;
+	struct ib_udata                udata;
 	struct ib_cq                  *cq;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 	if (!cq)
 		return -EINVAL;
 
-	ib_req_notify_cq(cq, cmd.solicited_only ?
-			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+	INIT_UDATA(&udata, buf + sizeof cmd, 0,
+		   in_len - sizeof cmd, 0);
+
+	cq->device->req_notify_cq(cq, cmd.solicited_only ?
+				  IB_CQ_SOLICITED : IB_CQ_NEXT_COMP,
+				  &udata);
 
 	put_cq_read(cq);
 
diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h
index 04a9db5..9a76869 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..7ce8bca 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n
 	return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+	      struct ib_udata *udata)
 {
 	struct c2_mq_shared __iomem *shared;
 	struct c2_cq *cq;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 3720e30..566b30c 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			     struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..3ed6992 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,7 +634,8 @@ poll_cq_exit0:
 	return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 87462e0..27ba4db 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
  * @notify: the type of notification to request
+ * @udata: user data
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context. Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata)
 {
 	struct ipath_cq *cq = to_icq(ibcq);
 	unsigned long flags;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 8039f6e..0d39960 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_
 
 int ipath_destroy_cq(struct ib_cq *ibcq);
 
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata);
 
 int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 283d50b..15cbd49 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -722,7 +722,8 @@ repoll:
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }
 
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	__be32 doorbell[2];
 
@@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 	return 0;
 }
 
-int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
 	__be32 doorbell[2];
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index fe5cecf..6b9ccf6 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
 
 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8eacc35..e3e1a2c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -941,7 +941,8 @@ struct ib_device {
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int                        (*req_notify_cq)(struct ib_cq *cq,
-						    enum ib_cq_notify cq_notify);
+						    enum ib_cq_notify cq_notify,
+						    struct ib_udata *udata);
 	int                        (*req_ncomp_notif)(struct ib_cq *cq,
 						      int wc_cnt);
 	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
@@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, cq_notify, NULL);
 }
 
 /**

From swise at opengridcomputing.com Sun Dec 10 14:33:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:33:45 -0600
Subject: [openib-general] [PATCH v3 02/13] Device Discovery and ULLD Linkage
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223345.27166.26908.stgit@dell3.ogc.int>

Code to discover all the T3 devices and register them
with the T3 RDMA Core and the Linux RDMA Core.

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/iwch.c |  189 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch.h |  175 +++++++++++++++++++++++++++++++++
 2 files changed, 364 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
new file mode 100644
index 0000000..acbe449
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define DRV_VERSION "1.1" + +MODULE_AUTHOR("Boyd Faulkner, Steve Wise"); +MODULE_DESCRIPTION("Chelsio T3 RDMA Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; + +static void open_rnic_dev(struct t3cdev *); +static void close_rnic_dev(struct t3cdev *); + +struct cxgb3_client t3c_client = { + .name = "iw_cxgb3", + .add = open_rnic_dev, + .remove = close_rnic_dev, + .handlers = t3c_handlers, + .redirect = iwch_ep_redirect +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(dev_mutex); + +static void rnic_init(struct iwch_dev *rnicp) +{ + PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); + idr_init(&rnicp->cqidr); + idr_init(&rnicp->qpidr); + idr_init(&rnicp->mmidr); + spin_lock_init(&rnicp->lock); + + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1; + rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev); + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB 
*/ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 8; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1; + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + rnicp->attr.cq_overflow_detection = 1; + return; +} + +static void open_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + static int vers_printed; + + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + if (!vers_printed++) + printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + DRV_VERSION); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR MOD "Cannot allocate ib device\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + printk(KERN_ERR MOD "Unable to open CXIO rdev\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + rnic_init(rnicp); + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR MOD "Unable to register device\n"); + close_rnic_dev(tdev); + } + printk(KERN_INFO MOD "Initialized device %s\n", + pci_name(rnicp->rdev.rnic_info.pdev)); + return; +} + +static void close_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + mutex_lock(&dev_mutex); + list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + idr_destroy(&dev->cqidr); + idr_destroy(&dev->qpidr); + 
idr_destroy(&dev->mmidr); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + cxgb3_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + cxgb3_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..752b6ad --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include +#include +#include +#include + +#include + +#include "cxio_hal.h" +#include "cxgb3_offload.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. + * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. 
+ */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct idr cqidr; + struct idr qpidr; + struct idr mmidr; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +static inline int t3b_device(const struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3B); +} + +static inline int t3a_device(const struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3A); +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid) +{ + return idr_find(&rhp->cqidr, cqid); +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid) +{ + return idr_find(&rhp->qpidr, qpid); +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) +{ + return idr_find(&rhp->mmidr, mmid); +} + +static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, + void *handle, u32 id) +{ + int ret; + u32 newid; + + do { + if (!idr_pre_get(idr, GFP_KERNEL)) { + return -ENOMEM; + } + spin_lock_irq(&rhp->lock); + ret = idr_get_new_above(idr, handle, id, &newid); + BUG_ON(newid != id); + spin_unlock_irq(&rhp->lock); + } while (ret == -EAGAIN); + + return ret; +} + +static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id) +{ + spin_lock_irq(&rhp->lock); + idr_remove(idr, id); + spin_unlock_irq(&rhp->lock); +} + +extern struct cxgb3_client t3c_client; +extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif From swise at opengridcomputing.com Sun 
Dec 10 14:34:15 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:34:15 -0600 Subject: [openib-general] [PATCH v3 03/13] Provider Methods and Data Structures In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223415.27166.42003.stgit@dell3.ogc.int> Provider methods to support the Linux RDMA verbs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 363 ++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 68 ++ 3 files changed, 1602 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..e9721b1 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1171 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + struct iwch_dev *rhp = to_iwch_dev(context->device); + struct iwch_ucontext *ucontext = to_iwch_ucontext(context); + PDBG("%s context %p\n", __FUNCTION__, context); + cxio_release_ucontext(&rhp->rdev, &ucontext->uctx); + kfree(ucontext); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext 
*context; + struct iwch_dev *rhp = to_iwch_dev(ibdev); + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + cxio_init_ucontext(&rhp->rdev, &context->uctx); + INIT_LIST_HEAD(&context->mmaps); + spin_lock_init(&context->mmap_lock); + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq); + chp = to_iwch_cq(ib_cq); + + remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid); + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + if (t3a_device(rhp)) { + + /* + * T3A: Add some fluff to handle extra CQEs inserted + * for various errors. + * Additional CQE possibilities: + * TERMINATE, + * incoming RDMA WRITE Failures + * incoming RDMA READ REQUEST FAILUREs + * NOTE: We cannot ensure the CQ won't overflow. 
+ */ + entries += 16; + } + entries = roundup_pow_of_two(entries); + chp->cq.size_log2 = ilog2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid); + + if (context) { + struct iwch_mm_entry *mm; + + mm = kmalloc(sizeof *mm, GFP_KERNEL); + if (!mm) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-ENOMEM); + } + uresp.cqid = chp->cq.cqid; + uresp.size_log2 = chp->cq.size_log2; + uresp.physaddr = virt_to_phys(chp->cq.queue); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm); + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + mm->addr = uresp.physaddr; + mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * + sizeof (struct t3_cqe)); + insert_mmap(to_iwch_ucontext(context), mm); + } + PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = ilog2(cqe); + + /* Dont allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + unsigned long flag; + struct iwch_req_notify_cq ucmd; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + if (udata && t3b_device(rhp)) { + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + spin_lock_irqsave(&chp->lock, flag); + chp->cq.rptr = ucmd.rptr; + } else + spin_lock_irqsave(&chp->lock, flag); + PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flag); + if (err) + printk(KERN_ERR MOD "Error %d
rearming CQID 0x%x\n", err, + chp->cq.cqid); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT; + struct cxio_rdev *rdev_p; + int ret = 0; + struct iwch_mm_entry *mm; + struct iwch_ucontext *ucontext; + + PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, + pgaddr, len); + + if (vma->vm_start & (PAGE_SIZE-1)) { + return -EINVAL; + } + + rdev_p = &(to_iwch_dev(context->device)->rdev); + ucontext = to_iwch_ucontext(context); + + mm = remove_mmap(ucontext, pgaddr, len); + if (!mm) + return -EINVAL; + kfree(mm); + + if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && + (pgaddr < (rdev_p->rnic_info.udbell_physbase + + rdev_p->rnic_info.udbell_len))) { + + /* + * Map T3 DB register. + */ + if (vma->vm_flags & VM_READ) { + return -EPERM; + } + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; + vma->vm_flags &= ~VM_MAYREAD; + ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } else { + + /* + * Map WQ or CQ contig dma memory... 
+ */ + ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } + + return ret; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + + php = to_iwch_pd(pd); + rhp = php->rhp; + PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid); + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + u32 mmid; + + PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr); + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mmid = mhp->attr.stag >> 8; + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, + mhp->attr.pbl_addr); + remove_handle(rhp, &rhp->mmidr, mmid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); + kfree(mhp); + return 0; +} + +static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + __be64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; 
+ struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* NOTE: TPT perms are backwards from BIND WR perms! */ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + __be64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd); + + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + 
+ memcpy(&mh, mhp, sizeof *mhp); + + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + __be64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + struct iwch_reg_user_mr_resp uresp; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, &region->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, &region->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |=
(acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + + if (udata && t3b_device(rhp)) { + uresp.pbl_addr = (mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3; + PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, + uresp.pbl_addr); + + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_dereg_mr(&mhp->ibmr); + err = -EFAULT; + goto err; + } + } + + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + + /* + * T3 only supports 32 bits of size. + */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + u32 mmid; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + mmid = (mw->rkey) >> 8; + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + remove_handle(rhp, &rhp->mmidr, mmid); + 
kfree(mhp); + PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + struct iwch_ucontext *ucontext; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) + : NULL; + cxio_destroy_qp(&rhp->rdev, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx); + + PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, + ib_qp, qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + struct iwch_ucontext *ucontext; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize > T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * NOTE: The SQ and 
total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. + */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = ilog2(wqsize); + qhp->wq.rq_size_log2 = ilog2(rqsize); + qhp->wq.sq_size_log2 = ilog2(sqsize); + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - These don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... 
I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid); + + if (udata) { + + struct iwch_mm_entry *mm1, *mm2; + + mm1 = kmalloc(sizeof *mm1, GFP_KERNEL); + if (!mm1) { + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + mm2 = kmalloc(sizeof *mm2, GFP_KERNEL); + if (!mm2) { + kfree(mm1); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + uresp.qpid = qhp->wq.qpid; + uresp.size_log2 = qhp->wq.size_log2; + uresp.sq_size_log2 = qhp->wq.sq_size_log2; + uresp.rq_size_log2 = qhp->wq.rq_size_log2; + uresp.physaddr = virt_to_phys(qhp->wq.queue); + uresp.doorbell = qhp->wq.udb; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm1); + kfree(mm2); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + mm1->addr = uresp.physaddr; + mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr)); + insert_mmap(ucontext, mm1); + mm2->addr = uresp.doorbell & PAGE_MASK; + mm2->len = PAGE_SIZE; + insert_mmap(ucontext, mm2); + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("%s sq_num_entries %d, rq_num_entries %d " + "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n", + __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries, + qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* 
Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? + (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn); + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s ibdev %p, port %d, index %d, gid %p\n", + __FUNCTION__, ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + + dev 
= to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + 
lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << 
IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + 
dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + return 0; +bail2: + ib_unregister_device(&dev->ibdev); +bail1: + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..4d98886 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,363 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 +}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u32 scq; + u32 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. 
*/ + u8 enable_bind; + u8 enable_mmid0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If the current state is specified, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u32 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +struct iwch_ucontext { + struct ib_ucontext ibucontext; + struct cxio_ucontext uctx; + spinlock_t mmap_lock; + struct list_head mmaps; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +struct iwch_mm_entry { + struct list_head entry; + u64 addr; + unsigned len; +}; + +static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext, + u64 addr, unsigned len) +{ + struct list_head *pos, *nxt; + struct iwch_mm_entry *mm; + + spin_lock_irq(&ucontext->mmap_lock); + list_for_each_safe(pos, nxt, &ucontext->mmaps) { + + mm = list_entry(pos, struct iwch_mm_entry, entry); + if (mm->addr == addr && mm->len == len) { + list_del_init(&mm->entry); + spin_unlock_irq(&ucontext->mmap_lock); + PDBG("%s addr 0x%llx len
%d\n", __FUNCTION__, mm->addr, + mm->len); + return mm; + } + } + spin_unlock_irq(&ucontext->mmap_lock); + return NULL; +} + +static inline void insert_mmap(struct iwch_ucontext *ucontext, + struct iwch_mm_entry *mm) +{ + spin_lock_irq(&ucontext->mmap_lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len); + list_add_tail(&mm->entry, &ucontext->mmaps); + spin_unlock_irq(&ucontext->mmap_lock); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + 
IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_mmid_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg); +int iwch_register_device(struct iwch_dev *dev); +void iwch_unregister_device(struct iwch_dev *dev); +int iwch_quiesce_qps(struct iwch_cq *chp); +int iwch_resume_qps(struct iwch_cq *chp); +void stop_read_rep_timer(struct iwch_qp *qhp); +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 
*page_list); +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages); +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list); + + +#define IWCH_NODE_DESC "cxgb3 Chelsio Communications" + +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..4e4b9c9 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u64 physaddr; + __u32 cqid; + __u32 size_log2; +}; + +struct iwch_create_qp_resp { + __u64 physaddr; + __u64 doorbell; + __u32 qpid; + __u32 size_log2; + __u32 sq_size_log2; + __u32 rq_size_log2; +}; + +struct iwch_reg_user_mr_resp { + __u32 pbl_addr; +}; + +struct iwch_req_notify_cq { + __u32 rptr; +}; +#endif From swise at opengridcomputing.com Sun Dec 10 14:34:45 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:34:45 -0600 Subject: [openib-general] [PATCH v3 04/13] Connection Manager In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223445.27166.65471.stgit@dell3.ogc.int> This code implements the iWARP CM provider methods for the Chelsio driver. The Chelsio ULLD is used to set up and tear down TCP connections, and the T3 RDMA Core is used to move the connections in and out of RDMA mode.
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2059 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 223 ++++ drivers/infiniband/hw/cxgb3/tcb.h | 603 ++++++++++ 3 files changed, 2885 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..4d5df00 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2059 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "tcb.h" +#include "cxgb3_offload.h" +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. (default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static int snd_win = 512 * 1024; +module_param(snd_win, int, 0444); +MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)"); + +static unsigned int nocong = 1; +module_param(nocong, uint, 0444); +MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)"); + +static void process_work(struct work_struct *work); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work); + +static struct sk_buff_head rxq; +static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, 
ep); + if (timer_pending(&ep->timer)) { + PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + } else + get_ep(&ep->com); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + put_ep(&ep->com); +} + +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) +{ + struct cpl_tid_release *req; + + skb = get_skb(skb, sizeof *req, GFP_KERNEL); + if (!skb) + return; + req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); + skb->priority = CPL_PRIORITY_SETUP; + tdev->send(tdev, skb); + return; +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, 
ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) + ep->emss -= 12; + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +static int state_comp_exch(struct iwch_ep_common *epc, + enum iwch_ep_state comp, + enum iwch_ep_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&epc->lock, flags); + ret = (epc->state == comp); + if (ret) + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return ret; +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], + states[new]); + epc->state = new; + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + kref_init(&epc->kref); + spin_lock_init(&epc->lock); + init_waitqueue_head(&epc->waitq); + } + PDBG("%s alloc ep %p\n", __FUNCTION__, epc); + return (void *) epc; +} + +void __free_ep(struct kref *kref) +{ + struct iwch_ep_common *epc; + epc = container_of(kref, struct iwch_ep_common, kref); + PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]); + kfree(epc); +} + +static void release_ep_resources(struct iwch_ep 
*ep) +{ + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + state_set(&ep->com, DEAD); + cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, ep->hwtid, NULL); + put_ep(&ep->com); +} + +static void process_work(struct work_struct *work) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + put_ep((struct iwch_ep_common *)ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip, + __be32 peer_ip, __be16 local_port, + __be16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + if (ip_route_output_flow(&rt, &fl, NULL, 0)) + return NULL; + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + req->cmd = CPL_ABORT_NO_RST; + cxgb3_ofld_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + 
return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + cxgb3_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) +{ + PDBG("%s ep %p\n", __FILE__, ep); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, GFP_KERNEL); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 
0, sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p tid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, + ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + if (state_read(&ep->parent_ep->com) != DEAD) + 
ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + put_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) + return; + state_set(&ep->com, FPDU_MODE); + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * copy the new data into our accumulation buffer. 
+ */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * if we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 
1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) + goto out; +err: + abort_connection(ep, skb); +out: + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) + return; + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA Header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb); + return; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s Unexpected streaming data." 
+ " ep %p state %d tid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* + * The ep will timeout and inform the ULP of the failure. + * See ep_timeout(). + */ + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us its just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + + if (credits == 0) + return CPL_RET_BUF_DONE; + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (!ep->com.rpl_err) { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + close_complete_upcall(ep); + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void 
*ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status, + status2errno(rpl->status)); + connect_reply_upcall(ep, status2errno(rpl->status)); + state_set(&ep->com, DEAD); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, GET_TID(rpl), NULL); + cxgb3_free_atid(ep->com.tdev, ep->atid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, + rpl->status, status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, 
sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip, + struct 
sk_buff *skb) +{ + PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, + peer_ip); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(struct cpl_tid_release)); + skb_get(skb); + + if (tdev->type == T3B) + release_tid(tdev, hwtid, skb); + else { + struct cpl_pass_accept_rpl *rpl; + + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, + hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); + } +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + get_ep(&parent_ep->com); + child_ep->parent_ep = parent_ep; + child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, 
void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + int ret; + int abort = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + dst_confirm(ep->dst); + switch (state_read(&ep->com)) { + case MPA_REQ_WAIT: + state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. + */ + state_set(&ep->com, CLOSING); + get_ep(&ep->com); + break; + case MPA_REP_SENT: + state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + state_set(&ep->com, CLOSING); + peer_close_upcall(ep); + attrs.next_state = IWCH_QP_STATE_CLOSING; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD "%s - qp <- closing err!\n", + __FUNCTION__); + abort = 1; + } + break; + case ABORTING: + goto out; + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + goto out; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + goto out; + case DEAD: + goto out; + default: + BUG_ON(1); + } + iwch_ep_disconnect(ep, abort, GFP_KERNEL); +out: + return CPL_RET_BUF_DONE; +} + +/* + 
* Returns whether an ABORT_REQ_RSS message is a negative advice. + */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + get_ep(&ep->com); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + switch (state_read(&ep->com)) { + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + break; + case DEAD: + default: + BUG_ON(1); + break; + } 
+ + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumer consumption. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_rdma_ec_status *rep = cplhdr(skb); + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, + rep->status); + if (rep->status) { + struct iwch_qp_attributes attrs; + + printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n", + __FUNCTION__, ep->hwtid); + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + abort_connection(ep, NULL); + } + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) { + struct sk_buff *skb; + + connect_reply_upcall(ep, -ETIMEDOUT); + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + if 
(state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) { + struct sk_buff *skb; + + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) { + struct sk_buff *skb; + + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + put_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) + abort_connection(ep, NULL); + else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_qp *qp = get_qhp(h, conn_param->qpn); + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + BUG_ON(!qp); + + if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) || + (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) { + abort_connection(ep, NULL); + return -EINVAL; + } + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = qp; + + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird %d ord %d\n", __FUNCTION__, 
__LINE__, ep->ird, ep->ord); + get_ep(&ep->com); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + put_ep(&ep->com); + return err; + } + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + } else { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + put_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, + ep->com.qp, cm_id); + + /* + * Allocate an active TID to initiate a TCP connection. 
+ */ + ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, + ep->dst->neighbour, + ep->dst->neighbour->dev->if_port); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) + goto out; + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + cxgb3_free_atid(ep->com.tdev, ep->atid); +fail2: + put_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + + might_sleep(); + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) + goto fail3; + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + cxgb3_free_stid(ep->com.tdev, ep->stid); +fail2: + put_ep(&ep->com); +fail1: +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + cxgb3_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + put_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret=0; + int state; + + + state = state_read(&ep->com); + PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, + states[state], abrupt); + if (state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + return 0; + } + if (abrupt) { + if (state != ABORTING) { + state_set(&ep->com, ABORTING); + ret = send_abort(ep, NULL, gfp); + } + } else { + + if (state != CLOSING) + state_set(&ep->com, CLOSING); + else { + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + } + + ret = send_halfclose(ep, gfp); + } + return ret; +} + +int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, + l2t); + dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + ep->dst = new; + 
return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + get_ep(epc); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..893f9d0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,223 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include +#include + +#include +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! 
*/ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define put_ep(ep) { \ + PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_put(&((ep)->kref), __free_ep); \ +} + +#define get_ep(ep) { \ + PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_get(&((ep)->kref)); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + __be16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + __be16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_layers_types { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct 
iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + struct kref kref; + spinlock_t lock; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535< References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223515.27166.60256.stgit@dell3.ogc.int> Code to manipulate the QP. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++ 1 files changed, 1007 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..9f6b251 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,1007 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + u32 plen; + + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved[0] = 0; + wqe->send.reserved[1] = 0; + wqe->send.reserved[2] = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = __constant_cpu_to_be32(0); + wqe->send.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 5; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + wqe->send.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int i; + u32 plen; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->write.rdmaop = T3_RDMA_WRITE; + wqe->write.reserved[0] = 0; + wqe->write.reserved[1] = 0; + wqe->write.reserved[2] = 0; + 
wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey); + wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr); + + if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + plen = 4; + wqe->write.sgl[0].stag = wr->imm_data; + wqe->write.sgl[0].len = __constant_cpu_to_be32(0); + wqe->write.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 6; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->write.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->write.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->write.sgl[i].to = + cpu_to_be64(wr->sg_list[i].addr); + } + wqe->write.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 5 + ((wr->num_sge) << 1); + } + wqe->write.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + if (wr->num_sge > 1) + return -EINVAL; + wqe->read.rdmaop = T3_READ_REQ; + wqe->read.reserved[0] = 0; + wqe->read.reserved[1] = 0; + wqe->read.reserved[2] = 0; + wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr); + wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey); + wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length); + wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr); + *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3; + return 0; +} + +/* + * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. 
+ */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (!mhp->attr.state) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (mhp->attr.zbva) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + + if (sg_list[i].addr < mhp->attr.va_fbo) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = ((mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3) + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = 
cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapter's address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + sqp = qhp->wq.sq + + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* T3 reads are always signaled */ + err = iwch_build_rdma_read(wqe, wr, 
&t3_wr_flit_cnt); + if (err) + break; + sqp->read_len = wqe->read.local_len; + if (!qhp->wq.oldest_read) + qhp->wq.oldest_read = sqp; + break; + default: + PDBG("%s post of type=%d TBD!\n", __FUNCTION__, + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp->wr_id = wr->wr_id; + sqp->opcode = wr2opcode(t3_wr_opcode); + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED); + + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", + __FUNCTION__, wr->wr_id, idx, + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), + sqp->opcode); + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); 
+ PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x " + "wqe %p \n", __FUNCTION__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + unsigned long flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp = 
qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + sqp->wr_id = mw_bind->wr_id; + sqp->opcode = T3_BIND_MW; + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED); + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + 
*ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + union t3_wr *wqe; + struct terminate_message *term; + int status; + int tagged = 0; + struct sk_buff *skb; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + skb = alloc_skb(40, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (union t3_wr *)skb_put(skb, 40); + memset(wqe, 0, 40); + wqe->send.rdmaop = T3_TERMINATE; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + if (rsp_msg) { + status = CQE_STATUS(rsp_msg->cqe); + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + } else { + status = TPT_ERR_INTERNAL_ERR; + } + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, + qhp->ep->hwtid, 5); + skb->priority = CPL_PRIORITY_DATA; + return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb)); +} + +/* + * Assumes qhp lock is held. + */ +static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + struct iwch_cq *rchp, *schp; + int count; + + rchp = get_chp(qhp->rhp, qhp->attr.rcq); + schp = get_chp(qhp->rhp, qhp->attr.scq); + + PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp); + /* take a ref on the qhp since we must release the lock */ + atomic_inc(&qhp->refcnt); + spin_unlock_irqrestore(&qhp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. */ + spin_lock_irqsave(&rchp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&rchp->cq); + cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); + cxio_flush_rq(&qhp->wq, &rchp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&rchp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. 
*/ + spin_lock_irqsave(&schp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&schp->cq); + cxio_count_scqes(&schp->cq, &qhp->wq, &count); + cxio_flush_sq(&qhp->wq, &schp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&schp->lock, *flag); + + /* deref */ + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); + + spin_lock_irqsave(&qhp->lock, *flag); +} + +static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + if (t3b_device(qhp->rhp)) + cxio_set_wq_in_error(&qhp->wq); + else + __flush_qp(qhp, flag); +} + + +/* + * Return non zero if at least one RECV was pre-posted. + */ +static inline int rqes_posted(struct iwch_qp *qhp) +{ + return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV); +} + +static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs) +{ + struct t3_rdma_init_attr init_attr; + int ret; + + init_attr.tid = qhp->ep->hwtid; + init_attr.qpid = qhp->wq.qpid; + init_attr.pdid = qhp->attr.pd; + init_attr.scqid = qhp->attr.scq; + init_attr.rcqid = qhp->attr.rcq; + init_attr.rq_addr = qhp->wq.rq_addr; + init_attr.rq_size = 1 << qhp->wq.rq_size_log2; + init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | + qhp->attr.mpa_attr.recv_marker_enabled | + (qhp->attr.mpa_attr.xmit_marker_enabled << 1) | + (qhp->attr.mpa_attr.crc_enabled << 2); + + /* + * XXX - The IWCM doesn't quite handle getting these + * attrs set before going into RTS. For now, just turn + * them on always... 
+ */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0; + PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d " + "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, + init_attr.rq_addr, init_attr.rq_size, + init_attr.flags, init_attr.qpcaps); + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + unsigned long flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, + qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? 
attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. 
+ */ + get_ep(&qhp->ep->com); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + put_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + __FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); 
+ + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); + flush_qp(qhp, &flag); +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + put_ep(&ep->com); + + PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) { + resume_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} From swise at opengridcomputing.com Sun Dec 10 14:35:45 2006 From: swise at opengridcomputing.com (Steve 
Wise) Date: Sun, 10 Dec 2006 16:35:45 -0600 Subject: [openib-general] [PATCH v3 06/13] Completion Queues In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223545.27166.81531.stgit@dell3.ogc.int> Functions to manipulate CQs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cq.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..9d82df4 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. + * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (t3a_device(chp->rhp) && credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + wc->vendor_err = CQE_STATUS(cqe); + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x " + "lo 0x%x cookie 0x%llx\n", __FUNCTION__, + CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case 
T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + printk(KERN_ERR MOD "Unexpected opcode %d " + "in the CQE received for QPID=0x%0x\n", + CQE_OPCODE(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + goto out; + } + } + + if (cqe_flushed) + wc->status = IB_WC_WR_FLUSH_ERR; + else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + case TPT_ERR_SWFLUSH: + wc->status = IB_WC_WR_FLUSH_ERR; + break; + default: + printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for " + "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * 
Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. + */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} From swise at opengridcomputing.com Sun Dec 10 14:36:15 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:36:15 -0600 Subject: [openib-general] [PATCH v3 07/13] Async Event Handler In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223615.27166.4800.stgit@dell3.ogc.int> Code to handle async events coming from the T3 RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..b0bd014 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe), + CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + if ((qhp->attr.state == IWCH_QP_STATE_ERROR) || + (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) { + PDBG("%s AE received after RTS - " + "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, + qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_TERMINATE; + iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (send_term) + iwch_post_terminate(qhp, rsp_msg); + } + + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t 
*) skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + u32 cqid = RSPQ_CQID(rsp_msg); + + rnicp = (struct iwch_dev *) rdev_p->ulp; + spin_lock(&rnicp->lock); + chp = get_chp(rnicp, cqid); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + if (!chp || !qhp) { + printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d " + "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", + cqid, CQE_QPID(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), + CQE_WRID_LOW(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + /* + * 1) completion of our sending a TERMINATE. + * 2) incoming TERMINATE message. + */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s QPID 0x%x ep %p disconnecting\n", + __FUNCTION__, qhp->wq.qpid, qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, + qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", + CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Sun Dec 10 14:36:45 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:36:45 -0600 Subject: [openib-general] [PATCH v3 08/13] Memory Registration 
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223645.27166.44081.stgit@dell3.ogc.int> Functions to register memory regions. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 170 ++++++++++++++++++++++++++++++++ 1 files changed, 170 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..774d11e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list) +{ + u32 stag; + u32 mmid; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages) +{ + u32 stag; + u32 mmid; + + + /* We could support this... 
*/ + if (npages > mhp->attr.pbl_size) + return -ENOMEM; + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ + for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) + return -EINVAL; + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) + return -ENOMEM; + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << 
*shift)); + + PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + + return 0; + +} From swise at opengridcomputing.com Sun Dec 10 14:37:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:37:16 -0600 Subject: [openib-general] [PATCH v3 09/13] Core WQE/CQE Types In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223715.27166.81773.stgit@dell3.ogc.int> T3 WQE and CQE structures, defines, etc... Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 685 ++++++++++++++++++++++++++++ 1 files changed, 685 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..45870be --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,685 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + __be32 stag; + __be32 len; + __be64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be32 plen; /* 3 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 stag; /* 2 */ + __be32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh wrh; 
/* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 stag_sink; + __be64 to_sink; /* 3 */ + __be32 plen; /* 4 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be64 rem_to; /* 3 */ + __be32 local_stag; /* 4 */ + __be32 local_len; + __be64 local_to; /* 5 */ +}; + +enum t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u16 reserved; /* 2 */ + u8 type; + u8 perms; + __be32 mr_stag; + __be32 mw_stag; /* 3 */ + __be32 mw_len; + __be64 mw_va; /* 4 */ + __be32 mr_pbl_addr; /* 5 */ + u8 reserved2[3]; + u8 mr_pagesz; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + __be32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + __be32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 flags; /* 2 */ + __be32 quiesce; /* 2 */ + __be32 max_ird; /* 3 */ + __be32 max_ord; /* 3 */ + __be64 sge_cmd; /* 4 */ + __be64 ctx1; /* 5 */ + __be64 ctx0; /* 6 */ +}; + +enum t3_modify_qp_flags { + MODQP_QUIESCE = 0x01, + MODQP_MAX_IRD = 0x02, + MODQP_MAX_ORD = 0x04, + MODQP_WRITE_EC = 0x08, + MODQP_READ_EC = 0x10, +}; + + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + uP_RI_QP_RDMA_READ_ENABLE = 0x01, 
+ uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u32 flags; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 qpid; /* 2 */ + __be32 pdid; + __be32 scqid; /* 3 */ + __be32 rcqid; + __be32 rq_addr; /* 4 */ + __be32 rq_size; + u8 mpaattrs; /* 5 */ + u8 qpcaps; + __be16 ulpdu_size; + __be32 flags; /* bits 31-1 - reserved */ + /* bit 0 - set if RECV posted */ + __be32 ord; /* 6 */ + __be32 ird; + __be64 qp_dma_addr; /* 7 */ + __be32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +struct t3_genbit { + u64 flit[15]; + __be64 genbit; +}; + +enum rdma_init_wr_flags { + RECVS_POSTED = 1, +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + struct t3_genbit genbit; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) +{ + return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags)); +} + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit... 
*/ + ((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + __be32 valid_stag_pdid; + __be32 flags_pagesize_qpid; + + __be32 rsvd_pbl_addr; + __be32 len; + __be32 va_hi; + __be32 va_low_or_fbo; + + __be32 rsvd_bind_cnt_or_pstag; + __be32 rsvd_pbl_size; +}; + +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 +#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << 
S_TPT_MW_BIND_ENABLE) +#define F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + __be32 header; + __be32 len; + union { + struct { + __be32 stag; + __be32 msn; + } rcqe; + struct { + u32 wrid_hi; + u32 wrid_low; + } scqe; + } u; +}; + +#define S_CQE_OOO 31 +#define M_CQE_OOO 0x1 +#define G_CQE_OOO(x) ((((x) >> S_CQE_OOO)) & M_CQE_OOO) +#define V_CEQ_OOO(x) ((x)<> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<queue->flit[13] = 1; +} + +static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if 
(!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + return NULL; +} + +static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +#endif From swise at opengridcomputing.com Sun Dec 10 14:37:46 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:37:46 -0600 Subject: [openib-general] [PATCH v3 10/13] Core HAL In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223746.27166.57624.stgit@dell3.ogc.int> The RDMA Core interfaces with the T3 HW and ULLD providing a low level RDMA interface. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_hal.h | 201 ++++ 2 files changed, 1503 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c new file mode 100644 index 0000000..ffc4ec0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -0,0 +1,1302 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include + +#include +#include +#include +#include + +#include "cxio_resource.h" +#include "cxio_hal.h" +#include "cxgb3_offload.h" +#include "sge_defs.h" + +static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC]; +static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL; + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (!strcmp(rdev_tbl[i]->dev_name, dev_name)) + return rdev_tbl[i]; + return NULL; +} + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev + *tdev) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (rdev_tbl[i]->t3cdev_p == tdev) + return rdev_tbl[i]; + return NULL; +} + +static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (!rdev_tbl[i]) { + rdev_tbl[i] = rdev_p; + break; + } + return (i == T3_MAX_NUM_RNIC); +} + +static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i] == rdev_p) { + rdev_tbl[i] = NULL; + break; + } +} + +int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit) +{ + int ret; + struct t3_cqe *cqe; + u32 rptr; + + struct rdma_cq_op setup; + setup.id = cq->cqid; + setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0; + setup.op = op; + ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup); + + if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) + return ret; + + /* + * If the rearm returned an index other than our current index, + * then there might be CQE's in flight (being DMA'd). We must wait + * here for them to complete or the consumer can miss a notification. + */ + if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) { + int i=0; + + rptr = cq->rptr; + + /* + * Keep the generation correct by bumping rptr until it + * matches the index returned by the rearm - 1. 
+ */ + while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret) + rptr++; + + /* + * Now rptr is the index for the (last) cqe that was + * in-flight at the time the HW rearmed the CQ. We + * spin until that CQE is valid. + */ + cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2); + while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) { + udelay(1); + if (i++ > 1000000) { + BUG_ON(1); + printk(KERN_ERR "%s: stalled rnic\n", + rdev_p->dev_name); + return -EIO; + } + } + } + return 0; +} + +static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid) +{ + struct rdma_cq_setup setup; + setup.id = cqid; + setup.base_addr = 0; /* NULL address */ + setup.size = 0; /* disable the CQ */ + setup.credits = 0; + setup.credit_thres = 0; + setup.ovfl_mode = 0; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) +{ + u64 sge_cmd; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = qpid << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe); + + cq->cqid = cxio_hal_get_cqid(rdev_p->rscp); + if (!cq->cqid) + return -ENOMEM; + cq->sw_queue = kzalloc(size, GFP_KERNEL); + if (!cq->sw_queue) + return -ENOMEM; + cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) * + sizeof(struct t3_cqe), + &(cq->dma_addr), GFP_KERNEL); + if (!cq->queue) { + kfree(cq->sw_queue); + return 
-ENOMEM; + } + pci_unmap_addr_set(cq, mapping, cq->dma_addr); + memset(cq->queue, 0, size); + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = 65535; + setup.credit_thres = 1; + if (rdev_p->t3cdev_p->type == T3B) + setup.ovfl_mode = 0; + else + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = setup.size; + setup.credit_thres = setup.size; /* TBD: overflow recovery */ + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + u32 qpid; + int i; + + mutex_lock(&uctx->lock); + if (!list_empty(&uctx->qpids)) { + entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, + entry); + list_del(&entry->entry); + qpid = entry->qpid; + kfree(entry); + } else { + qpid = cxio_hal_get_qpid(rdev_p->rscp); + if (!qpid) + goto out; + for (i = qpid+1; i & rdev_p->qpmask; i++) { + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + break; + entry->qpid = i; + list_add_tail(&entry->entry, &uctx->qpids); + } + } +out: + mutex_unlock(&uctx->lock); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, + struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + return; + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + entry->qpid = qpid; + mutex_lock(&uctx->lock); + list_add_tail(&entry->entry, &uctx->qpids); + mutex_unlock(&uctx->lock); +} + +void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct list_head *pos, *nxt; + 
struct cxio_qpid_list *entry; + + mutex_lock(&uctx->lock); + list_for_each_safe(pos, nxt, &uctx->qpids) { + entry = list_entry(pos, struct cxio_qpid_list, entry); + list_del_init(&entry->entry); + if (!(entry->qpid & rdev_p->qpmask)) + cxio_hal_put_qpid(rdev_p->rscp, entry->qpid); + kfree(entry); + } + mutex_unlock(&uctx->lock); +} + +void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + INIT_LIST_HEAD(&uctx->qpids); + mutex_init(&uctx->lock); +} + +int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain, + struct t3_wq *wq, struct cxio_ucontext *uctx) +{ + int depth = 1UL << wq->size_log2; + int rqsize = 1UL << wq->rq_size_log2; + + wq->qpid = get_qpid(rdev_p, uctx); + if (!wq->qpid) + return -ENOMEM; + + wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL); + if (!wq->rq) + goto err1; + + wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize); + if (!wq->rq_addr) + goto err2; + + wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL); + if (!wq->sq) + goto err3; + + wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + depth * sizeof(union t3_wr), + &(wq->dma_addr), GFP_KERNEL); + if (!wq->queue) + goto err4; + + memset(wq->queue, 0, depth * sizeof(union t3_wr)); + pci_unmap_addr_set(wq, mapping, wq->dma_addr); + wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + if (!kernel_domain) + wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + + (wq->qpid << rdev_p->qpshift); + PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, + wq->qpid, wq->doorbell, wq->udb); + return 0; +err4: + kfree(wq->sq); +err3: + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize); +err2: + kfree(wq->rq); +err1: + put_qpid(rdev_p, wq->qpid, uctx); + return -ENOMEM; +} + +int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + int err; + err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid); + kfree(cq->sw_queue); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) + * sizeof(struct t3_cqe), 
cq->queue, + pci_unmap_addr(cq, mapping)); + cxio_hal_put_cqid(rdev_p->rscp, cq->cqid); + return err; +} + +int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct cxio_ucontext *uctx) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (wq->size_log2)) + * sizeof(union t3_wr), wq->queue, + pci_unmap_addr(wq, mapping)); + kfree(wq->sq); + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2)); + kfree(wq->rq); + put_qpid(rdev_p, wq->qpid, uctx); + return 0; +} + +static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(T3_SEND) | + V_CQE_TYPE(0) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +{ + u32 ptr; + + PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq); + + /* flush RQ */ + PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, + wq->rq_rptr, wq->rq_wptr, count); + ptr = wq->rq_rptr + count; + while (ptr++ != wq->rq_wptr) + insert_recv_cqe(wq, cq); +} + +static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, + struct t3_swsq *sqp) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(sqp->opcode) | + V_CQE_TYPE(1) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + cqe.u.scqe.wrid_hi = sqp->sq_wptr; + + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_sq(struct t3_wq *wq, 
struct t3_cq *cq, int count) +{ + __u32 ptr; + struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); + + ptr = wq->sq_rptr + count; + sqp += count; + while (ptr != wq->sq_wptr) { + insert_sq_cqe(wq, cq, sqp); + sqp++; + ptr++; + } +} + +/* + * Move all CQEs from the HWCQ into the SWCQ. + */ +void cxio_flush_hw_cq(struct t3_cq *cq) +{ + struct t3_cqe *cqe, *swcqe; + + PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid); + cqe = cxio_next_hw_cqe(cq); + while (cqe) { + PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", + __FUNCTION__, cq->rptr, cq->sw_wptr); + swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2); + *swcqe = *cqe; + swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1)); + cq->sw_wptr++; + cq->rptr++; + cqe = cxio_next_hw_cqe(cq); + } +} + +static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq) +{ + if (CQE_OPCODE(*cqe) == T3_TERMINATE) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) + return 0; + + return 1; +} + +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && + (CQE_QPID(*cqe) == wq->qpid)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + PDBG("%s count zero %d\n", __FUNCTION__, *count); + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && + (CQE_QPID(*cqe) == wq->qpid) && 
cqe_completes_wr(cqe, wq)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p) +{ + struct rdma_cq_setup setup; + setup.id = 0; + setup.base_addr = 0; /* NULL address */ + setup.size = 1; /* enable the CQ */ + setup.credits = 0; + + /* force SGE to redirect to RspQ and interrupt */ + setup.credit_thres = 0; + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) +{ + int err; + u64 sge_cmd, ctx0, ctx1; + u64 base_addr; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + + + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + err = cxio_hal_init_ctrl_cq(rdev_p); + if (err) { + PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err); + return err; + } + rdev_p->ctrl_qp.workq = dma_alloc_coherent( + &(rdev_p->rnic_info.pdev->dev), + (1 << T3_CTRL_QP_SIZE_LOG2) * + sizeof(union t3_wr), + &(rdev_p->ctrl_qp.dma_addr), + GFP_KERNEL); + if (!rdev_p->ctrl_qp.workq) { + PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__); + return -ENOMEM; + } + pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, + rdev_p->ctrl_qp.dma_addr); + rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + memset(rdev_p->ctrl_qp.workq, 0, + (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr)); + + init_MUTEX(&rdev_p->ctrl_qp.sem); + init_waitqueue_head(&rdev_p->ctrl_qp.waitq); + + /* update HW Ctrl QP context */ + base_addr = rdev_p->ctrl_qp.dma_addr; + base_addr >>= 12; + ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) | + V_EC_BASE_LO((u32) base_addr & 0xffff)); + ctx0 <<= 32; + ctx0 |= V_EC_CREDITS(FW_WR_NUM); + base_addr >>= 16; + ctx1 = (u32) base_addr; + base_addr >>= 32; + ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) | + V_EC_TYPE(0) | V_EC_GEN(1) | + V_EC_UP_TOKEN(T3_CTL_QP_TID) | 
F_EC_VALID)) << 32; + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1, + T3_CTL_QP_TID, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + wqe->ctx1 = cpu_to_be64(ctx1); + wqe->ctx0 = cpu_to_be64(ctx0); + PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n", + (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, + 1 << T3_CTRL_QP_SIZE_LOG2); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << T3_CTRL_QP_SIZE_LOG2) + * sizeof(union t3_wr), rdev_p->ctrl_qp.workq, + pci_unmap_addr(&rdev_p->ctrl_qp, mapping)); + return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID); +} + +/* write len bytes of data into addr (32B aligned address) + * If data is NULL, clear len bytes of memory to zero. + * caller acquires the sem before the call + */ +static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, + u32 len, void *data, int completion) +{ + u32 i, nr_wqe, copy_len; + u8 *copy_data; + u8 wr_len, utx_len; /* length in 8 byte flit */ + enum t3_wr_flags flag; + __be64 *wqe; + u64 utx_cmd; + addr &= 0x7FFFFFF; + nr_wqe = len % 96 ? 
len / 96 + 1 : len / 96; /* 96B max per WQE */ + PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n", + __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, + nr_wqe, data, addr); + utx_len = 3; /* in 32B unit */ + for (i = 0; i < nr_wqe; i++) { + if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2)) { + PDBG("%s ctrl_qp full wtpr 0x%0x rptr 0x%0x, " + "wait for more space i %d\n", __FUNCTION__, + rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i); + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + !Q_FULL(rdev_p->ctrl_qp.rptr, + rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2))) { + PDBG("%s ctrl_qp workq interrupted\n", + __FUNCTION__); + return -ERESTARTSYS; + } + PDBG("%s ctrl_qp wakeup, continue posting work request " + "i %d\n", __FUNCTION__, i); + } + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + flag = 0; + if (i == (nr_wqe - 1)) { + /* last WQE */ + flag = completion ? T3_COMPLETION_FLAG : 0; + if (len % 32) + utx_len = len / 32 + 1; + else + utx_len = len / 32; + } + + /* + * Force a CQE to return the credit to the workq in case + * we posted more than half the max QP size of WRs + */ + if ((i != 0) && + (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) { + flag = T3_COMPLETION_FLAG; + PDBG("%s force completion at i %d\n", __FUNCTION__, i); + } + + /* build the utx mem command */ + wqe += (sizeof(struct t3_bypass_wr) >> 3); + utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3); + utx_cmd <<= 32; + utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1); + *wqe = cpu_to_be64(utx_cmd); + wqe++; + copy_data = (u8 *) data + i * 96; + copy_len = len > 96 ? 
96 : len; + + /* clear memory content if data is NULL */ + if (data) + memcpy(wqe, copy_data, copy_len); + else + memset(wqe, 0, copy_len); + if (copy_len % 32) + memset(((u8 *) wqe) + copy_len, 0, + 32 - (copy_len % 32)); + wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + + (utx_len << 2); + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + + /* wptr in the WRID[31:0] */ + ((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr; + + /* + * This must be the last write with a memory barrier + * for the genbit + */ + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, + Q_GENBIT(rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, + wr_len); + if (flag == T3_COMPLETION_FLAG) + ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); + len -= 96; + rdev_p->ctrl_qp.wptr++; + } + return 0; +} + +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size + * OUT: stag index, actual pbl_size, pbl_addr allocated. + * TBD: shared memory region support + */ +static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, + u32 *stag, u8 stag_state, u32 pdid, + enum tpt_mem_type type, enum tpt_mem_perm perm, + u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl, + u32 *pbl_size, u32 *pbl_addr) +{ + int err; + struct tpt_entry tpt; + u32 stag_idx; + u32 wptr; + int rereg = (*stag != T3_STAG_UNSET); + + stag_state = stag_state > 0; + stag_idx = (*stag) >> 8; + + if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) { + stag_idx = cxio_hal_get_stag(rdev_p->rscp); + if (!stag_idx) + return -ENOMEM; + *stag = (stag_idx << 8) | ((*stag) & 0xFF); + } + PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", + __FUNCTION__, stag_state, type, pdid, stag_idx); + + if (reset_tpt_entry) + cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3); + else if (!rereg) { + *pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3); + if (!*pbl_addr) { + return -ENOMEM; + } + } + + 
down_interruptible(&rdev_p->ctrl_qp.sem); + + /* write PBL first if any - update pbl only if pbl list exists */ + if (pbl) { + + PDBG("%s *pbl_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", + __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, + *pbl_size); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + (*pbl_addr >> 5), + (*pbl_size << 3), pbl, 0); + if (err) + goto ret; + } + + /* write TPT entry */ + if (reset_tpt_entry) + memset(&tpt, 0, sizeof(tpt)); + else { + tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID | + V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) | + V_TPT_STAG_STATE(stag_state) | + V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid)); + BUG_ON(page_size >= 28); + tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | + F_TPT_MW_BIND_ENABLE | + V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | + V_TPT_PAGE_SIZE(page_size)); + tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3)); + tpt.len = cpu_to_be32(len); + tpt.va_hi = cpu_to_be32((u32) (to >> 32)); + tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); + tpt.rsvd_bind_cnt_or_pstag = 0; + tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)); + } + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + stag_idx + + (rdev_p->rnic_info.tpt_base >> 5), + sizeof(tpt), &tpt, 1); + + /* release the stag index to free pool */ + if (reset_tpt_entry) + cxio_hal_put_stag(rdev_p->rscp, stag_idx); +ret: + wptr = rdev_p->ctrl_qp.wptr; + up(&rdev_p->ctrl_qp.sem); + if (!err) + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + return err; +} + +/* IN : stag key, pdid, pbl_size + * Out: stag index, actual pbl_size, and pbl_addr allocated. 
+ */ +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr)); +} + +int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, + u32 pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + &pbl_size, &pbl_addr); +} + +int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + u32 pbl_size = 0; + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, + NULL, &pbl_size, NULL); +} + +int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) +{ + struct t3_rdma_init_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p); + wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe)); + wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT)); + wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) | + 
V_FW_RIWR_LEN(sizeof(*wqe) >> 3)); + wqe->wrid.id1 = 0; + wqe->qpid = cpu_to_be32(attr->qpid); + wqe->pdid = cpu_to_be32(attr->pdid); + wqe->scqid = cpu_to_be32(attr->scqid); + wqe->rcqid = cpu_to_be32(attr->rcqid); + wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base); + wqe->rq_size = cpu_to_be32(attr->rq_size); + wqe->mpaattrs = attr->mpaattrs; + wqe->qpcaps = attr->qpcaps; + wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss); + wqe->flags = cpu_to_be32(attr->flags); + wqe->ord = cpu_to_be32(attr->ord); + wqe->ird = cpu_to_be32(attr->ird); + wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); + wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); + wqe->rsvd = 0; + skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = ev_cb; +} + +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = NULL; +} + +static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb) +{ + static int cnt; + struct cxio_rdev *rdev_p = NULL; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x" + " se %0x notify %0x cqbranch %0x creditth %0x\n", + cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg), + RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg), + RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg), + RSPQ_CREDIT_THRESH(rsp_msg)); + PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d " + "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + rdev_p = (struct cxio_rdev *)t3cdev_p->ulp; + if (!rdev_p) { + PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__, + t3cdev_p); + 
return 0; + } + if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) { + rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1; + wake_up_interruptible(&rdev_p->ctrl_qp.waitq); + dev_kfree_skb_irq(skb); + } else if (CQE_QPID(rsp_msg->cqe) == 0xfff8) + dev_kfree_skb_irq(skb); + else if (cxio_ev_cb) + (*cxio_ev_cb) (rdev_p, skb); + else + dev_kfree_skb_irq(skb); + cnt++; + return 0; +} + +/* Caller takes care of locking if needed */ +int cxio_rdev_open(struct cxio_rdev *rdev_p) +{ + struct net_device *netdev_p = NULL; + int err = 0; + if (strlen(rdev_p->dev_name)) { + if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) { + return -EBUSY; + } + netdev_p = dev_get_by_name(rdev_p->dev_name); + if (!netdev_p) { + return -EINVAL; + } + dev_put(netdev_p); + } else if (rdev_p->t3cdev_p) { + if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) { + return -EBUSY; + } + netdev_p = rdev_p->t3cdev_p->lldev; + strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name, + T3_MAX_DEV_NAME_LEN); + } else { + PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__); + return -EINVAL; + } + + if (cxio_hal_add_rdev(rdev_p)) + return -ENOMEM; + + PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name); + memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp)); + if (!rdev_p->t3cdev_p) + rdev_p->t3cdev_p = T3CDEV(netdev_p); + rdev_p->t3cdev_p->ulp = (void *) rdev_p; + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS, + &(rdev_p->rnic_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS, + &(rdev_p->port_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + + /* + * qpshift is the number of bits to shift the qpid left in order + * to get the correct address of the doorbell for that qp. 
+ */ + cxio_init_ucontext(rdev_p, &rdev_p->uctx); + rdev_p->qpshift = PAGE_SHIFT - + ilog2(65536 >> + ilog2(rdev_p->rnic_info.udbell_len >> + PAGE_SHIFT)); + rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT; + rdev_p->qpmask = (65536 >> ilog2(rdev_p->qpnr)) - 1; + PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d " + "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", + __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, + rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), + rdev_p->rnic_info.pbl_base, + rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base, + rdev_p->rnic_info.rqt_top); + PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu " + "qpnr %d qpmask 0x%x\n", + rdev_p->rnic_info.udbell_len, + rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr, + rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask); + + err = cxio_hal_init_ctrl_qp(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", + __FUNCTION__, err); + goto err1; + } + err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0, + 0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ, + T3_MAX_NUM_PD); + if (err) { + printk(KERN_ERR "%s error %d initializing hal resources.\n", + __FUNCTION__, err); + goto err2; + } + err = cxio_hal_pblpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing pbl mem pool.\n", + __FUNCTION__, err); + goto err3; + } + err = cxio_hal_rqtpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing rqt mem pool.\n", + __FUNCTION__, err); + goto err4; + } + return 0; +err4: + cxio_hal_pblpool_destroy(rdev_p); +err3: + cxio_hal_destroy_resource(rdev_p->rscp); +err2: + cxio_hal_destroy_ctrl_qp(rdev_p); +err1: + cxio_hal_delete_rdev(rdev_p); + return err; +} + +void cxio_rdev_close(struct cxio_rdev *rdev_p) +{ + if (rdev_p) { + cxio_hal_pblpool_destroy(rdev_p); + cxio_hal_rqtpool_destroy(rdev_p); + cxio_hal_delete_rdev(rdev_p); + rdev_p->t3cdev_p->ulp = NULL; + 
cxio_hal_destroy_ctrl_qp(rdev_p); + cxio_hal_destroy_resource(rdev_p->rscp); + } +} + +int __init cxio_hal_init(void) +{ + if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI)) + return -ENOMEM; + memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *)); + t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler); + return 0; +} + +void __exit cxio_hal_exit(void) +{ + int i; + t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL); + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + cxio_rdev_close(rdev_tbl[i]); + cxio_hal_destroy_rhdl_resource(); +} + +static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_swsq *sqp; + __u32 ptr = wq->sq_rptr; + int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr); + + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + while (count--) + if (!sqp->signaled) { + ptr++; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + } else if (sqp->complete) { + + /* + * Insert this completed cqe into the swcq. + */ + PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n", + __FUNCTION__, Q_PTR2IDX(ptr, wq->sq_size_log2), + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)); + sqp->cqe.header |= htonl(V_CQE_SWCQE(1)); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) + = sqp->cqe; + cq->sw_wptr++; + sqp->signaled = 0; + break; + } else + break; +} + +static inline void create_read_req_cqe(struct t3_wq *wq, + struct t3_cqe *hw_cqe, + struct t3_cqe *read_cqe) +{ + read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr; + read_cqe->len = wq->oldest_read->read_len; + read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) | + V_CQE_SWCQE(SW_CQE(*hw_cqe)) | + V_CQE_OPCODE(T3_READ_REQ) | + V_CQE_TYPE(1)); +} + +/* + * Return a ptr to the next read wr in the SWSQ or NULL. 
+ */ +static inline void advance_oldest_read(struct t3_wq *wq) +{ + + u32 rptr = wq->oldest_read - wq->sq + 1; + u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2); + + while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) { + wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2); + + if (wq->oldest_read->opcode == T3_READ_REQ) + return; + rptr++; + } + wq->oldest_read = NULL; +} + +/* + * cxio_poll_cq + * + * Caller must: + * check the validity of the first CQE, + * supply the wq associated with the qpid. + * + * credit: cq credit to return to sge. + * cqe_flushed: 1 iff the CQE is flushed. + * cqe: copy of the polled CQE. + * + * return value: + * 0 CQE returned, + * -1 CQE skipped, try again. + */ +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit) +{ + int ret = 0; + struct t3_cqe *hw_cqe, read_cqe; + + *cqe_flushed = 0; + *credit = 0; + hw_cqe = cxio_next_cqe(cq); + + PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x" + " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), + CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), + CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), + CQE_WRID_LOW(*hw_cqe)); + + /* + * skip cqe's not affiliated with a QP. + */ + if (wq == NULL) { + ret = -1; + goto skip_cqe; + } + + /* + * Gotta tweak READ completions: + * 1) the cqe doesn't contain the sq_wptr from the wr. + * 2) opcode not reflected from the wr. + * 3) read_len not reflected from the wr. + * 4) cq_type is RQ_TYPE not SQ_TYPE. + */ + if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) { + + /* + * Don't write to the HWCQ, so create a new read req CQE + * in local memory. + */ + create_read_req_cqe(wq, hw_cqe, &read_cqe); + hw_cqe = &read_cqe; + advance_oldest_read(wq); + } + + /* + * T3A: Discard TERMINATE CQEs. 
+ */ + if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) { + ret = -1; + wq->error = 1; + goto skip_cqe; + } + + if (CQE_STATUS(*hw_cqe) || wq->error) { + *cqe_flushed = wq->error; + wq->error = 1; + + /* + * T3A inserts errors into the CQE. We cannot return + * these as work completions. + */ + /* incoming write failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) + && RQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + /* incoming read request failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + + /* incoming SEND with no receive posted failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + ret = -1; + goto skip_cqe; + } + goto proc_cqe; + } + + /* + * RECV completion. + */ + if (RQ_TYPE(*hw_cqe)) { + + /* + * HW only validates 4 bits of MSN. So we must validate that + * the MSN in the SEND is the next expected MSN. If it's not, + * then we complete this with TPT_ERR_MSN and mark the wq in + * error. + */ + if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) { + wq->error = 1; + hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN)); + goto proc_cqe; + } + goto proc_cqe; + } + + /* + * If we get here it's a send completion. + * + * Handle out of order completion. These get stuffed + * in the SW SQ. Then the SW SQ is walked to move any + * now in-order completions into the SW CQ. This handles + * 2 cases: + * 1) reaping unsignaled WRs when the first subsequent + * signaled WR is completed. + * 2) out of order read completions. 
+ */ + if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) { + struct t3_swsq *sqp; + + PDBG("%s out of order completion going in swsq at idx %ld\n", + __FUNCTION__, + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2)); + sqp = wq->sq + + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2); + sqp->cqe = *hw_cqe; + sqp->complete = 1; + ret = -1; + goto flush_wq; + } + +proc_cqe: + *cqe = *hw_cqe; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*hw_cqe)) { + wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe); + PDBG("%s completing sq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2)); + *cookie = (wq->sq + + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id; + wq->sq_rptr++; + } else { + PDBG("%s completing rq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + *cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + wq->rq_rptr++; + } + +flush_wq: + /* + * Flush any completed cqes that are now in-order. + */ + flush_completed_wrs(wq, cq); + +skip_cqe: + if (SW_CQE(*hw_cqe)) { + PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->sw_rptr); + ++cq->sw_rptr; + } else { + PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->rptr); + ++cq->rptr; + + /* + * T3A: compute credits. + */ + if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + return ret; +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h new file mode 100644 index 0000000..bde5cfb --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_HAL_H__ +#define __CXIO_HAL_H__ + +#include +#include + +#include "t3_cpl.h" +#include "t3cdev.h" +#include "cxgb3_ctl_defs.h" +#include "cxio_wr.h" + +#define T3_CTRL_QP_ID FW_RI_SGEEC_START +#define T3_CTL_QP_TID FW_RI_TID_START +#define T3_CTRL_QP_SIZE_LOG2 8 +#define T3_CTRL_CQ_ID 0 + +/* TBD */ +#define T3_MAX_NUM_RNIC 8 +#define T3_MAX_NUM_RI (1<<15) +#define T3_MAX_NUM_QP (1<<15) +#define T3_MAX_NUM_CQ (1<<15) +#define T3_MAX_NUM_PD (1<<15) +#define T3_MAX_PBL_SIZE 256 +#define T3_MAX_RQ_SIZE 1024 +#define T3_MAX_NUM_STAG (1<<15) + +#define T3_STAG_UNSET 0xffffffff + +#define T3_MAX_DEV_NAME_LEN 32 + +struct cxio_hal_ctrl_qp { + u32 wptr; + u32 rptr; + struct semaphore sem; /* for the wptr, can sleep */ + wait_queue_head_t waitq; /* wait for RspQ/CQE msg */ + union t3_wr *workq; /* the work request queue */ + dma_addr_t dma_addr; /* pci bus address of the workq */ + DECLARE_PCI_UNMAP_ADDR(mapping) + void __iomem *doorbell; +}; + +struct cxio_hal_resource { + struct kfifo *tpt_fifo; + spinlock_t tpt_fifo_lock; + struct kfifo *qpid_fifo; + spinlock_t qpid_fifo_lock; + struct kfifo *cqid_fifo; + spinlock_t cqid_fifo_lock; + struct kfifo *pdid_fifo; + spinlock_t pdid_fifo_lock; +}; + +struct cxio_qpid_list { + struct list_head entry; + u32 qpid; +}; + +struct cxio_ucontext { + struct list_head qpids; + struct mutex lock; +}; + +struct cxio_rdev { + char dev_name[T3_MAX_DEV_NAME_LEN]; + struct t3cdev *t3cdev_p; + struct rdma_info rnic_info; + struct adap_ports port_info; + struct cxio_hal_resource *rscp; + struct cxio_hal_ctrl_qp ctrl_qp; + void *ulp; + unsigned long qpshift; + u32 qpnr; + u32 qpmask; + struct cxio_ucontext uctx; + struct gen_pool *pbl_pool; + struct gen_pool *rqt_pool; +}; + +static inline int cxio_num_stags(struct cxio_rdev *rdev_p) +{ + return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5)); +} + +typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p, + 
struct sk_buff * skb); + +#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff) +#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff) +#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1) +#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1) +#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1) +#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1) +#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1) +#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1) +#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1) + +struct respQ_msg_t { + __be32 flags; /* flit 0 */ + __be32 cq_ptrid; + __be64 rsvd; /* flit 1 */ + struct t3_cqe cqe; /* flits 2-3 */ +}; + +enum t3_cq_opcode { + CQ_ARM_AN = 0x2, + CQ_ARM_SE = 0x6, + CQ_FORCE_AN = 0x3, + CQ_CREDIT_UPDATE = 0x7 +}; + +int cxio_rdev_open(struct cxio_rdev *rdev); +void cxio_rdev_close(struct cxio_rdev *rdev); +int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit); +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid); +int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr); +int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 
page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, + u32 pbl_addr); +int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); +int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +u32 cxio_hal_get_rhdl(void); +void cxio_hal_put_rhdl(u32 rhdl); +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); +int __init cxio_hal_init(void); +void __exit cxio_hal_exit(void); +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_flush_hw_cq(struct t3_cq *cq); +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit); + +#define MOD "iw_cxgb3: " +#define PDBG(fmt, args...) 
pr_debug(MOD fmt, ## args) + +#ifdef DEBUG +void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag); +void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift); +void cxio_dump_wqe(union t3_wr *wqe); +void cxio_dump_wce(struct t3_cqe *wce); +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents); +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid); +#endif + +#endif From swise at opengridcomputing.com Sun Dec 10 14:38:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:38:16 -0600 Subject: [openib-general] [PATCH v3 11/13] Core Resource Allocation In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223816.27166.81499.stgit@dell3.ogc.int> Core functions to carve up adapter memory, stag, qp, and cq IDs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 331 ++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_resource.h | 70 +++++ 2 files changed, 401 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..444df15 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,331 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_resource.h" +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + random_bytes = random32(); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = random32(); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p) +{ + u32 i; + + spin_lock_init(&rdev_p->rscp->qpid_fifo_lock); + + 
rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), + GFP_KERNEL, + &rdev_p->rscp->qpid_fifo_lock); + if (IS_ERR(rdev_p->rscp->qpid_fifo)) + return -ENOMEM; + + for (i = 16; i < T3_MAX_NUM_QP; i++) + if (!(i & rdev_p->qpmask)) + __kfifo_put(rdev_p->rscp->qpid_fifo, + (unsigned char *) &i, sizeof(u32)); + return 0; +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be a power of 2 */ +int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) + return -ENOMEM; + rdev_p->rscp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_qpid_fifo(rdev_p); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo empty */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void cxio_hal_put_rhdl(u32 rhdl) +{ + 
cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} + +/* + * PBL Memory Manager. Uses Linux generic allocator. 
+ */ + +#define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ +#define PBL_CHUNK 2*1024*1024 + +u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size); + return (u32)addr; +} + +void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size); + gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size); +} + +int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); + if (rdev_p->pbl_pool) + for (i = rdev_p->rnic_info.pbl_base; + i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; + i += PBL_CHUNK) + gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); + return rdev_p->pbl_pool ? 0 : -ENOMEM; +} + +void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->pbl_pool); +} + +/* + * RQT Memory Manager. Uses Linux generic allocator. + */ + +#define MIN_RQT_SHIFT 10 /* 1KB == mini RQT size (16 entries) */ +#define RQT_CHUNK 2*1024*1024 + +u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6); + return (u32)addr; +} + +void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6); + gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6); +} + +int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1); + if (rdev_p->rqt_pool) + for (i = rdev_p->rnic_info.rqt_base; + i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; + i += RQT_CHUNK) + gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1); + return rdev_p->rqt_pool ? 
0 : -ENOMEM; +} + +void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->rqt_pool); +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h new file mode 100644 index 0000000..a6bbe83 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+#ifndef __CXIO_RESOURCE_H__
+#define __CXIO_RESOURCE_H__
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "cxio_hal.h"
+
+extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl);
+extern void cxio_hal_destroy_rhdl_resource(void);
+extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+ u32 nr_tpt, u32 nr_pbl,
+ u32 nr_rqt, u32 nr_qpid, u32 nr_cqid,
+ u32 nr_pdid);
+extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag);
+extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid);
+extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid);
+extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp);
+
+#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base )
+extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+
+#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base )
+extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+#endif

From swise at opengridcomputing.com Sun Dec 10 14:38:46 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:38:46 -0600
Subject: [openib-general] [PATCH v3 12/13] Core Debug functions
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223846.27166.55367.stgit@dell3.ogc.int>

Debug code to dump various data structs, some of which are in adapter memory. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 205 +++++++++++++++++++++++++++ 1 files changed, 205 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..22f4f75 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,205 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = pbl_addr; + m->len = size; + PDBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + __be64 *data = (__be64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + if (size == 0) + size = 8; + while (size > 0) { + PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) 
+{ + __be64 *data = (__be64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +#endif From swise at opengridcomputing.com Sun Dec 10 14:39:16 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sun, 10 Dec 2006 16:39:16 -0600 Subject: [openib-general] [PATCH v3 13/13] Kconfig/Makefile In-Reply-To: 
<20061210223244.27166.36192.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <20061210223916.27166.82130.stgit@dell3.ogc.int> Signed-off-by: Steve Wise --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/cxgb3/Kconfig | 27 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/Makefile | 12 ++++++++++++ drivers/infiniband/hw/cxgb3/locking.txt | 25 +++++++++++++++++++++++++ 5 files changed, 66 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 59b3932..06453ab 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" +source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index 570b30a..69bdd55 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mt obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ +obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig new file mode 100644 index 0000000..84f0f6e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Kconfig @@ -0,0 +1,27 @@ +config INFINIBAND_CXGB3 + tristate "Chelsio RDMA Driver" + depends on CHELSIO_T3 && INFINIBAND + select GENERIC_ALLOCATOR + ---help--- + This is an iWARP/RDMA driver for the Chelsio T3 1GbE and + 10GbE adapters. 
+
+ For general information about Chelsio and our products, visit
+ our website at .
+
+ For customer support, please visit our customer support page at
+ .
+
+ Please send feedback to .
+
+ To compile this driver as a module, choose M here: the module
+ will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+ bool "Verbose debugging output"
+ depends on INFINIBAND_CXGB3
+ default n
+ ---help---
+ This option causes the Chelsio RDMA driver to produce copious
+ amounts of debug messages. Select this if you are developing
+ the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 0000000..0df2b3d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+ -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+ iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -O1 -g
+iw_cxgb3-y += core/cxio_dbg.o
+endif
diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt
new file mode 100644
index 0000000..e5e9991
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/locking.txt
@@ -0,0 +1,25 @@
+cq lock:
+ - spin lock
+ - used to synchronize the t3_cq
+
+qp lock:
+ - spin lock
+ - used to synchronize updates to the qp state, attrs, and the t3_wq.
+ - touched on interrupt and process context
+
+rnicp lock:
+ - spin lock
+ - touched on interrupt and process context
+ - used around lookup tables mapping CQID and QPID to a structure.
+ - used also to bump the refcnt atomically with the lookup.
+
+poll:
+ lock+disable on cq lock
+ lock qp lock for each cqe that is polled around the call
+ to cxio_poll_cq().
+
+post:
+ lock+disable qp lock
+
+global mutex iwch_mutex:
+ used to maintain global device list.
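
Note on the locking.txt rules above: what follows is an illustrative userspace sketch, not code from the patch. It assumes pthread mutexes as stand-ins for the driver's spinlocks (the kernel code would also disable interrupts while holding them), and every demo_* name is hypothetical. It only shows the poll-path ordering: take the cq lock once, then take each CQE's qp lock around the per-CQE work that cxio_poll_cq() would do.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a queue pair. */
struct demo_qp {
	pthread_mutex_t lock;	/* "qp lock": guards qp state and work queue */
	int completions;	/* CQEs consumed on behalf of this QP */
};

/* Hypothetical completion entry: only records which QP it belongs to. */
struct demo_cqe {
	struct demo_qp *qp;
};

/* Hypothetical stand-in for a completion queue (no wrap-around). */
struct demo_cq {
	pthread_mutex_t lock;	/* "cq lock": guards the completion queue */
	struct demo_cqe *ring;	/* entries in [head, tail) are pending */
	size_t head;
	size_t tail;
};

/*
 * Poll up to max pending entries and return how many were consumed.
 * Lock order is always cq lock first, then the qp lock of each entry.
 */
int demo_poll_cq(struct demo_cq *cq, int max)
{
	int polled = 0;

	assert(cq != NULL && cq->ring != NULL);

	pthread_mutex_lock(&cq->lock);
	while (polled < max && cq->head < cq->tail) {
		struct demo_qp *qp = cq->ring[cq->head].qp;

		/* The qp lock is held only around the per-CQE work. */
		pthread_mutex_lock(&qp->lock);
		qp->completions++;	/* stand-in for cxio_poll_cq() */
		pthread_mutex_unlock(&qp->lock);

		cq->head++;
		polled++;
	}
	pthread_mutex_unlock(&cq->lock);

	return polled;
}
```

Because the post path in locking.txt takes only the qp lock, it can never acquire the two locks in the opposite order, so a fixed cq-then-qp order in the poll path is enough to rule out a lock-order inversion between the two.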
From sashak at voltaire.com Sun Dec 10 14:56:13 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 11 Dec 2006 00:56:13 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061210215956.GI9205@mellanox.co.il> References: <20061210215033.GC21155@sashak.voltaire.com> <20061210215956.GI9205@mellanox.co.il> Message-ID: <20061210225613.GF21155@sashak.voltaire.com> On 23:59 Sun 10 Dec , Michael S. Tsirkin wrote: > > Recently I found this OFA 'Userspace Git Trees' downloading howto: > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories > > > > and thought that we could make it simpler for end-user to choose the > > "right" git tree just by adding one more series of symbolic links under > > /pub/scm. This links will point to the maintainer's "official" trees, and > > we will have only one such link per project. > > > > So typical downloading howto for end-users will looks like: > > > > git clone git://staging.openfabrics.org/dapl > > git clone git://staging.openfabrics.org/ibutils > > git clone git://staging.openfabrics.org/imgen > > ... > > > > instead of > > > > git clone git://staging.openfabrics.org/~ardavis/dapl > > git clone git://staging.openfabrics.org/~eitan/ibutils > > git clone git://staging.openfabrics.org/~mst/imgen > > ... > > > > as it is now. > > NACK, please remove this. These soft links are messy, and > the fact that one needs root just to add a tree shows just how the approach > is broken. No, it is not instead, but in addition to ~user/ links, so root is _not_ required to add tree. > If you have some temporary tree, just mention this in description, And when it is not temporary tree? > and gitweb will show this. And won't the problem basically go away > if you move ~sashak temporary trees out of ~/scm? For me it is unclear yet how long we may need this - 1.1 still be in SVN yet, and 1.1 git branch is updated there. > It seems we don't > have a lot of duplicates besides that. 
But we will have - we are running git hosting only week or so and already talking about pre-trunk trees for some projects. :) Other opinions? Sasha From randy.dunlap at oracle.com Sun Dec 10 14:56:02 2006 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Sun, 10 Dec 2006 14:56:02 -0800 Subject: [openib-general] [PATCH v3 13/13] Kconfig/Makefile In-Reply-To: <20061210223916.27166.82130.stgit@dell3.ogc.int> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> <20061210223916.27166.82130.stgit@dell3.ogc.int> Message-ID: <20061210145602.d2a8bb98.randy.dunlap@oracle.com> On Sun, 10 Dec 2006 16:39:16 -0600 Steve Wise wrote: > drivers/infiniband/Kconfig | 1 + > drivers/infiniband/Makefile | 1 + > drivers/infiniband/hw/cxgb3/Kconfig | 27 +++++++++++++++++++++++++++ > drivers/infiniband/hw/cxgb3/Makefile | 12 ++++++++++++ > drivers/infiniband/hw/cxgb3/locking.txt | 25 +++++++++++++++++++++++++ > 5 files changed, 66 insertions(+), 0 deletions(-) > > diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig > new file mode 100644 > index 0000000..84f0f6e > --- /dev/null > +++ b/drivers/infiniband/hw/cxgb3/Kconfig > @@ -0,0 +1,27 @@ > +config INFINIBAND_CXGB3 > + tristate "Chelsio RDMA Driver" > + depends on CHELSIO_T3 && INFINIBAND > + select GENERIC_ALLOCATOR > + ---help--- > + This is an iWARP/RDMA driver for the Chelsio T3 1GbE and > + 10GbE adapters. > + > + For general information about Chelsio and our products, visit > + our website at . > + > + For customer support, please visit our customer support page at > + . > + > + Please send feedback to . > + > + To compile this driver as a module, choose M here: the module > + will be called iw_cxgb3. Please indent all of that the same amount. Kconfig help text should be indented 1 tab + 2 spaces, like the first 2 lines are. 
> diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt > new file mode 100644 > index 0000000..e5e9991 > --- /dev/null > +++ b/drivers/infiniband/hw/cxgb3/locking.txt > @@ -0,0 +1,25 @@ > +cq lock: > + - spin lock > + - used to synchronize the t3_cq > + > +qp lock: > + - spin lock > + - used to synchronize updates to the qp state, attrs, and the t3_wq. > + - touched on interrupt and process context > + > +rnicp lock: > + - spin lock > + - touched on interrupt and process context > + - used around lookup tables mapping CQID and QPID to a structure. > + - used also to bump the refcnt atomically with the lookup. > + > +poll: > + lock+disable on cq lock > + lock qp lock for each cqe that is polled around the call > + to cxio_poll_cq(). > + > +post: > + lock+disable qp lock > + > +global mutex iwch_mutex: > + used to maintain global device list. Should be in Documentation/infiniband/. Docs go in the Documentation/ dir, not in drivers/ dir. --- ~Randy From swise at opengridcomputing.com Sun Dec 10 15:04:14 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Sun, 10 Dec 2006 17:04:14 -0600 Subject: [openib-general] [PATCH v3 13/13] Kconfig/Makefile In-Reply-To: <20061210145602.d2a8bb98.randy.dunlap@oracle.com> References: <20061210223244.27166.36192.stgit@dell3.ogc.int> <20061210223916.27166.82130.stgit@dell3.ogc.int> <20061210145602.d2a8bb98.randy.dunlap@oracle.com> Message-ID: <1165791854.25243.11.camel@linux-q667.site> > > +++ b/drivers/infiniband/hw/cxgb3/Kconfig > > @@ -0,0 +1,27 @@ > > +config INFINIBAND_CXGB3 > > + tristate "Chelsio RDMA Driver" > > + depends on CHELSIO_T3 && INFINIBAND > > + select GENERIC_ALLOCATOR > > + ---help--- > > + This is an iWARP/RDMA driver for the Chelsio T3 1GbE and > > + 10GbE adapters. > > + > > + For general information about Chelsio and our products, visit > > + our website at . > > + > > + For customer support, please visit our customer support page at > > + . 
> > + > > + Please send feedback to . > > + > > + To compile this driver as a module, choose M here: the module > > + will be called iw_cxgb3. > > Please indent all of that the same amount. > Kconfig help text should be indented 1 tab + 2 spaces, > like the first 2 lines are. > Will do. > > > diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt > > new file mode 100644 > > index 0000000..e5e9991 > > --- /dev/null > > +++ b/drivers/infiniband/hw/cxgb3/locking.txt > > @@ -0,0 +1,25 @@ > > +cq lock: > > + - spin lock > > + - used to synchronize the t3_cq > > + > > +qp lock: > > + - spin lock > > + - used to synchronize updates to the qp state, attrs, and the t3_wq. > > + - touched on interrupt and process context > > + > > +rnicp lock: > > + - spin lock > > + - touched on interrupt and process context > > + - used around lookup tables mapping CQID and QPID to a structure. > > + - used also to bump the refcnt atomically with the lookup. > > + > > +poll: > > + lock+disable on cq lock > > + lock qp lock for each cqe that is polled around the call > > + to cxio_poll_cq(). > > + > > +post: > > + lock+disable qp lock > > + > > +global mutex iwch_mutex: > > + used to maintain global device list. > > Should be in Documentation/infiniband/. > Docs go in the Documentation/ dir, not in drivers/ dir. > I think I'll just remove this file. I don't think its that useful... Steve. From mst at mellanox.co.il Sun Dec 10 15:05:15 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 11 Dec 2006 01:05:15 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061210225613.GF21155@sashak.voltaire.com> References: <20061210215033.GC21155@sashak.voltaire.com> <20061210215956.GI9205@mellanox.co.il> <20061210225613.GF21155@sashak.voltaire.com> Message-ID: <20061210230515.GJ9205@mellanox.co.il> > > > Recently I found this OFA 'Userspace Git Trees' downloading howto: > > > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories > > > > > > and thought that we could make it simpler for end-user to choose the > > > "right" git tree just by adding one more series of symbolic links under > > > /pub/scm. This links will point to the maintainer's "official" trees, and > > > we will have only one such link per project. > > > > > > So typical downloading howto for end-users will looks like: > > > > > > git clone git://staging.openfabrics.org/dapl > > > git clone git://staging.openfabrics.org/ibutils > > > git clone git://staging.openfabrics.org/imgen > > > ... > > > > > > instead of > > > > > > git clone git://staging.openfabrics.org/~ardavis/dapl > > > git clone git://staging.openfabrics.org/~eitan/ibutils > > > git clone git://staging.openfabrics.org/~mst/imgen > > > ... > > > > > > as it is now. > > > > NACK, please remove this. These soft links are messy, and > > the fact that one needs root just to add a tree shows just how the approach > > is broken. > > No, it is not instead, but in addition to ~user/ links, so root is _not_ > required to add tree. right but suddenly root is needed to make it "official". Let's avoid the whole policy-setting-by-softlinks. "I have root" should not equal, or be required for "I say what's official". > > If you have some temporary tree, just mention this in description, > > And when it is not temporary tree? Say what it is in the description. Put a link in wiki. > > and gitweb will show this. 
And won't the problem basically go away > > if you move ~sashak temporary trees out of ~/scm? > > For me it is unclear yet how long we may need this - 1.1 still be in > SVN yet, and 1.1 git branch is updated there. So ~sashak/scm things track the 1.1 branch in git? Move it to ~sashak/scm/ofed-1.1 then, and set the description accordingly? > > It seems we don't > > have a lot of duplicates besides that. > > But we will have - we are running git hosting only week or so and already > talking about pre-trunk trees for some projects. :) These should be branches, not separate trees. So no issue there. -- MST From mst at mellanox.co.il Sun Dec 10 15:10:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 01:10:04 +0200 Subject: [openib-general] cannot clone librdmacm In-Reply-To: <20061210221805.GD21155@sashak.voltaire.com> References: <20061210221805.GD21155@sashak.voltaire.com> Message-ID: <20061210231004.GK9205@mellanox.co.il> > On 23:39 Sun 10 Dec , Michael S. Tsirkin wrote: > > > Sean, you can do > > > > > > chmod 755 hooks/post-update > > > > > > This hook runs git-server-update-info after each push. > > > > It seems we really want this as default. > > Sasha, could you please > > chmod 755 /usr/share/git-core/templates/hooks/pre-commit > > so that this will be the default for all new users? > > Would prefer to not do this. All hooks are "off" is reasonable default > IMO and this should be tree maintainer's decision to enable specific > hook or not. > > If somebody needs help with setup, we can help, or we could write sort > of 'howto' if there are common problems. But I think we cannot take > "ownership" there. Defaults should be sane and help people. Everyone can still override the template or disable the hook. So how is it taking ownership? 
--
MST

From eeb at bartonsoftware.com Sun Dec 10 15:08:45 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Sun, 10 Dec 2006 23:08:45 -0000
Subject: [openib-general] version #defines for the kernel
In-Reply-To:
Message-ID: <076a01c71cb0$244a7630$0281a8c0@ebpc>

Roland,

> No other kernel subsystem has one, so I don't think it's realistic to
> expect one for IB.

Don't you think it would be useful? Even if only to make API changes
explicit?

Cheers,
Eric

From swise at opengridcomputing.com Sun Dec 10 15:15:15 2006
From: swise at opengridcomputing.com (Steve WIse)
Date: Sun, 10 Dec 2006 17:15:15 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165530250.14449.85.camel@stevo-desktop>
References: <1165530250.14449.85.camel@stevo-desktop>
Message-ID: <1165792515.25243.20.camel@linux-q667.site>

On Thu, 2006-12-07 at 16:24 -0600, Steve Wise wrote:
> On Thu, 2006-12-07 at 14:21 -0800, Woodruff, Robert J wrote:
> > Steve wrote,
> > >Yea maybe. For now, you get everything I need to make cxgb3 run on
> > >2.6.19. I'll think about the multiple branch approach.
> >
> > The issue is this. I am working on putting together an OFA integration
> > tree that integrates several components from several different
> > developers.
> > The same will be true when we start to integrate code into OFED 1.2.
> > Most code will come from Linus's tree, but some code will need to
> > come directly from the developer's git trees and we will need
> > a way to generate a patch for only your code, as we will get things like
> > the local_sa cache code directly from Sean's.
> >
> > So if you can make a branch that only contains the cxgb3 code, it makes
> > generating a patch with only your code easier, and this will be needed
> > both for my early OFA integration work and also for OFED 1.2.
> > Once your code is upstream, life is easier as we will get it from
> > linus, until then we'd like a way to patch the existing released kernel
> > (2.6.19 in this case) with your code.
> >
> > make sense ?
>
> I understand.

I've updated the tree and it now includes 2 branches: cxgb3 and
cxgb3_prereqs.

To see only the Chelsio T3 drivers (with needed infiniband/core
changes):

 git-diff --patch-with-stat cxgb3_prereqs cxgb3

The cxgb3_prereqs branch includes anything I want in my tree for testing
the chelsio code. Currently that includes krping and Sean's ucma code.
BTW: the IWCM core fixes are now in linus's tree so I no longer need
them explicitly.

The cxgb3 branch includes all from the cxgb3_prereqs branch plus all the
T3 drivers under review now.

NOTE: This git tree is backed against Linus's tree and I merged up to
his latest on 12/8. So it's past 2.6.19 and now depends on changes that
are post 2.6.19 (the workqueue changes).

Steve.

From sashak at voltaire.com Sun Dec 10 15:28:44 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 01:28:44 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061210231004.GK9205@mellanox.co.il>
References: <20061210221805.GD21155@sashak.voltaire.com> <20061210231004.GK9205@mellanox.co.il>
Message-ID: <20061210232844.GA32199@sashak.voltaire.com>

On 01:10 Mon 11 Dec , Michael S. Tsirkin wrote:
> > On 23:39 Sun 10 Dec , Michael S. Tsirkin wrote:
> > > > Sean, you can do
> > > >
> > > > chmod 755 hooks/post-update
> > > >
> > > > This hook runs git-server-update-info after each push.
> > >
> > > It seems we really want this as default.
> > > Sasha, could you please
> > > chmod 755 /usr/share/git-core/templates/hooks/pre-commit
> > > so that this will be the default for all new users?
> >
> > Would prefer to not do this. All hooks are "off" is reasonable default
> > IMO and this should be tree maintainer's decision to enable specific
> > hook or not.
> > > > If somebody needs help with setup, we can help, or we could write sort > > of 'howto' if there are common problems. But I think we cannot take > > "ownership" there. > > Defaults should be sane and help people. > Everyone can still override the template or disable > the hook. Right, and everyone can enable this, if _he_ wants. Sasha From sashak at voltaire.com Sun Dec 10 15:36:57 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 11 Dec 2006 01:36:57 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061210230515.GJ9205@mellanox.co.il> References: <20061210215033.GC21155@sashak.voltaire.com> <20061210215956.GI9205@mellanox.co.il> <20061210225613.GF21155@sashak.voltaire.com> <20061210230515.GJ9205@mellanox.co.il> Message-ID: <20061210233657.GB32199@sashak.voltaire.com> On 01:05 Mon 11 Dec , Michael S. Tsirkin wrote: > > > > Recently I found this OFA 'Userspace Git Trees' downloading howto: > > > > > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories > > > > > > > > and thought that we could make it simpler for end-user to choose the > > > > "right" git tree just by adding one more series of symbolic links under > > > > /pub/scm. This links will point to the maintainer's "official" trees, and > > > > we will have only one such link per project. > > > > > > > > So typical downloading howto for end-users will looks like: > > > > > > > > git clone git://staging.openfabrics.org/dapl > > > > git clone git://staging.openfabrics.org/ibutils > > > > git clone git://staging.openfabrics.org/imgen > > > > ... > > > > > > > > instead of > > > > > > > > git clone git://staging.openfabrics.org/~ardavis/dapl > > > > git clone git://staging.openfabrics.org/~eitan/ibutils > > > > git clone git://staging.openfabrics.org/~mst/imgen > > > > ... > > > > > > > > as it is now. > > > > > > NACK, please remove this. 
These soft links are messy, and > > > the fact that one needs root just to add a tree shows just how the approach > > > is broken. > > > > No, it is not instead, but in addition to ~user/ links, so root is _not_ > > required to add tree. > > right but suddenly root is needed to make it "official". > Let's avoid the whole policy-setting-by-softlinks. > "I have root" should not equal, or be required for "I say what's official". What are you trying to avoid? That only sysadmin will decide which git tree will be "official" for OFED and which will not? > > > > If you have some temporary tree, just mention this in description, > > > > And when it is not temporary tree? > > Say what it is in the description. > Put a link in wiki. > > > > and gitweb will show this. And won't the problem basically go away > > > if you move ~sashak temporary trees out of ~/scm? > > > > For me it is unclear yet how long we may need this - 1.1 still be in > > SVN yet, and 1.1 git branch is updated there. > > So ~sashak/scm things track the 1.1 branch in git? All active SVN branches. > Move it to ~sashak/scm/ofed-1.1 then, and set the description accordingly? > > > > It seems we don't > > > have a lot of duplicates besides that. > > > > But we will have - we are running git hosting only week or so and already > > talking about pre-trunk trees for some projects. :) > > These should be branches, not separate trees. Why not? Sasha From rdreier at cisco.com Sun Dec 10 20:02:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 10 Dec 2006 20:02:20 -0800 Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: I haven't seen any evidence of the corresponding ethernet NIC driver being merged for 2.6.20 (which is a prerequisite, right). What's the status of that? - R. 
From rdreier at cisco.com Sun Dec 10 21:02:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 10 Dec 2006 21:02:20 -0800 Subject: [openib-general] [PATCH v3 13/13] Kconfig/Makefile References: <20061210223244.27166.36192.stgit@dell3.ogc.int> <20061210223916.27166.82130.stgit@dell3.ogc.int> <20061210145602.d2a8bb98.randy.dunlap@oracle.com> Message-ID: > > +++ b/drivers/infiniband/hw/cxgb3/locking.txt > Should be in Documentation/infiniband/. > Docs go in the Documentation/ dir, not in drivers/ dir. Or put it in a comment in the appropriate header, if you want to keep it close to the driver source... From rdreier at cisco.com Sun Dec 10 21:27:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 10 Dec 2006 21:27:20 -0800 Subject: [openib-general] cannot clone librdmacm References: <20061210221805.GD21155@sashak.voltaire.com> <20061210231004.GK9205@mellanox.co.il> <20061210232844.GA32199@sashak.voltaire.com> Message-ID: > Right, and everyone can enable this, if _he_ wants. I think the point is that in the OFA environment, there's no obvious reason to disable the hook, since without the hook http:// transport is broken. So it makes sense to help people who aren't necessarily git experts, and pick a default that makes things work smoothly. Experts can disable the hook if there's some reason to do so (although to be honest I don't see any reason). - R. From mst at mellanox.co.il Sun Dec 10 21:48:08 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 11 Dec 2006 07:48:08 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061210233657.GB32199@sashak.voltaire.com> References: <20061210233657.GB32199@sashak.voltaire.com> Message-ID: <20061211054539.GL9205@mellanox.co.il> > > > > > Recently I found this OFA 'Userspace Git Trees' downloading howto: > > > > > > > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories > > > > > > > > > > and thought that we could make it simpler for end-user to choose the > > > > > "right" git tree just by adding one more series of symbolic links under > > > > > /pub/scm. This links will point to the maintainer's "official" trees, and > > > > > we will have only one such link per project. > > > > > > > > > > So typical downloading howto for end-users will looks like: > > > > > > > > > > git clone git://staging.openfabrics.org/dapl > > > > > git clone git://staging.openfabrics.org/ibutils > > > > > git clone git://staging.openfabrics.org/imgen > > > > > ... > > > > > > > > > > instead of > > > > > > > > > > git clone git://staging.openfabrics.org/~ardavis/dapl > > > > > git clone git://staging.openfabrics.org/~eitan/ibutils > > > > > git clone git://staging.openfabrics.org/~mst/imgen > > > > > ... > > > > > > > > > > as it is now. > > > > > > > > NACK, please remove this. These soft links are messy, and > > > > the fact that one needs root just to add a tree shows just how the approach > > > > is broken. > > > > > > No, it is not instead, but in addition to ~user/ links, so root is _not_ > > > required to add tree. > > > > right but suddenly root is needed to make it "official". > > Let's avoid the whole policy-setting-by-softlinks. > > "I have root" should not equal, or be required for "I say what's official". > > What are you trying to avoid? That only sysadmin will decide which git > tree will be "official" for OFED and which will not? Yes. 
Another point is that I should not need sysadmin privileges to create a new project and declare my tree the official source. But not only that - staging is used to develop more than just OFED. Read the rant part in the original mail if you like for more detail - development trees should all be equal. Only releases should be official. And a release has an immutable name, so it does not *matter* which tree you get it from. > > > > > > If you have some temporary tree, just mention this in description, > > > > > > And when it is not temporary tree? > > > > Say what it is in the description. > > Put a link in wiki. > > > > > > and gitweb will show this. And won't the problem basically go away > > > > if you move ~sashak temporary trees out of ~/scm? > > > > > > For me it is unclear yet how long we may need this - 1.1 still be in > > > SVN yet, and 1.1 git branch is updated there. > > > > So ~sashak/scm things track the 1.1 branch in git? > > All active SVN branches. But there *shouldn't* be any active SVN branches now besides the 1.1 branch. So the rest can be killed off. > > Move it to ~sashak/scm/ofed-1.1 then, and set the description accordingly? > > > > > > It seems we don't > > > > have a lot of duplicates besides that. > > > > > > But we will have - we are running git hosting only week or so and already > > > talking about pre-trunk trees for some projects. :) > > > > These should be branches, not separate trees. > > Why not? You seem to have a fear of branches :). Many trees do not buy you anything, I tried this with ofed 1.1 in the beginning. You can have many trees. But a single project maintained by a single person belongs in a single public tree, scattering it around between multiple trees just makes it messy for people to track, and messy to figure out the delta between branches. Finally, it wastes space. -- MST From mst at mellanox.co.il Mon Dec 11 00:24:10 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Mon, 11 Dec 2006 10:24:10 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061210134137.GL29174@mellanox.co.il> References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> <20061210134137.GL29174@mellanox.co.il> Message-ID: <20061211082410.GB29276@mellanox.co.il> > The following patch adds experimental support for IPoIB connected mode. > The idea is to increase performance by increasing the MTU > from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD. > With this code, I'm able to get 800MByte/sec or more with netperf > without options on a Mellanox 4x back-to-back DDR system. > > Signed-off-by: Michael S. Tsirkin BTW, Roland, could you give me some indication on whether this has a chance getting into 2.6.20? If yes I'll stop writing new code and focus on polishing this. -- MST From erezz at voltaire.com Mon Dec 11 01:20:55 2006 From: erezz at voltaire.com (Erez Zilber) Date: Mon, 11 Dec 2006 11:20:55 +0200 Subject: [openib-general] open-iscsi update for OFED 1.2 In-Reply-To: <20061127071729.GA6925@mellanox.co.il> References: <456A8FB5.9060602@voltaire.com> <20061127071729.GA6925@mellanox.co.il> Message-ID: <457D22F7.6060507@voltaire.com> Michael S. Tsirkin wrote: >>> More than this - since ofed really starts from kernel.org kernel, >>> just give us the list of files and ofed scripts will check that >>> out and build. You'll have to backport open-iscsi to distro kernels though. >>> Here are the open-iscsi kernel files: drivers/scsi/iscsi_tcp.c drivers/scsi/iscsi_tcp.h drivers/scsi/libiscsi.c drivers/scsi/scsi_transport_iscsi.c include/scsi/iscsi_if.h include/scsi/iscsi_proto.h include/scsi/libiscsi.h include/scsi/scsi_transport_iscsi.h Thanks, Erez From mst at mellanox.co.il Mon Dec 11 01:25:56 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 11 Dec 2006 11:25:56 +0200 Subject: [openib-general] open-iscsi update for OFED 1.2 In-Reply-To: <457D22F7.6060507@voltaire.com> References: <457D22F7.6060507@voltaire.com> Message-ID: <20061211092556.GC29276@mellanox.co.il> > >>> More than this - since ofed really starts from kernel.org kernel, > >>> just give us the list of files and ofed scripts will check that > >>> out and build. You'll have to backport open-iscsi to distro kernels though. > >>> > Here are the open-iscsi kernel files: > > drivers/scsi/iscsi_tcp.c > drivers/scsi/iscsi_tcp.h > drivers/scsi/libiscsi.c > drivers/scsi/scsi_transport_iscsi.c > include/scsi/iscsi_if.h > include/scsi/iscsi_proto.h > include/scsi/libiscsi.h > include/scsi/scsi_transport_iscsi.h OK. So after we'll add that to checkout scripts (hope Vlad can do this this week), next thing you'll need is to add backport patches/addons and update makefile to build iscsi. -- MST From erezz at voltaire.com Mon Dec 11 01:31:21 2006 From: erezz at voltaire.com (Erez Zilber) Date: Mon, 11 Dec 2006 11:31:21 +0200 Subject: [openib-general] open-iscsi update for OFED 1.2 In-Reply-To: <20061211092556.GC29276@mellanox.co.il> References: <457D22F7.6060507@voltaire.com> <20061211092556.GC29276@mellanox.co.il> Message-ID: <457D2569.2000805@voltaire.com> Michael S. Tsirkin wrote: >>>>> More than this - since ofed really starts from kernel.org kernel, >>>>> just give us the list of files and ofed scripts will check that >>>>> out and build. You'll have to backport open-iscsi to distro kernels though. >>>>> >>>>> >> Here are the open-iscsi kernel files: >> >> drivers/scsi/iscsi_tcp.c >> drivers/scsi/iscsi_tcp.h >> drivers/scsi/libiscsi.c >> drivers/scsi/scsi_transport_iscsi.c >> include/scsi/iscsi_if.h >> include/scsi/iscsi_proto.h >> include/scsi/libiscsi.h >> include/scsi/scsi_transport_iscsi.h >> > > OK. 
So after we'll add that to checkout scripts (hope Vlad can do this this > week), next thing you'll need is to add backport patches/addons and update makefile > to build iscsi. > > I understand that the kernel version that OFED 1.2 will be based on is not yet known (or am I wrong?). In order to create backport patches to a specific distro, I need to know where I start from (i.e. which kernel version). Erez From eitan at mellanox.co.il Mon Dec 11 01:51:29 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 11 Dec 2006 11:51:29 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location Message-ID: <457D2A21.9030804@mellanox.co.il> Hi, Currently libsdp.conf is installed into $prefix/etc. This seems a little non standard to me. Instead I would think it needs to go into /etc/infiniband/libsdp.conf. Any comments - please speak up. BTW: libsdp.conf used to be overwritten in previous install. I have fixed the makefile to avoid that and instead create a new file with install date under the same directory. Thanks Eitan From mst at mellanox.co.il Mon Dec 11 02:06:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 12:06:07 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: <457D2A21.9030804@mellanox.co.il> References: <457D2A21.9030804@mellanox.co.il> Message-ID: <20061211100607.GF29276@mellanox.co.il> Quoting r. Eitan Zahavi : Subject: libsdp: RFC changing libsdp.conf location Hi, > Currently libsdp.conf is installed into $prefix/etc. > This seems a little non standard to me. There's no real standard on configuration files in Unix. So you can do whatever you want within reason. > Instead I would think it needs > to go > into /etc/infiniband/libsdp.conf. /etc/infiniband is an OFED thing. I suggest keeping libsdp separate so that it is distribution agnostic. > > Any comments - please speak up. In the past, lots of customers asked that installed files reside under $prefix.
It *is* important since it lets people find out easily what is added to their systems. OFED does not follow this rule 100% but it's better not to add more exceptions. > BTW: libsdp.conf used to be overwritten in previous install. > I have fixed the makefile to avoid that and instead create a > new file with install date under the same directory. So the installed file hits a different location depending on date and on whether I have an old library installed? This pretty much guarantees the user won't be able to find the file you have installed: you seem to assume that users read installation logs but that's typically not the case. Why not just have libsdp.conf.example, or something like that, under $prefix/etc and install that always, and only copy to $prefix/etc/libsdp.conf if that does not exist? -- MST From mst at mellanox.co.il Mon Dec 11 02:22:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 12:22:22 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: <457D2A21.9030804@mellanox.co.il> References: <457D2A21.9030804@mellanox.co.il> Message-ID: <20061211102222.GB5944@mellanox.co.il> > BTW: libsdp.conf used to be overwritten in previous install. > I have fixed the makefile to avoid that and instead create a > new file with install date under the same directory. Here's a simple proposal that will address this issue: - Make libsdp behave sanely when not libsdp.conf file is present. Do not install anything in default location in make install. - in make install, copy the example configuration file into libsdp.conf.example. Add a line to the top of it saying "rename this file to libsdp.conf to make libsdp use it".
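For illustration only, the install step proposed above could be sketched in shell roughly as follows. The helper name install_libsdp_conf and its two arguments are assumptions made for this sketch; they are not taken from the libsdp makefile.

```shell
#!/bin/sh
# Hypothetical helper, not part of libsdp: install only the example file
# and never create or overwrite libsdp.conf itself.
# $1 = sample configuration file in the source tree (assumed to exist)
# $2 = install prefix
install_libsdp_conf() {
    src=$1
    prefix=$2
    mkdir -p "$prefix/etc" || return 1
    example="$prefix/etc/libsdp.conf.example"
    # The rename hint goes on the first line, followed by the sample contents.
    {
        echo "# rename this file to libsdp.conf to make libsdp use it"
        cat "$src"
    } > "$example"
}
```

Because the helper never writes $prefix/etc/libsdp.conf, an administrator's existing configuration survives a reinstall, and the example always lands in one predictable place.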
-- MST From eitan at mellanox.co.il Mon Dec 11 02:26:49 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 11 Dec 2006 12:26:49 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: <20061211102222.GB5944@mellanox.co.il> References: <457D2A21.9030804@mellanox.co.il> <20061211102222.GB5944@mellanox.co.il> Message-ID: <457D3269.3070401@mellanox.co.il> Hi Michael, Thanks. This proposal is simple and clear to me. Let's wait a day and see if anybody else has other ideas. Thanks Eitan Michael S. Tsirkin wrote: >> BTW: libsdp.conf used to be overwritten in previous install. >> I have fixed the makefile to avoid that and instead create a >> new file with install date under the same directory. >> > > Here's a simple proposal that will address this issue: > - Make libsdp behave sanely when not libsdp.conf file is present. > Do not install anything in default location in make install. > > - in make install, copy the example configuration file into > libsdp.conf.example. Add a line to the top of it saying > "rename this file to libsdp.conf to make libsdp use it". > > From mst at mellanox.co.il Mon Dec 11 02:26:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 12:26:56 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: <20061211102222.GB5944@mellanox.co.il> References: <457D2A21.9030804@mellanox.co.il> <20061211102222.GB5944@mellanox.co.il> Message-ID: <20061211102656.GC5944@mellanox.co.il> > - Make libsdp behave sanely when not libsdp.conf file is present. This should have been "when libsdp.conf file is not present" :). -- MST From mst at mellanox.co.il Mon Dec 11 06:48:13 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 16:48:13 +0200 Subject: [openib-general] ofed backports update Message-ID: <20061211144813.GA15870@mellanox.co.il> Here's a small update on OFED 1.2 backports.
This describes a change I did a couple of weeks ago but never got to documenting. NOTE: This info is relevant only for people developing OFED kernel code, everything is transparent for others. NOTE: This is by *no means* a comprehensive writeup of OFED build process - just a small update for people familiar with development in OFED 1.1. Background: OFED 1.1 did all backports by applying patches under kernel_patches/backports// directory. To back-port a package, you just stuck a patch there and once OFED detected an appropriate kernel, it was applied before build. In many cases - where the kernel we are back-porting to was simply missing some macro - what the patch actually did was just add a file under the include directory, and OFED build scripts knew to pick these up before standard linux includes. Managing these became somewhat of a pain as it is often hard to see the history of a patch: try git diff on a patch that sits in git tree and see what I mean. Update: So for OFED 1.2 I've created a new directory kernel_addons, and converted all patches that created new files to plain files under the relevant kernel directory. OFED scripts now look there for files before standard Linux headers. For an example, look at how backport to 2.6.18 looks: http://staging.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=tree;f=kernel_addons/backport/2.6.18/include/linux;h=5eabed1f98596f92ce149dae65c4ab1ceb1d6a67;hb=HEAD Unfortunately, not all patches are of this form - some really tweak source inside the infiniband subtree - but we can strive to reduce the number of these and in this way make maintaining backports more of a seamless process. Bottom line: There are now 2 mechanisms for back-porting in OFED: - if you want to add a kernel-specific file, stick it under kernel_addons/backport//. - if you must change an existing file depending on kernel version, stick a patch in kernel_patches/backports//.
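As a rough sketch of how a build script might consume the two mechanisms above: the function names, the per-kernel directory argument (for example 2.6.18, as in the gitweb link) and the .patch file suffix are all assumptions made for illustration. This is not the actual OFED build code.

```shell
#!/bin/sh
# Hypothetical sketch, not the OFED build scripts.
# $1 = top of the OFED kernel tree, $2 = backport name such as 2.6.18.

# Mechanism 1: plain addon files. Print an include flag for the addon
# directory so that it can be placed before the stock Linux headers.
backport_include_flags() {
    addons="$1/kernel_addons/backport/$2/include"
    if [ -d "$addons" ]; then
        printf '%s\n' "-I$addons"
    fi
}

# Mechanism 2: patches that modify existing files. List them in name
# order; the caller would apply each one before building.
backport_patches() {
    dir="$1/kernel_patches/backports/$2"
    [ -d "$dir" ] || return 0
    for p in "$dir"/*.patch; do
        [ -f "$p" ] && printf '%s\n' "$p"
    done
    return 0
}
```

Keeping new headers as plain files means git shows their history directly, while the patch directory is left for the cases where an existing file really has to change.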
-- MST From mst at mellanox.co.il Mon Dec 11 07:06:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 17:06:43 +0200 Subject: [openib-general] open-iscsi update for OFED 1.2 In-Reply-To: <457D2569.2000805@voltaire.com> References: <457D2569.2000805@voltaire.com> Message-ID: <20061211150643.GC15870@mellanox.co.il> > >>>>> More than this - since ofed really starts from kernel.org kernel, > >>>>> just give us the list of files and ofed scripts will check that > >>>>> out and build. You'll have to backport open-iscsi to distro kernels though. > >>>>> > >>>>> > >> Here are the open-iscsi kernel files: > >> > >> drivers/scsi/iscsi_tcp.c > >> drivers/scsi/iscsi_tcp.h > >> drivers/scsi/libiscsi.c > >> drivers/scsi/scsi_transport_iscsi.c > >> include/scsi/iscsi_if.h > >> include/scsi/iscsi_proto.h > >> include/scsi/libiscsi.h > >> include/scsi/scsi_transport_iscsi.h > >> > > > > OK. So after we'll add that to checkout scripts (hope Vlad can do this this > > week), next thing you'll need is to add backport patches/addons and update makefile > > to build iscsi. > > > > > I understand that the kernel version that OFED 1.2 will be based on is > unknown yet (or am I wrong?). In order to create backport patches to a > specific distro, I need to know where I start from (i.e which kernel > version). Not really. Start from here: git://staging.openfabrics.org/~vlad/ofed_1_2/.git This is currently based on 2.6.19. Clone this and work off ofed_1_2, test and ask for pull. Then when there's an -rc from Linus, iser build might break and then you'll need to fix the backports. However, from experience, if backports are done carefully enough (separating the actual code in new header files) this is either easy or nothing breaks. See the mail I've just sent to openib on new tricks we have in OFED 1.2 to make this easier. 
-- MST From sashak at voltaire.com Mon Dec 11 07:28:01 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 11 Dec 2006 17:28:01 +0200 Subject: [openib-general] cannot clone librdmacm In-Reply-To: References: <20061210221805.GD21155@sashak.voltaire.com> <20061210231004.GK9205@mellanox.co.il> <20061210232844.GA32199@sashak.voltaire.com> Message-ID: <20061211152801.GC465@sashak.voltaire.com> On 21:27 Sun 10 Dec , Roland Dreier wrote: > > Right, and everyone can enable this, if _he_ wants. > > I think the point is that in the OFA environment, there's no obvious > reason to disable the hook, since without the hook http:// transport > is broken. > > So it makes sense to help people who aren't necessarily git experts, IMO maintaining a public git repository requires some minimal experience anyway and I don't think that hiding such details under default settings is so helpful - "default" git defaults are a reasonable starting point. And I guess running 'chmod +x hooks/post-update' (after such stuff as running git-init-db, editing description, pushing whole history, etc...) is not a big issue. If I'm wrong about it and it is, I'm ready to help anyone who needs such help. > and pick a default that makes things work smoothly. Experts can > disable the hook if there's some reason to do so (although to be > honest I don't see any reason). I may not see any reason either, but this should not be my decision (or such an aggressive "suggestion" as a hook enabled by default). BTW we likely will want to set up email notification hooks as well, this can be more "complicated" than just 'chmod +x'. I guess we will not want to prepare an executable hook template with predefined email addresses...
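For readers who have not done this before, the manual step discussed above amounts to something like the sketch below. The helper name enable_http_hook is made up for illustration; the hook body is the usual one-liner that runs git update-server-info, which refreshes the files that http:// clients read.

```shell
#!/bin/sh
# Hypothetical helper: enable the post-update hook in a repository so that
# the dumb http:// transport keeps working after each push.
# $1 = path to the repository (the directory that contains hooks/).
enable_http_hook() {
    hook="$1/hooks/post-update"
    mkdir -p "$1/hooks" || return 1
    if [ ! -f "$hook" ]; then
        # Only written if no hook file exists yet; an existing hook is
        # left as it is and merely made executable.
        printf '%s\n' '#!/bin/sh' 'exec git update-server-info' > "$hook"
    fi
    chmod +x "$hook"
}
```

Whether a step like this runs by default for every new repository on the server or is left to each maintainer is exactly the policy question in this thread; the sketch takes no side on it.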
Sasha From jlentini at netapp.com Mon Dec 11 07:23:11 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 11 Dec 2006 10:23:11 -0500 (EST) Subject: [openib-general] nfsrdma server stop responding, In-Reply-To: <4579C6C3.5090207@mellanox.com> References: <4579C6C3.5090207@mellanox.com> Message-ID: A couple of questions Vu: What NFS-RDMA release are you using? This looks like release 7. Is this reproducible? What kernel version are you using? What hardware is this on? It looks like x86-64 to me, which is fine. I just want to be sure I know what I'm looking at. As many specifics as possible is good (number of CPUs, hyperthreading, etc.) Could you send the output of objdump -Slr /path/to/kernel/mm/swap.o Actually, just the put_page disassembly is all I want to see. Is there any more text available? Usually there is an explanation given for an oops message (e.g. "Unable to handle kernel paging request.."). I opened a bug at the NFS-RDMA SourceForge project to track this: http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 Thanks for reporting this. 
james On Fri, 8 Dec 2006, Vu Pham wrote: > Hi James, > I got these errors in server's /var/log/messages and then the server stop > responding to login, I/O...; however, the server is still up, ipoib is still > working > > > Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] > [] put_page+0x17/0x40 > Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: 00010246 > Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001 > RCX: 000000000003ffff > Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001 > RDI: ffff8102274e92f8 > Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034 > R09: 0000000000000000 > Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000 > R12: ffff81020ef96800 > Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000 > R15: ffff8102053ee890 > Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) > GS:ffff81022066eb40(0000) knlGS:0000000000000000 > Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 000000008005003b > Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000 > CR4: 00000000000006e0 > Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo > ffff810219dde000, task ffff81020d87f0c0) > Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 ffff81020ef96968 > ffff81020ef96800 ffff81020ef96958 > Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 > ffffffff80424e05 0000000000000000 > Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 > ffffffff80239b90 ffff81020d87f0c0 > Dec 8 06:38:21 ibd201 kernel: Call Trace: > Dec 8 06:38:21 ibd201 kernel: [] > :sunrpc:svc_rdma_put_context+0x37/0xd0 > Dec 8 06:38:21 ibd201 kernel: [] > :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 > Dec 8 06:38:21 ibd201 kernel: [] > schedule_timeout+0x95/0xb0 > Dec 8 06:38:21 ibd201 kernel: [] process_timeout+0x0/0x10 > Dec 8 06:38:21 ibd201 kernel: [] > wait_for_completion_timeout+0xcd/0x150 > Dec 8 06:38:21 
ibd201 kernel: [] > default_wake_function+0x0/0x10 > Dec 8 06:38:21 ibd201 kernel: [] > :ib_mthca:mthca_cmd_post+0x232/0x260 > Dec 8 06:38:21 ibd201 kernel: [] > default_wake_function+0x0/0x10 > Dec 8 06:38:21 ibd201 kernel: [] __next_cpu+0x19/0x30 > Dec 8 06:38:21 ibd201 kernel: [] > find_busiest_group+0x24e/0x6d0 > Dec 8 06:38:21 ibd201 kernel: [] thread_return+0x0/0xde > Dec 8 06:38:21 ibd201 kernel: [] > _spin_unlock_irqrestore+0x8/0x10 > Dec 8 06:38:21 ibd201 kernel: [] > try_to_del_timer_sync+0x51/0x60 > Dec 8 06:38:21 ibd201 kernel: [] del_timer_sync+0xc/0x20 > Dec 8 06:38:21 ibd201 kernel: [] > schedule_timeout+0x95/0xb0 > Dec 8 06:38:21 ibd201 kernel: [] > :sunrpc:svc_recv+0x416/0x510 > Dec 8 06:38:21 ibd201 kernel: [] > default_wake_function+0x0/0x10 > Dec 8 06:38:21 ibd201 kernel: [] > default_wake_function+0x0/0x10 > Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 > Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x111/0x380 > Dec 8 06:38:21 ibd201 kernel: [] child_rip+0xa/0x12 > Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 > Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 > Dec 8 06:38:21 ibd201 kernel: [] child_rip+0x0/0x12 > Dec 8 06:38:21 ibd201 kernel: > Dec 8 06:38:21 ibd201 kernel: > Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08 > 0f 94 c0 84 c0 74 > Dec 8 06:38:21 ibd201 kernel: RIP [] put_page+0x17/0x40 > Dec 8 06:38:21 ibd201 kernel: RSP > > -vu > From mst at mellanox.co.il Mon Dec 11 07:25:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 17:25:52 +0200 Subject: [openib-general] cannot clone librdmacm In-Reply-To: <20061211152801.GC465@sashak.voltaire.com> References: <20061211152801.GC465@sashak.voltaire.com> Message-ID: <20061211152552.GD15870@mellanox.co.il> > On 21:27 Sun 10 Dec , Roland Dreier wrote: > > > Right, and everyone can enable this, if _he_ wants. 
> > > > I think the point is that in the OFA environment, there's no obvious > > reason to disable the hook, since without the hook http:// transport > > is broken. > > > > So it makes sense to help people who aren't necessarily git experts, > > IMO the maintaining the public git repository requires some minimal > experience anyway and I don't think that hiding such details under > default settings is so helpful - "default" git defaults are reasonable > start point. They are, for repositories not exposed with http. Since all repositories on staging are exposed with http, the default is wrong in that case and need to be fixed. Agree? -- MST From swise at opengridcomputing.com Mon Dec 11 07:36:29 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 11 Dec 2006 09:36:29 -0600 Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver In-Reply-To: References: <20061210223244.27166.36192.stgit@dell3.ogc.int> Message-ID: <1165851389.13419.3.camel@stevo-desktop> On Sun, 2006-12-10 at 20:02 -0800, Roland Dreier wrote: > I haven't seen any evidence of the corresponding ethernet NIC driver > being merged for 2.6.20 (which is a prerequisite, right). > > What's the status of that? > It is on its third or fourth round of review. The last driver posted on 12/7, was merged up to linus's latest tree probably as of 12/7. I know the comments set it was against 2.6.19, but it was really linus's latest. Divy, can you expand on this? Steve. From mst at mellanox.co.il Mon Dec 11 07:39:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 17:39:10 +0200 Subject: [openib-general] [PATCH untested] mthca: speed up memory registrations Message-ID: <20061211153910.GE15870@mellanox.co.il> Speed up memory registration by filling in MTTs directly. This reduces the number of FW commands needed to register an MR by at least a factor of 2. This applies to all memfree cards, and to tavor mode on 64 bit systems with the patch I posted earlier. 
Signed-off-by: Michael S. Tsirkin --- Roland, I'm posting this untested patch to get style comments out of the way early while I'm testing it. Note that this is *not* FMR - this is strictly compliant IB memory registration since MPTs are still updated using FW command. Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h @@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); +int mthca_write_mtt_size(struct mthca_dev *dev); + struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size); void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt); int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c @@ -244,8 +244,8 @@ void mthca_free_mtt(struct mthca_dev *de kfree(mtt); } -int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, - int start_index, u64 *buffer_list, int list_len) +static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) { struct mthca_mailbox *mailbox; __be64 *mtt_entry; @@ -296,6 +296,84 @@ out: return err; } +void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + u64 __iomem *mtts; + u32 mtt_seg; + int i; + + mtt_seg = mtt->first_seg * MTHCA_MTT_SEG_SIZE; + mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64); + for (i = 0; i < list_len; ++i) { + __be64 mtt_entry =
cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + mthca_write64_raw(mtt_entry, mtts + i); + } +} + +void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + __be64 *mtts; + int i; + int s = start_index * sizeof (u64); + + /* For Arbel, all MTTs must fit in the same page. */ + BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64)) / PAGE_SIZE); + /* Require full segments */ + BUG_ON(s % MTHCA_MTT_SEG_SIZE); + + mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg + + s / MTHCA_MTT_SEG_SIZE); + + BUG_ON(!mtts); + + for (i = 0; i < list_len; ++i) + mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT); +} + +int mthca_write_mtt_size(struct mthca_dev *dev) +{ + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + /* + * Be friendly to WRITE_MTT command + * and leave two empty slots for the + * index and reserved fields of the + * mailbox. + */ + return PAGE_SIZE / sizeof (u64) - 2; + + /* For Arbel, all MTTs must fit in the same page. */ + return mthca_is_memfree(dev) ? 
(PAGE_SIZE / sizeof (u64)) : 0x7ffffff; +} + +int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + int size = mthca_write_mtt_size(dev); + int chunk; + + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); + + while (list_len > 0) { + chunk = min(size, list_len); + if (mthca_is_memfree(dev)) + mthca_arbel_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + else + mthca_tavor_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + + list_len -= chunk; + start_index += chunk; + buffer_list += chunk; + } + + return 0; +} + static inline u32 tavor_hw_index_to_key(u32 ind) { return ind; Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s int shift, n, len; int i, j, k; int err = 0; + int write_mtt_size; shift = ffs(region->page_size) - 1; @@ -1040,6 +1041,8 @@ static struct ib_mr *mthca_reg_user_mr(s i = n = 0; + write_mtt_size = min(mthca_write_mtt_size(dev), PAGE_SIZE / sizeof *pages); + list_for_each_entry(chunk, &region->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; @@ -1047,14 +1050,11 @@ static struct ib_mr *mthca_reg_user_mr(s pages[i++] = sg_dma_address(&chunk->page_list[j]) + region->page_size * k; /* - * Be friendly to WRITE_MTT command - * and leave two empty slots for the - * index and reserved fields of the - * mailbox. + * Be friendly to write_mtt and pass it chunks + * of appropriate size.
*/ - if (i == PAGE_SIZE / sizeof (u64) - 2) { - err = mthca_write_mtt(dev, mr->mtt, - n, pages, i); + if (i == write_mtt_size) { + err = mthca_write_mtt(dev, mr->mtt, n, pages, i); if (err) goto mtt_done; n += i; -- MST From sashak at voltaire.com Mon Dec 11 09:16:01 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 11 Dec 2006 19:16:01 +0200 Subject: [openib-general] cannot clone librdmacm In-Reply-To: <20061211152552.GD15870@mellanox.co.il> References: <20061211152801.GC465@sashak.voltaire.com> <20061211152552.GD15870@mellanox.co.il> Message-ID: <20061211171601.GG465@sashak.voltaire.com> On 17:25 Mon 11 Dec , Michael S. Tsirkin wrote: > > On 21:27 Sun 10 Dec , Roland Dreier wrote: > > > > Right, and everyone can enable this, if _he_ wants. > > > > > > I think the point is that in the OFA environment, there's no obvious > > > reason to disable the hook, since without the hook http:// transport > > > is broken. > > > > > > So it makes sense to help people who aren't necessarily git experts, > > > > IMO the maintaining the public git repository requires some minimal > > experience anyway and I don't think that hiding such details under > > default settings is so helpful - "default" git defaults are reasonable > > start point. > > They are, for repositories not exposed with http. > Since all repositories on staging are exposed with http, I don't know this about all repositories on staging (including yet not created ones, where default will affect). > the default is wrong in that case and need to be fixed. > Agree? No (and already tried to explain why). Sasha From sweitzen at cisco.com Mon Dec 11 09:19:45 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 11 Dec 2006 09:19:45 -0800 Subject: [openib-general] libsdp: RFC changing libsdp.conf location Message-ID: It's not clear to me. Are you changing the libsdp.conf location or not? Can you define "sanely"? 
Scott

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
> Sent: Monday, December 11, 2006 2:27 AM
> To: Michael S. Tsirkin
> Cc: Nimrod Gindi; OPENIB GENERAL
> Subject: Re: [openib-general] libsdp: RFC changing
> libsdp.conf location
>
> Hi Michael,
>
> Thanks. This proposal is simple and clear to me.
> Let's wait a day and see if anybody else has other ideas.
>
> Thanks
>
> Eitan
>
> Michael S. Tsirkin wrote:
> >> BTW: libsdp.conf used to be overwritten in previous install.
> >> I have fixed the makefile to avoid that and instead create a
> >> new file with install date under the same directory.
> >>
> >
> > Here's a simple proposal that will address this issue:
> > - Make libsdp behave sanely when no libsdp.conf file is present.
> > Do not install anything in default location in make install.
> >
> > - in make install, copy the example configuration file into
> > libsdp.conf.example. Add a line to the top of it saying
> > "rename this file to libsdp.conf to make libsdp use it".
> >
> >
> >
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From mst at mellanox.co.il Mon Dec 11 09:34:59 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 19:34:59 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061211171601.GG465@sashak.voltaire.com>
References: <20061211152801.GC465@sashak.voltaire.com>
 <20061211152552.GD15870@mellanox.co.il>
 <20061211171601.GG465@sashak.voltaire.com>
Message-ID: <20061211173459.GC20344@mellanox.co.il>

> > the default is wrong in that case and need to be fixed.
> > Agree?
>
> No (and already tried to explain why).

Can't say I get it.
-- MST From tziporet at dev.mellanox.co.il Mon Dec 11 09:40:30 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 11 Dec 2006 19:40:30 +0200 Subject: [openib-general] cannot clone librdmacm In-Reply-To: <20061211173459.GC20344@mellanox.co.il> References: <20061211152801.GC465@sashak.voltaire.com> <20061211152552.GD15870@mellanox.co.il> <20061211171601.GG465@sashak.voltaire.com> <20061211173459.GC20344@mellanox.co.il> Message-ID: <457D980E.7030803@dev.mellanox.co.il> Michael S. Tsirkin wrote: >>> the default is wrong in that case and need to be fixed. >>> Agree? >>> >> No (and already tried to explain why). >> > > Can't say I get it. > > Sasha - look at Roland's reply Not clear why are you against something that will make the work of all users easier. Tziporet From robert.j.woodruff at intel.com Mon Dec 11 09:45:14 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 11 Dec 2006 09:45:14 -0800 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories Message-ID: Steve Wrote, >NOTE: This git tree is backed against Linus's tree and I merged up to >his latest on 12/8. So it's past 2.6.19 and now depends on changes that >are post 2.6.19 (the workqueue changes). >Steve. Do you plan on only supporting the Chelsio driver for 2.6.20+ kernels or do plan on providing backport patches for earlier kernels ? woody From rdreier at cisco.com Mon Dec 11 09:45:26 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 11 Dec 2006 09:45:26 -0800 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061211082410.GB29276@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 11 Dec 2006 10:24:10 +0200") References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> <20061210134137.GL29174@mellanox.co.il> <20061211082410.GB29276@mellanox.co.il> Message-ID: > BTW, Roland, could you give me some indication on whether this > has a chance getting into 2.6.20? 
If yes I'll stop writing new code > and focus on polishing this. I think we could probably merge it but maybe it's better to put it in -mm for a cycle given that it's new and not too many people have looked at it yet. And I still haven't gotten comfortable with the way CM is enabled. - R. From ralph.campbel at qlogic.com Mon Dec 11 10:02:26 2006 From: ralph.campbel at qlogic.com (Ralph Campbell) Date: Mon, 11 Dec 2006 10:02:26 -0800 (PST) Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <457BD18D.7000403@voltaire.com> References: <1165517253.14800.283.camel@brick.pathscale.com> <457BD18D.7000403@voltaire.com> Message-ID: <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> > Ralph Campbell wrote: >> This version of the patch fixes ipath_sg_dma_address() and >> updates the comments for ipath_dma.c as Or Gerlitz >> suggested. > >> This patch implements the interposing DMA mapping functions to allow >> support for IOMMUs and remove the dependence on phys_to_virt() and >> bus_to_virt(). > > Ralph, > > The patch seems ready modulo the resolution of whether you implement the > addresses returned by the ipath ib_dma_map_xxx code as keys into a SW > IOTLB (which means you return dma_address_t and not u64 but assign it > ipath semantics) or choose a different path to follow (ie assume the > problem exists only under the unsupported by ipath 32bit / high-mem > config, do nothing, etc) - what ever you set with Roland. > > Or. I would like to see this last set of patches integrated as is. I would like to get more experience with the current implementation before extending it to support other configurations. From mst at mellanox.co.il Mon Dec 11 10:07:46 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 11 Dec 2006 20:07:46 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061211180746.GD20344@mellanox.co.il> > > BTW, Roland, could you give me some indication on whether this > > has a chance getting into 2.6.20? If yes I'll stop writing new code > > and focus on polishing this. > > I think we could probably merge it but maybe it's better to put it in > -mm for a cycle given that it's new and not too many people have > looked at it yet. Hmm. People here in openib community don't seem to look at, or run -mm kernels, so I don't think this will buy us much - it'll just create work for me. No? > And I still haven't gotten comfortable with the way CM is enabled. Are you still worried someone might turn it on by default? I'm actively looking at fixing multicast - it's just unlikely to be ready this week. Enabling logic is a small part of the code - maybe code can be merged, and enabling tweaked post -rc1? -- MST From mst at mellanox.co.il Mon Dec 11 10:14:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 20:14:29 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061211180746.GD20344@mellanox.co.il> References: <20061211180746.GD20344@mellanox.co.il> Message-ID: <20061211181429.GE20344@mellanox.co.il> > > > BTW, Roland, could you give me some indication on whether this > > > has a chance getting into 2.6.20? If yes I'll stop writing new code > > > and focus on polishing this. > > > > I think we could probably merge it but maybe it's better to put it in > > -mm for a cycle given that it's new and not too many people have > > looked at it yet. > > Hmm. People here in openib community don't seem to look at, or run -mm kernels, so > I don't think this will buy us much - it'll just create work for me. > > No? And it's not like it's such a lot of code, either, is it? 
So fixing it up even in major ways will be possible later in RC cycle. -- MST From mshefty at ichips.intel.com Mon Dec 11 09:54:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 11 Dec 2006 09:54:18 -0800 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <457BDF15.6090608@voltaire.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <457BDF15.6090608@voltaire.com> Message-ID: <457D9B4A.6010507@ichips.intel.com> > patch made over your path, can you please queue this somewhere so it > will not be forgotten? Can you just send a signed-off-by line? I'll add the patch to the librdmacm multicast branch. Thanks, - Sean From mshefty at ichips.intel.com Mon Dec 11 10:20:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 11 Dec 2006 10:20:29 -0800 Subject: [openib-general] [PATCH] - ucma updates for miscdev changes In-Reply-To: <1165788273.25243.8.camel@linux-q667.site> References: <1165788273.25243.8.camel@linux-q667.site> Message-ID: <457DA16D.3010604@ichips.intel.com> > As part of merging up to linus's tree as of 12/8/2006, I had to change > ucma.c to support changes in the miscdevice stuff. Below is a patch for > this. In addition to this change, I had to merge your ucma patches to > get them to apply. Nothing functional changed, I don't think, but some > of the changes in your tree are already in linus's tree, so those > patches were ignored. And one didn't apply cleanly and I had to fix it > manually. > > You can see these changes including the patch below as a single patch in > git://staging.openfabrics.org/~swise/cxgb3.git commit number: > d1ac2e74680d61a5e87165e1c6b4cec44533f2bd. Thanks - I'll take a look at this. My intention is follow the same process that we had been following and keep my tree in sync with the latest kernel release only, unless I need a more updated branch. 
- Sean From sashak at voltaire.com Mon Dec 11 10:40:19 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 11 Dec 2006 20:40:19 +0200 Subject: [openib-general] cannot clone librdmacm In-Reply-To: <457D980E.7030803@dev.mellanox.co.il> References: <20061211152801.GC465@sashak.voltaire.com> <20061211152552.GD15870@mellanox.co.il> <20061211171601.GG465@sashak.voltaire.com> <20061211173459.GC20344@mellanox.co.il> <457D980E.7030803@dev.mellanox.co.il> Message-ID: <20061211184019.GK465@sashak.voltaire.com> On 19:40 Mon 11 Dec , Tziporet Koren wrote: > Michael S. Tsirkin wrote: > >>>the default is wrong in that case and need to be fixed. > >>>Agree? > >>> > >>No (and already tried to explain why). > >> > > > >Can't say I get it. > > > > > Sasha - look at Roland's reply > Not clear why are you against something that will make the work of all > users easier. I'm absolutely not against something that will make the work of all users easier. I'm just against this specific thing - to make executable default post-update hook template. Because I don't see this as significant improvement for users, but OTOH it seems for me as sort of user's repository ownership violation. As user I would prefer to not have such "surprises" as hooks executed by default. Sasha From swise at opengridcomputing.com Mon Dec 11 10:34:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 11 Dec 2006 12:34:28 -0600 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: References: Message-ID: <1165862068.4020.14.camel@stevo-desktop> On Mon, 2006-12-11 at 09:45 -0800, Woodruff, Robert J wrote: > Steve Wrote, > >NOTE: This git tree is backed against Linus's tree and I merged up to > >his latest on 12/8. So it's past 2.6.19 and now depends on changes > that > >are post 2.6.19 (the workqueue changes). > > >Steve. > > Do you plan on only supporting the Chelsio driver for 2.6.20+ kernels > or do plan on providing backport patches for earlier kernels ? 
>

I was really hoping it would work for both kernels, but now with the
workqueue changes, I'll have to think about a 2.6.19 patch. However, my
top priority is getting this tested and into kernel.org...

Steve.

From robert.j.woodruff at intel.com Mon Dec 11 10:52:13 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 10:52:13 -0800
Subject: [openib-general] [ANNOUNCE]OFA Component Early Integration Test Tree
Message-ID:

At SC'06 developer's summit in Tampa, we had some discussion of having an
early integration-test tree (kind of like the MM tree) for early testing
of new infiniband components. I have started to look at putting together
such a tree that contains Sean's latest uCMA, local sa cache, multicast
code, and the IPoIB_CM code, at least for my own testing.

If others are interested in trying this stuff out before it gets into
OFED or the kernel.org tree, they can clone my ofa_integration tree at,

~woody/scm/ofa-integration

The merged code is in the integration-test branch.

My merge script also creates a single patch from the integration-test
branch using

git-diff linux-2.6.19 integration-test > ./infiniband-ofa-mmddyy-for-linux-2.6.19.patch

that people could take and just apply against a stock 2.6.19 kernel.
These are in my top level directory, ~woody

Not sure if these patches would be useful to anyone else, but if they
are, we could figure out a way to publish them for use by the greater
community, or people could always just generate the patch themselves
using git-diff.

So far I have tested the
~woody/infiniband-ofa-1207006-for-linux-2.6.19.patch
which contains Sean's local_sa cache, uCMA, and the IPoIB_CM code.

The ~woody/infiniband-ofa-1211006-for-linux-2.6.19.patch
has the local_sa cache, uCMA, IPoIB_CM, and Sean's multicast code.
I just created this one, the merge went ok, but I have yet to test it.
If other people would like to add some new kernel code to this tree to
test with other components that are under development, I can add your
code. What I need to do this is for you to publish a git tree based on
2.6.19 (the release kernel) that has your code (and only your code) in a
branch, such that one could create a self contained patch that would
apply to a stock 2.6.19 kernel using git-diff.

e.g.,
git-diff linux-2.6.19 mybranch
would create a self contained patch of your code.

This should allow me to easily merge the code into my tree. If your git
tree contains changes to other core infiniband code or is based on
something other than the release 2.6.19 kernel.org kernel, I cannot use
it, since I cannot easily merge it with the other experimental
components that are based on 2.6.19.

If people think that this would be of value to the wider community, I
will add something to the wiki to explain how to get the tree and the
userspace code that matches this kernel code.

woody

From venkatesh.babu at 3leafnetworks.com Mon Dec 11 11:03:28 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 11 Dec 2006 11:03:28 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165672098.26559.43885.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
 <4579F8E6.3040604@3leafnetworks.com>
 <1165622233.26559.8108.camel@hal.voltaire.com>
 <457A0389.7030103@3leafnetworks.com>
 <1165625283.26559.10270.camel@hal.voltaire.com>
 <457A0B62.2060501@3leafnetworks.com>
 <1165628315.26559.12385.camel@hal.voltaire.com>
 <457A1E90.5040606@3leafnetworks.com>
 <1165666352.26559.39788.camel@hal.voltaire.com>
 <1165672098.26559.43885.camel@hal.voltaire.com>
Message-ID: <457DAB80.7010501@3leafnetworks.com>

Yes, the problem is noticed on port 1 also. It is random. Sometimes with
port 1 and sometimes with port 2. I will try with only one "port 1"
subnet.

VBabu

Hal Rosenstock wrote:

>On Sat, 2006-12-09 at 07:12, Hal Rosenstock wrote:
>
>
>>One more thing:
>>
>>When you upgraded to OFED 1.2, did you build and install the management
>>libraries (libibcommon, libibumad are important here and libibmad for
>>diags) ?
>>
>>
>
>Does the problem always occur on the "second" subnet (port 2's subnet)
>or does it ever occur on port 1's subnet ?
>
>Can you totally not configure the "port 1" subnet on all machines (and
>OpenSM on the port 1's where that runs) and see if it is reproducible ?
>
>Thanks.
>
>-- Hal
>
>
>

From venkatesh.babu at 3leafnetworks.com Mon Dec 11 11:14:00 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 11 Dec 2006 11:14:00 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165666352.26559.39788.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
 <456B7CC8.5060806@3leafnetworks.com>
 <1164674885.11808.760.camel@hal.voltaire.com>
 <4579E333.4000901@3leafnetworks.com>
 <1165617878.26559.4952.camel@hal.voltaire.com>
 <4579F8E6.3040604@3leafnetworks.com>
 <1165622233.26559.8108.camel@hal.voltaire.com>
 <457A0389.7030103@3leafnetworks.com>
 <1165625283.26559.10270.camel@hal.voltaire.com>
 <457A0B62.2060501@3leafnetworks.com>
 <1165628315.26559.12385.camel@hal.voltaire.com>
 <457A1E90.5040606@3leafnetworks.com>
 <1165666352.26559.39788.camel@hal.voltaire.com>
Message-ID: <457DADF8.7010002@3leafnetworks.com>

Hal Rosenstock wrote:

>I was interested in the one on Node1 when it appeared to be trying to
>exit (which it shouldn't be but is) and the other threads don't seem to
>terminate.
>
>
Let me see if I can reproduce it again. First thing I will capture the
core file, so that it can be investigated later.

>
>
>> How do I find out the thread_state value ?
>>
>>
>
>It's a variable in the SM structure (in the SM thread).
>
>
I found this variable in osm_vl15intf.h:osm_vl15_t. I will get this
thread_state value next time.

>One more thing:
>
>When you upgraded to OFED 1.2, did you build and install the management
>libraries (libibcommon, libibumad are important here and libibmad for
>diags) ?
>
>
I upgraded from OFED 1.0 to OFED 1.1 (not OFED 1.2). I built all these
libraries and installed them.

VBabu

From robert.j.woodruff at intel.com Mon Dec 11 11:11:27 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 11:11:27 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID:

Steve Wrote.
>I was really hoping it would work for both kernels, but now with the >workqueue changes, I'll have to think about a 2.6.19 patch. However, my >top priority is getting this tested and into kernel.org... >Steve. Understood. I would like to try to include this driver in my OFA early integration-test tree, but to do so, I would need you to publish a branch based on 2.6.19, not linus's tree, since all of the rest of the components are based on 2.6.19, rather than linus's current tree. If you want this code in my early integration test tree so that others in the OFA community can give it a try before it goes to kernel.org, I would be willing to try to include it. If you don't have the time right now, I understand. From swise at opengridcomputing.com Mon Dec 11 11:13:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 11 Dec 2006 13:13:06 -0600 Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories In-Reply-To: References: Message-ID: <1165864386.6867.2.camel@stevo-desktop> On Mon, 2006-12-11 at 09:45 -0800, Woodruff, Robert J wrote: > Steve Wrote, > >NOTE: This git tree is backed against Linus's tree and I merged up to > >his latest on 12/8. So it's past 2.6.19 and now depends on changes > that > >are post 2.6.19 (the workqueue changes). > > >Steve. > > Do you plan on only supporting the Chelsio driver for 2.6.20+ kernels > or do plan on providing backport patches for earlier kernels ? > > woody Hey Roland, is there a preferred way to handle this? IE whats the best was of keeping a 2.6.19 based patch set while also trying to merge your patches into the latest from linus's tree? I guess I can create a branch with a HEAD at 2.6.19 and back-port my latest patch set. Is that the best way? Maybe a for-ofed branch? Steve. 
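
The workflow discussed in this thread (a branch that starts exactly at the release and yields one self-contained patch) could be sketched roughly as follows. This is only an illustration under stated assumptions: a throwaway repository stands in for the kernel tree, the tag `base-release` stands in for the 2.6.19 release tag, and `driver.c` is a made-up file. Only the branch name `for-ofed` comes from the message above. It uses the current `git diff` spelling rather than the dashed `git-diff` form quoted in the thread.

```shell
#!/bin/sh
# Sketch only: a scratch repository stands in for a kernel tree, and the
# tag "base-release" is a hypothetical stand-in for the release tag.
set -e

work=$(mktemp -d)
cd "$work"
git init -q tree
cd tree
git config user.email "dev@example.invalid"
git config user.name "Example Developer"

# Stand-in for the stock release tree.
echo "stock driver" > driver.c
git add driver.c
git commit -q -m "stock release"
git tag base-release

# Branch that starts at the release tag and carries only the new code.
git checkout -q -b for-ofed base-release
echo "new feature" >> driver.c
git commit -q -a -m "add feature"

# Self-contained patch: everything on for-ofed relative to the release.
git diff base-release for-ofed > "$work/feature.patch"

# Confirm the patch applies to a pristine checkout of the release.
git checkout -q base-release
git apply --check "$work/feature.patch"
echo "patch applies cleanly to base-release"
```

Keeping the branch rooted at the release tag is what makes the resulting diff independent of any other in-flight changes, which appears to be the property asked for above.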
From robert.j.woodruff at intel.com Mon Dec 11 11:19:00 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 11 Dec 2006 11:19:00 -0800 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support Message-ID: Michael wrote, >> BTW, Roland, could you give me some indication on whether this >> has a chance getting into 2.6.20? If yes I'll stop writing new code >> and focus on polishing this. >I think we could probably merge it but maybe it's better to put it in >-mm for a cycle given that it's new and not too many people have >looked at it yet. And I still haven't gotten comfortable with the way >CM is enabled. >- R. I think it might be good for others in the OFA community to try this out before we decide it is ready for the kernel. I tried it out over the weekend, running Intel MPI over IPoIB_CM, and with default MTU settings, it did not cause any problems on my small 2 node cluster. Might be good however for someone to load this up on a larger cluster and test it. I did notice that unless I made the MTU really big (16K), there was not much benefit (if any) with the default MTU size. I also noticed that when I set the MTU to 16K and ran some stressful MPI tests, that my system seemed to get un-responsive like IPoIB was taking up too much kernel memory. Thus, I think it best for others to play with this a bit before it is submitted upstream. 
my 2 cents,
woody

From divy at chelsio.com Mon Dec 11 11:25:00 2006
From: divy at chelsio.com (Divy Le Ray)
Date: Mon, 11 Dec 2006 11:25:00 -0800
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <1165851389.13419.3.camel@stevo-desktop>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
 <1165851389.13419.3.camel@stevo-desktop>
Message-ID: <457DB08C.8070709@chelsio.com>

Steve Wise wrote:
> On Sun, 2006-12-10 at 20:02 -0800, Roland Dreier wrote:
>
>> I haven't seen any evidence of the corresponding ethernet NIC driver
>> being merged for 2.6.20 (which is a prerequisite, right).
>>
>> What's the status of that?
>>
>>
>
> It is on its third or fourth round of review. The last driver posted on
> 12/7, was merged up to linus's latest tree probably as of 12/7. I know
> the comments said it was against 2.6.19, but it was really linus's
> latest.
>
> Divy, can you expand on this?
>
Steve, the patch for the Chelsio T3 driver was posted against Linus'
tree indeed.

-bash-3.00$ cat .git/refs/heads/origin
0215ffb08ce99e2bb59eca114a99499a4d06e704

It incorporated Stephen's feedback.
The comments I received since then concern minor coding style glitches.
I will fix them, the driver functionality should remain unchanged
however.

Cheers,
Divy
>
> Steve.
>
>

From robert.j.woodruff at intel.com Mon Dec 11 11:40:35 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 11:40:35 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
Message-ID:

Woody wrote,
>I also noticed that when I set the MTU to 16K and ran some stressful MPI
>tests,
>that my system seemed to get un-responsive like IPoIB was taking up too
>much
>kernel memory.

Correction, I saw the strange behavior when I had the MTU set to 64K,
not 16K MTU,
and I cannot be sure that it was IPoIB_CM that was causing the problem,
so I think it would be good for others to give this some airtime and
report
their experiences to the list.
woody From mst at mellanox.co.il Mon Dec 11 11:41:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 21:41:11 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061211194111.GB27010@mellanox.co.il> > >> BTW, Roland, could you give me some indication on whether this > >> has a chance getting into 2.6.20? If yes I'll stop writing new code > >> and focus on polishing this. > > >I think we could probably merge it but maybe it's better to put it in > >-mm for a cycle given that it's new and not too many people have > >looked at it yet. And I still haven't gotten comfortable with the way > >CM is enabled. > > >- R. > > I think it might be good for others in the OFA community to try this out before > we decide it is ready for the kernel. I tried it out over the weekend, running > Intel MPI over IPoIB_CM, and with default MTU settings, it did not cause any > problems on my small 2 node cluster. Might be good however for someone to load > this up on a larger cluster and test it. IMO, we have after -rc1 to fix any bugs. The feature *is* marked experimental after all, and have 0 impact on code when disabled at compile time. So if you want rock-stable, just turn it off. > I did notice that unless I made the MTU > really big (16K), there was not much benefit (if any) with the default MTU size. Right. My observation too. The whole point of IPoIB CM is to enable high MTU values. 64K is what works really well. > I also noticed that when I set the MTU to 16K and ran some stressful MPI tests, > that my system seemed to get un-responsive like IPoIB was taking up too much > kernel memory. Could you enable debug and try again? Maybe you have send errors. 
My guess would be you are getting RQ underruns and QPs are getting closed and reopened (and if DREQs are lost for some reason, which shouldn't happen on back to back but seems to due to some issue in our MAD layer, we could be getting stale connections which aren't currently cleaned up - it's on my TODO). I have a couple of ideas on how to fix it better - e.g. detect RNR NACK and cycle the QP through RTS/INIT/RTR/RTS - but the simplest workaround for now would be just to have a high MTU or increase the RX ring size via IPoIB module option. Can you try this too, and let me know? > Thus, I think it best for others to play with this a bit before > it is submitted upstream. > > my 2 cents, > woody I don't know, really - it's an option after all. Given that it doesn't cause problems for people that don't enable it, keeping code out of kernel until it's totally robust seems wrong - instead of debugging/fixing issues I'll have to spend time keeping the code up to date with upstream. -- MST From mst at mellanox.co.il Mon Dec 11 11:44:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 21:44:43 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: References: Message-ID: <20061211194443.GC27010@mellanox.co.il> > >I also noticed that when I set the MTU to 16K and ran some stressful MPI tests, > >that my system seemed to get un-responsive like IPoIB was taking up too > >much kernel memory. > > Correction, I saw the strange behavior when I had the MPU set to 64K, > not 16K MTU, > and I cannot be sure that it was IPoIB_CM that was causing the problem, > so I think it would be good for others to give this some airtime and > report > their experiences to the list. That's the setup I'm mostly testing at I haven't seen this yet. Are you running this together with Sean's multicast patches and the sa cache? Are you seeing something in the log? What about when you set debug_level to 1? Does increasing the RQ size help? 
-- 
MST

From divy at chelsio.com Mon Dec 11 11:49:53 2006
From: divy at chelsio.com (Divy Le Ray)
Date: Mon, 11 Dec 2006 11:49:53 -0800
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <457DB08C.8070709@chelsio.com>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
 <1165851389.13419.3.camel@stevo-desktop>
 <457DB08C.8070709@chelsio.com>
Message-ID: <457DB661.6060102@chelsio.com>

Divy Le Ray wrote:
> Steve Wise wrote:
>> On Sun, 2006-12-10 at 20:02 -0800, Roland Dreier wrote:
>>
>>> I haven't seen any evidence of the corresponding ethernet NIC driver
>>> being merged for 2.6.20 (which is a prerequisite, right).
>>>
>>> What's the status of that?
>>>
>>>
>>
>> It is on its third or fourth round of review. The last driver posted on
>> 12/7, was merged up to linus's latest tree probably as of 12/7. I know
>> the comments said it was against 2.6.19, but it was really linus's
>> latest.
>>
>> Divy, can you expand on this?
>>
> Steve, the patch for the Chelsio T3 driver was posted against
> Linus' tree indeed.
>
> -bash-3.00$ cat .git/refs/heads/origin
> 0215ffb08ce99e2bb59eca114a99499a4d06e704

I meant
-bash-3.00$ cat .git/refs/heads/master
9eba2b0ba067ce9745e575e5ea2e97a5d7d61bef

>
> It incorporated Stephen's feedback.
> The comments I received since then concern minor coding style glitches.
> I will fix them, the driver functionality should remain unchanged
> however.
>
> Cheers,
> Divy
>
>
>>
>> Steve.
>> >> > > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html From eitan at mellanox.co.il Mon Dec 11 12:31:17 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 11 Dec 2006 22:31:17 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: References: Message-ID: <457DC015.9050207@mellanox.co.il> Hi Scott, Scott Weitzenkamp (sweitzen) wrote: > It's not clear to me. > > Are you changing the libsdp.conf location or not? > Currently the only feedback I got say that I need not install libsdp.conf at all. I only need to install an example somewhere (I do not know where - maybe docs?) Instead I am going to change the default libsdp behavior to that of the default config. Do you have some insight into this issue? Any preferences? Thanks EZ > Can you define "sanely"? > > Scott > > >> -----Original Message----- >> From: openib-general-bounces at openib.org >> [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi >> Sent: Monday, December 11, 2006 2:27 AM >> To: Michael S. Tsirkin >> Cc: Nimrod Gindi; OPENIB GENERAL >> Subject: Re: [openib-general] libsdp: RFC changing >> libsdp.conf location >> >> Hi Michael, >> >> Thanks. This proposal is simple and clear to me. >> Let's wait a day and see if anybody else have other ideas. >> >> Thanks >> >> Eitan >> >> Michael S. Tsirkin wrote: >> >>>> BTW: libsdp.conf used to be overwritten in previous install. >>>> I have fixed the nakefile to avoid that and instead create a >>>> new file with install date under the same directory. >>>> >>>> >>> Here's a simple proposal that will address this issue: >>> - Make libsdp behave sanely when not libsdp.conf file is present. >>> Do not install anything in default location in make install. >>> >>> - in make install, copy the example configuration file into >>> libsdp.conf.example. 
Add a line to the top of it saying >>> "rename this file to libsdp.conf to make libsdp use it". >>> >>> >>> >> >> > > From adit.262 at gmail.com Mon Dec 11 12:55:34 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Mon, 11 Dec 2006 15:55:34 -0500 Subject: [openib-general] Configuring Guest VMs to use Infiniband interfaces In-Reply-To: References: Message-ID: Hi, Has anyone worked with the xen-smartio repository? I was using it and had a few questions about the guest VM configuration: 1) Is there a special config line to assign an IB virtual interface to guests? If I say vif= [' '] in the guest config, the interface shows up as eth0 and not ib0 in the guest. I've changed the network script in network-bridge to use the IB interface (ib1) as the bridge. 2) How does Xen mux/demux over the IB interface? Does it use the same Ethernet bridging? If so, how does one get it to work?
Thanks, Adit Ranadive From sashak at voltaire.com Mon Dec 11 13:07:08 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 11 Dec 2006 23:07:08 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061210215033.GC21155@sashak.voltaire.com> References: <20061210215033.GC21155@sashak.voltaire.com> Message-ID: <20061211210708.GA25052@sashak.voltaire.com> On 23:50 Sun 10 Dec , Sasha Khapyorsky wrote: > Hi, > > Recently I found this OFA 'Userspace Git Trees' downloading howto: > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories > > and thought that we could make it simpler for end-user to choose the > "right" git tree just by adding one more series of symbolic links under > /pub/scm. This links will point to the maintainer's "official" trees, and > we will have only one such link per project. > > So typical downloading howto for end-users will looks like: > > git clone git://staging.openfabrics.org/dapl > git clone git://staging.openfabrics.org/ibutils > git clone git://staging.openfabrics.org/imgen > ... > > instead of > > git clone git://staging.openfabrics.org/~ardavis/dapl > git clone git://staging.openfabrics.org/~eitan/ibutils > git clone git://staging.openfabrics.org/~mst/imgen > ... > > as it is now. > > > To illustrate this I've added already couple of such symbolic links > under /pub/scm and it is visible now via gitweb: > > http://staging.openfabrics.org/git > > Comments, objections? Don't see many supporters up to now so I'm going to remove this "demo" soon. If anybody cares - this is the last call! 
Sasha From halr at voltaire.com Mon Dec 11 12:59:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Dec 2006 15:59:23 -0500 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457AC99E.8050402@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <1165617195.26559.4435.camel@hal.voltaire.com> <457AC99E.8050402@mellanox.co.il> Message-ID: <1165870759.21606.18477.camel@hal.voltaire.com> On Sat, 2006-12-09 at 09:35, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: > > > >> Hal Rosenstock wrote: > >> > >>> Hi Eitan, > >>> > >>> Just wanted to close the loop on the OpenSM issues of the last couple > >>> days. > >>> > >>> 1. When can you supply an OpenSM verbose log for the InformInfo > >>> subscribe problem you reported earlier today ? Failing that, I don't > >>> know how to reproduce this. > >>> > >>> > >> Attached > >> > I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure. Any idea on when you will have a chance to look into this ? > >>> 4. I encourage you to look at and comment on the OpenSM patches rather > >>> than waiting for them to be in the tree. > >>> > >>> > >> I am sure you did not mean to, but now I have to admit my limited skills > >> in catching bugs by reading patches :-( . > >> > > > > Not just read, but they are there to try out as well. > > > I will need an automatic flow for that sake. I can not keep up with the > amount of patches manually. > But I do not know how to automatically convert the mails into patches > into a tree. > > You could try out the patches and do the same thing before they are > > committed. > > > > > I have automation based on the committed tree that pull it (git trem) , > compile and run regression. > Actually this is how all other code is handled too. Are you referring to OFED ? 
In the case of OFED, where do those "special" trees/branches come from ? -- Hal From jlentini at netapp.com Mon Dec 11 13:09:47 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 11 Dec 2006 16:09:47 -0500 (EST) Subject: [openib-general] Configuring Guest VMs to use Infiniband interfaces In-Reply-To: References: Message-ID: On Mon, 11 Dec 2006, Adit Ranadive wrote: > Has anyone worked with the xen-smartio repository? Novell has made substantial improvements to the xen-smartio code. They made a presentation at the last workshop: http://openfabrics.org/conference/nov2006sc/xen-ib-presentation.pdf From robert.j.woodruff at intel.com Mon Dec 11 13:11:21 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Mon, 11 Dec 2006 13:11:21 -0800 Subject: [openib-general] userspace git trees Message-ID: Sasha wrote, >> Comments, objections? >Don't see many supporters up to now so I'm going to remove this "demo" >soon. If anybody cares - this is the last call! >Sasha I don't have any preference either way is fine. woody From sweitzen at cisco.com Mon Dec 11 13:42:36 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 11 Dec 2006 13:42:36 -0800 Subject: [openib-general] libsdp: RFC changing libsdp.conf location Message-ID: > > Are you changing the libsdp.conf location or not? > > > Currently the only feedback I got say that I need not install > libsdp.conf at all. > I only need to install an example somewhere (I do not know > where - maybe > docs?) > Instead I am going to change the default libsdp behavior to > that of the > default config. > > Do you have some insight into this issue? Any preferences? I strongly disagree with not installing libsdp.conf at all. On my RHEL4 system I count 57 /etc/*.conf files. Most of these I have never changed, yet they are useful references. I'm OK with leaving libsdp.conf in /usr/local/ofed/etc. How do other RPM packages with .conf file handle upgrading the .conf file? 
Scott From mst at mellanox.co.il Mon Dec 11 13:51:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 11 Dec 2006 23:51:41 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: References: Message-ID: <20061211215141.GB4235@mellanox.co.il> > I strongly disagree with not installing libsdp.conf at all. Just saying "I strongly disagree" does not make for a strong argument :) Why do you (strongly) want it installed if libsdp will work fine without, in a way identical to what it is doing with default libsdp.conf today? -- MST From rdreier at cisco.com Mon Dec 11 14:02:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 11 Dec 2006 14:02:34 -0800 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: (Scott Weitzenkamp's message of "Mon, 11 Dec 2006 13:42:36 -0800") References: Message-ID: > How do other RPM packages with .conf file handle upgrading the .conf > file? You mark the config file with %config or %config(noreplace) in the spec file. With %config, RPM will move the old config to .rpmsave (if the old config was edited) and with %config(noreplace), RPM will put the new config file in .rpmnew (if the old file was edited). I definitely think RPM packages should install sane defaults into their /etc/*.conf files. As a side note it doesn't make any sense to me for OFED RPMs to put stuff in /usr/local/ofed rather than the standard prefix. - R. From sweitzen at cisco.com Mon Dec 11 14:04:29 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 11 Dec 2006 14:04:29 -0800 Subject: [openib-general] libsdp: RFC changing libsdp.conf location Message-ID: > > I strongly disagree with not installing libsdp.conf at all. > > Just saying "I strongly disagree" does not make for a strong > argument :) > > Why do you (strongly) want it installed if libsdp will work > fine without, > in a way identical to what it is doing with default libsdp.conf today? 
On my RHEL4 system I count 57 /etc/*.conf files. Most of these I have never changed, yet they are useful references. This is more intuitive to me than having to guess or search for how to configure libsdp. We install libsdp.conf today, and I don't see any good reason to not keep doing so. Scott From mst at mellanox.co.il Mon Dec 11 14:03:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 12 Dec 2006 00:03:12 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: References: Message-ID: <20061211220312.GB8725@mellanox.co.il> > On my RHEL4 > system I count 57 /etc/*.conf files. Most of these I have never > changed, yet they are useful references. We can have a file named libsdp.conf.example, with the first line: # this is an example libsdp configuration file. # to make it active, rename it libsdp.conf: mv libsdp.conf.example libsdp.conf -- MST From sweitzen at cisco.com Mon Dec 11 14:07:42 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 11 Dec 2006 14:07:42 -0800 Subject: [openib-general] libsdp: RFC changing libsdp.conf location Message-ID: > > On my RHEL4 > > system I count 57 /etc/*.conf files. Most of these I have never > > changed, yet they are useful references. > > We can have a file named libsdp.conf.example, with the first line: > > # this is an example libsdp configuration file. > # to make it active, rename it libsdp.conf: mv > libsdp.conf.example libsdp.conf > > -- > MST > I think this is less useful than just having the .conf file there, and I only see one example of this on RHEL4. Scott From halr at voltaire.com Mon Dec 11 14:07:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 11 Dec 2006 17:07:06 -0500 Subject: [openib-general] openib-commits and git Message-ID: <1165874816.21606.21357.camel@hal.voltaire.com> Hi, Some have requested the equivalent of what we had with svn with openib-commits. The first question is what capabilities in this are desired.
We don't want to spend a lot of engineering time on this but it would be good to know. Is a notification of the commit/push with the log sufficient or does it need to look more what svn provided (and include the changes too) ? The other question is a policy one: Is it a reasonable default to enable this for all the developers ? Do any of the developers object to this policy ? -- Hal From sweitzen at cisco.com Mon Dec 11 14:20:55 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Mon, 11 Dec 2006 14:20:55 -0800 Subject: [openib-general] openib-commits and git Message-ID: I would like to see diffs, either inline in the commit email or via a URL I can click on. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > Sent: Monday, December 11, 2006 2:07 PM > To: openib-general at openib.org > Cc: OpenFabricsEWG > Subject: [openib-general] openib-commits and git > > Hi, > > Some have requested the equivalent of what we had with svn with > openib-commits. > > The first question is what capabilities in this are desired. We don't > want to spend a lot of engineering time on this but it would > be good to > know. Is a notification of the commit/push with the log sufficient or > does it need to look more what svn provided (and include the changes > too) ? > > The other question is a policy one: Is it a reasonable > default to enable > this for all the developers ? Do any of the developers object to this > policy ? 
> > -- Hal > > > > From adit.262 at gmail.com Mon Dec 11 14:24:21 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Mon, 11 Dec 2006 17:24:21 -0500 Subject: [openib-general] Configuring Guest VMs to use Infiniband interfaces In-Reply-To: References: Message-ID: Novell is planning those changes, but unfortunately the source tree at http://xenbits.xensource.com/ext/xen-smartio.hg is still about 8 months old. Also, are the Mellanox 25208 HCAs compatible with the 23208 ones? I know that the guest VMs use the HCA driver only for the 23208. Unfortunately I have the 25208 ones; will they still work in domU? Thanks, Adit On 12/11/06, James Lentini wrote: > > > On Mon, 11 Dec 2006, Adit Ranadive wrote: > > > Has anyone worked with the xen-smartio repository? > > Novell has made substantial improvements to the xen-smartio code. They > made a presentation at the last workshop: > > http://openfabrics.org/conference/nov2006sc/xen-ib-presentation.pdf > From jlentini at netapp.com Mon Dec 11 14:51:00 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 11 Dec 2006 17:51:00 -0500 (EST) Subject: [openib-general] Configuring Guest VMs to use Infiniband interfaces In-Reply-To: References: Message-ID: On Mon, 11 Dec 2006, Adit Ranadive wrote: > Novell is planning those changes, but unfortunately the source tree at > http://xenbits.xensource.com/ext/xen-smartio.hg is still about 8 months > old. > > Also, are the Mellanox 25208 HCAs compatible with the 23208 ones? They are compatible in "Tavor" compatibility mode. > I know that the guest VMs use the HCA driver only for the 23208. > Unfortunately I have the 25208 ones; will they still work in domU? I'm not sure if the xen-smartio tree supports this.
> Thanks, > Adit > > On 12/11/06, James Lentini wrote: > > > > > > On Mon, 11 Dec 2006, Adit Ranadive wrote: > > > > > Has anyone worked with the xen-smartio repository? > > > > Novell has made substantial improvements to the xen-smartio code. They > > made a presentation at the last workshop: > > > > http://openfabrics.org/conference/nov2006sc/xen-ib-presentation.pdf From mlleinin at hpcn.ca.sandia.gov Mon Dec 11 15:20:56 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Mon, 11 Dec 2006 15:20:56 -0800 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com> References: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com> Message-ID: <1165879256.19459.379.camel@localhost> On Tue, 2006-12-05 at 12:22 +0000, Eric Barton wrote: > Hi, > > We'd dearly like some help to understand why we seem to be having > performance issues with OFED. When we run a lustre network bandwidth > benchmark, we find significant performance degradation on OFED versus > Voltaire... >
>                Premap (256 RDMA frags)    Map on demand (1 RDMA frag)
>                Voltaire  OFED  Ratio      Voltaire  OFED  Ratio
> Writes MB/s    682       567   83 %       577       436   75 %
> Reads  MB/s    658       554   84 %       555       432   77 %
Were these tests run on the same hardware setup? If so, was it PCI-X or PCIe? The HCA firmware version would also be useful. Roland may be able to comment on whether there are performance differences for interrupt-driven CQs between the old VAPI stacks and OFED. At face value these results are troubling, since we are starting to move all of our IB clusters that use Lustre over to OFED. Thanks, - Matt > > These tests measure the bandwidth of 1MByte transfers pipelined 8 deep. > All hardware/software was the same, apart from the IB stack and the lustre > network driver. > > The architecture of the lustre network drivers for OFED and Voltaire are > almost identical.
Both use RC QPs with the same control message protocol > to set up bulk data transfers using RDMA WRITE. Control messages use a > credit flow protocol to ensure that they are only sent when buffers are > posted to receive them. Concurrent transfers over the same QP are > supported so that lustre can pipeline bulk I/O. > > The only difference between the lustre network drivers is that the Voltaire > driver has a single global CQ and the OFED driver has 1 CQ per QP. However > the measurement above are for a single pair of nodes - in this case both > implementations use a single CQ. > > By default, the drivers pre-map all of physical memory so each RDMA > consists of page fragments. However, we can also compile both drivers to > map on demand using FMR so that RDMA is not fragmented. The results above > compare both methods and although both drivers perform worse when mapping, > the OFED driver takes the bigger hit. > > We'd be delighted if anyone can shed any light or can suggest any steps we > should take to discover the reason. We're also very willing to provide > assistance if any of the OpenFabrics developers wants to duplicate the > setup. > From sashak at voltaire.com Mon Dec 11 16:09:11 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 12 Dec 2006 02:09:11 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061211054539.GL9205@mellanox.co.il> References: <20061210233657.GB32199@sashak.voltaire.com> <20061211054539.GL9205@mellanox.co.il> Message-ID: <20061212000911.GJ25052@sashak.voltaire.com> On 07:48 Mon 11 Dec , Michael S. Tsirkin wrote: > > > > > > Recently I found this OFA 'Userspace Git Trees' downloading howto: > > > > > > > > > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories > > > > > > > > > > > > and thought that we could make it simpler for end-user to choose the > > > > > > "right" git tree just by adding one more series of symbolic links under > > > > > > /pub/scm. 
This links will point to the maintainer's "official" trees, and > > > > > > we will have only one such link per project. > > > > > > > > > > > > So typical downloading howto for end-users will looks like: > > > > > > > > > > > > git clone git://staging.openfabrics.org/dapl > > > > > > git clone git://staging.openfabrics.org/ibutils > > > > > > git clone git://staging.openfabrics.org/imgen > > > > > > ... > > > > > > > > > > > > instead of > > > > > > > > > > > > git clone git://staging.openfabrics.org/~ardavis/dapl > > > > > > git clone git://staging.openfabrics.org/~eitan/ibutils > > > > > > git clone git://staging.openfabrics.org/~mst/imgen > > > > > > ... > > > > > > > > > > > > as it is now. > > > > > > > > > > NACK, please remove this. These soft links are messy, and > > > > > the fact that one needs root just to add a tree shows just how the approach > > > > > is broken. > > > > > > > > No, it is not instead, but in addition to ~user/ links, so root is _not_ > > > > required to add tree. > > > > > > right but suddenly root is needed to make it "official". > > > Let's avoid the whole policy-setting-by-softlinks. > > > "I have root" should not equal, or be required for "I say what's official". > > > > What are you trying to avoid? That only sysadmin will decide which git > > tree will be "official" for OFED and which will not? > > Yes. Another point is that I should not need sysadmin priviledges to create > a new project and declare my tree the official source. Nothing prevents from you to do it. No? In "worst" case we could make /pub/scm writable for dedicated group (like 'git') and add all git users to this group. I think this should be safe - currently we have only symbolic links in this directory. > But not only that - staging is used to develop more than just OFED. Read > the rant part in the original mail if you like for more detail - development > trees should all be equal. Only releases should be official. 
And release has an > immutable name, so it does not *matter* which tree you get it from. I don't understand how it is related. Currently we have the list of "official" trees anyway in the Wiki (as above): https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories , and the goal is just to make it easier for end-users to find this. > > > These should be branches, not separate trees. > > > > Why not? > > You seem to have a fear of branches :). Of course not :). I like branches and I like trees, both can be useful. > Many trees do not buy you anything, > I tried this with ofed 1.1 in the beginning. Your bad experience doesn't mean that multiple trees are a bad idea - you may find many good experiences as well (look at kernel.org for example). > You can have many trees. But a single project maintained by a single person > belongs in a single public tree, scattering it around between multiple trees > just makes it messy for people to track, and messy to figure out the delta > between branches. In the "rant" above you talked about equal development trees, I guess "multiple"? What about multiple projects maintained by a single person, and a single project maintained by multiple persons, and experimental features of some existing project maintained by yet another person... > Finally, it wastes space. 'git-clone -s' helps to save space. Anyway I don't think that my proposal is such a "Great Idea" (and requires such fundamental discussion as branch against tree :)), but just a small helpful thing, mainly end-user oriented. And since there is no strong support for this, I'm removing this now.
Sasha From rdreier at cisco.com Mon Dec 11 16:16:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 11 Dec 2006 16:16:21 -0800 Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> (Ralph Campbell's message of "Mon, 11 Dec 2006 10:02:26 -0800 (PST)") References: <1165517253.14800.283.camel@brick.pathscale.com> <457BD18D.7000403@voltaire.com> <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> Message-ID: > I would like to see this last set of patches integrated as is. > I would like to get more experience with the current implementation > before extending it to support other configurations. Yeah, let's go with that. Since ipath depends on 64BIT in Kconfig anyway I think this is OK for now. - R. From rdreier at cisco.com Mon Dec 11 16:17:50 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 11 Dec 2006 16:17:50 -0800 Subject: [openib-general] version #defines for the kernel In-Reply-To: <076a01c71cb0$244a7630$0281a8c0@ebpc> (Eric Barton's message of "Sun, 10 Dec 2006 23:08:45 -0000") References: <076a01c71cb0$244a7630$0281a8c0@ebpc> Message-ID: > > No other kernel subsystem has one, so I don't think it's realistic to > > expect one for IB. > Don't you think it would be useful? Even if only to make API changes > explicit? Sure, I admit it would be useful for out-of-tree code. But it would also be an unmaintainable mess to actually try and have a set of feature flags, so I don't think we can do it. - R. From poknam at gmail.com Mon Dec 11 17:24:53 2006 From: poknam at gmail.com (PN) Date: Tue, 12 Dec 2006 09:24:53 +0800 Subject: [openib-general] Automatically connect to SRP target In-Reply-To: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com> References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com> Message-ID: <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com> No one can help me? 
:( PN 2006/12/7, Lai Dragonfly : > > Hi all, > > i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in clients and > IBGD-1.8.2-srpt in targets. > i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in > openib.conf, > i could not found the SRP target. > until i execute "srp_daemon -e -o", i can see all the targets appear in > /dev/sdX. > > since i want to export the targets to other nodes, > any idea so that i can connect to the targets automatically in each > reboot. > without typing "srp_daemon -e -o" each time? > > thanks in advance. > > PN > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vuhuong at mellanox.com Mon Dec 11 17:25:42 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 11 Dec 2006 17:25:42 -0800 Subject: [openib-general] nfsrdma server stop responding, In-Reply-To: References: <4579C6C3.5090207@mellanox.com> Message-ID: <457E0516.2050009@mellanox.com> James Lentini wrote: > A couple of questions Vu: > > What NFS-RDMA release are you using? This looks like release 7. > Yes. I'm using release 7 > Is this reproducible? I ran into it twice - I think that it may co-relate to openSM restart incident. I'll double check it and confirm > > What kernel version are you using? 2.6.18.5 > > What hardware is this on? It looks like x86-64 to me, which is fine. I > just want to be sure I know what I'm looking at. As many specifics as > possible is good (number of CPUs, hyperthreading, etc.) > Dual woodcrest xeon based CPUs > Could you send the output of > > objdump -Slr /path/to/kernel/mm/swap.o > I attached the objdump output here > Actually, just the put_page disassembly is all I want to see. > > Is there any more text available? Usually there is an explanation > given for an oops message (e.g. "Unable to handle kernel paging > request.."). > I did not see any oops text message. 
System was still responsive with ipoib ping or login > I opened a bug at the NFS-RDMA SourceForge project to track this: > > http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 thanks for your help, -vu > > Thanks for reporting this. > james > > On Fri, 8 Dec 2006, Vu Pham wrote: > >> Hi James, >> I got these errors in server's /var/log/messages and then the server stop >> responding to login, I/O...; however, the server is still up, ipoib is still >> working >> >> >> Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] >> [] put_page+0x17/0x40 >> Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: 00010246 >> Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001 >> RCX: 000000000003ffff >> Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001 >> RDI: ffff8102274e92f8 >> Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034 >> R09: 0000000000000000 >> Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000 >> R12: ffff81020ef96800 >> Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000 >> R15: ffff8102053ee890 >> Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) >> GS:ffff81022066eb40(0000) knlGS:0000000000000000 >> Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >> 000000008005003b >> Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000 >> CR4: 00000000000006e0 >> Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo >> ffff810219dde000, task ffff81020d87f0c0) >> Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 ffff81020ef96968 >> ffff81020ef96800 ffff81020ef96958 >> Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 >> ffffffff80424e05 0000000000000000 >> Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 >> ffffffff80239b90 ffff81020d87f0c0 >> Dec 8 06:38:21 ibd201 kernel: Call Trace: >> Dec 8 06:38:21 ibd201 kernel: [] >> 
:sunrpc:svc_rdma_put_context+0x37/0xd0 >> Dec 8 06:38:21 ibd201 kernel: [] >> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 >> Dec 8 06:38:21 ibd201 kernel: [] >> schedule_timeout+0x95/0xb0 >> Dec 8 06:38:21 ibd201 kernel: [] process_timeout+0x0/0x10 >> Dec 8 06:38:21 ibd201 kernel: [] >> wait_for_completion_timeout+0xcd/0x150 >> Dec 8 06:38:21 ibd201 kernel: [] >> default_wake_function+0x0/0x10 >> Dec 8 06:38:21 ibd201 kernel: [] >> :ib_mthca:mthca_cmd_post+0x232/0x260 >> Dec 8 06:38:21 ibd201 kernel: [] >> default_wake_function+0x0/0x10 >> Dec 8 06:38:21 ibd201 kernel: [] __next_cpu+0x19/0x30 >> Dec 8 06:38:21 ibd201 kernel: [] >> find_busiest_group+0x24e/0x6d0 >> Dec 8 06:38:21 ibd201 kernel: [] thread_return+0x0/0xde >> Dec 8 06:38:21 ibd201 kernel: [] >> _spin_unlock_irqrestore+0x8/0x10 >> Dec 8 06:38:21 ibd201 kernel: [] >> try_to_del_timer_sync+0x51/0x60 >> Dec 8 06:38:21 ibd201 kernel: [] del_timer_sync+0xc/0x20 >> Dec 8 06:38:21 ibd201 kernel: [] >> schedule_timeout+0x95/0xb0 >> Dec 8 06:38:21 ibd201 kernel: [] >> :sunrpc:svc_recv+0x416/0x510 >> Dec 8 06:38:21 ibd201 kernel: [] >> default_wake_function+0x0/0x10 >> Dec 8 06:38:21 ibd201 kernel: [] >> default_wake_function+0x0/0x10 >> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 >> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x111/0x380 >> Dec 8 06:38:21 ibd201 kernel: [] child_rip+0xa/0x12 >> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 >> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 >> Dec 8 06:38:21 ibd201 kernel: [] child_rip+0x0/0x12 >> Dec 8 06:38:21 ibd201 kernel: >> Dec 8 06:38:21 ibd201 kernel: >> Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08 >> 0f 94 c0 84 c0 74 >> Dec 8 06:38:21 ibd201 kernel: RIP [] put_page+0x17/0x40 >> Dec 8 06:38:21 ibd201 kernel: RSP >> >> -vu >> From vuhuong at mellanox.com Mon Dec 11 17:31:08 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 11 Dec 2006 17:31:08 -0800 Subject: [openib-general] Automatically connect 
to SRP target In-Reply-To: <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com> References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com> <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com> Message-ID: <457E065C.6030104@mellanox.com> PN, Edit file /etc/infiniband/openib.conf and set SRPHA_ENABLE=yes this will start srp_daemon by default -vu > No one can help me? :( > > PN > > > 2006/12/7, Lai Dragonfly >: > > Hi all, > > i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in > clients and > IBGD-1.8.2-srpt in targets. > i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in > openib.conf, > i could not found the SRP target. > until i execute "srp_daemon -e -o", i can see all the targets appear > in /dev/sdX. > > since i want to export the targets to other nodes, > any idea so that i can connect to the targets automatically in each > reboot. > without typing "srp_daemon -e -o" each time? > > thanks in advance. > > PN > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vuhuong at mellanox.com Mon Dec 11 17:32:10 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 11 Dec 2006 17:32:10 -0800 Subject: [openib-general] nfsrdma server stop responding, In-Reply-To: <457E0516.2050009@mellanox.com> References: <4579C6C3.5090207@mellanox.com> <457E0516.2050009@mellanox.com> Message-ID: <457E069A.4020807@mellanox.com> Hit *send* too soon - here is the objdump of swap.o -vu > James Lentini wrote: >> A couple of questions Vu: >> >> What NFS-RDMA release are you using? This looks like release 7. >> > > Yes. I'm using release 7 > >> Is this reproducible? > > I ran into it twice - I think that it may co-relate to > openSM restart incident. 
I'll double check it and confirm > > >> What kernel version are you using? > > 2.6.18.5 > >> What hardware is this on? It looks like x86-64 to me, which is fine. I >> just want to be sure I know what I'm looking at. As many specifics as >> possible is good (number of CPUs, hyperthreading, etc.) >> > > Dual woodcrest xeon based CPUs > >> Could you send the output of >> >> objdump -Slr /path/to/kernel/mm/swap.o >> > > I attached the objdump output here > >> Actually, just the put_page disassembly is all I want to see. >> >> Is there any more text available? Usually there is an explanation >> given for an oops message (e.g. "Unable to handle kernel paging >> request.."). >> > > I did not see any oops text message. System was still > responsive with ipoib ping or login > > >> I opened a bug at the NFS-RDMA SourceForge project to track this: >> >> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 > > thanks for your help, > > -vu > >> Thanks for reporting this. 
>> james >> >> On Fri, 8 Dec 2006, Vu Pham wrote: >> >>> Hi James, >>> I got these errors in server's /var/log/messages and then the server stop >>> responding to login, I/O...; however, the server is still up, ipoib is still >>> working >>> >>> >>> Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] >>> [] put_page+0x17/0x40 >>> Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: 00010246 >>> Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001 >>> RCX: 000000000003ffff >>> Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001 >>> RDI: ffff8102274e92f8 >>> Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034 >>> R09: 0000000000000000 >>> Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000 >>> R12: ffff81020ef96800 >>> Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000 >>> R15: ffff8102053ee890 >>> Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) >>> GS:ffff81022066eb40(0000) knlGS:0000000000000000 >>> Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>> 000000008005003b >>> Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000 >>> CR4: 00000000000006e0 >>> Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo >>> ffff810219dde000, task ffff81020d87f0c0) >>> Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 ffff81020ef96968 >>> ffff81020ef96800 ffff81020ef96958 >>> Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 >>> ffffffff80424e05 0000000000000000 >>> Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 >>> ffffffff80239b90 ffff81020d87f0c0 >>> Dec 8 06:38:21 ibd201 kernel: Call Trace: >>> Dec 8 06:38:21 ibd201 kernel: [] >>> :sunrpc:svc_rdma_put_context+0x37/0xd0 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> schedule_timeout+0x95/0xb0 >>> Dec 8 06:38:21 ibd201 kernel: [] process_timeout+0x0/0x10 
>>> Dec 8 06:38:21 ibd201 kernel: [] >>> wait_for_completion_timeout+0xcd/0x150 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> default_wake_function+0x0/0x10 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> :ib_mthca:mthca_cmd_post+0x232/0x260 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> default_wake_function+0x0/0x10 >>> Dec 8 06:38:21 ibd201 kernel: [] __next_cpu+0x19/0x30 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> find_busiest_group+0x24e/0x6d0 >>> Dec 8 06:38:21 ibd201 kernel: [] thread_return+0x0/0xde >>> Dec 8 06:38:21 ibd201 kernel: [] >>> _spin_unlock_irqrestore+0x8/0x10 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> try_to_del_timer_sync+0x51/0x60 >>> Dec 8 06:38:21 ibd201 kernel: [] del_timer_sync+0xc/0x20 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> schedule_timeout+0x95/0xb0 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> :sunrpc:svc_recv+0x416/0x510 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> default_wake_function+0x0/0x10 >>> Dec 8 06:38:21 ibd201 kernel: [] >>> default_wake_function+0x0/0x10 >>> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 >>> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x111/0x380 >>> Dec 8 06:38:21 ibd201 kernel: [] child_rip+0xa/0x12 >>> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 >>> Dec 8 06:38:21 ibd201 kernel: [] :nfsd:nfsd+0x0/0x380 >>> Dec 8 06:38:21 ibd201 kernel: [] child_rip+0x0/0x12 >>> Dec 8 06:38:21 ibd201 kernel: >>> Dec 8 06:38:21 ibd201 kernel: >>> Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08 >>> 0f 94 c0 84 c0 74 >>> Dec 8 06:38:21 ibd201 kernel: RIP [] put_page+0x17/0x40 >>> Dec 8 06:38:21 ibd201 kernel: RSP >>> >>> -vu >>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: swap.objdump URL: From poknam at gmail.com Mon Dec 11 18:41:08 2006 From: poknam at gmail.com (PN) Date: Tue, 12 Dec 2006 10:41:08 +0800 Subject: [openib-general] Automatically connect to SRP target In-Reply-To: <457E065C.6030104@mellanox.com> References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com> <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com> <457E065C.6030104@mellanox.com> Message-ID: <92daa7bf0612111841l70e4a653ked1d93ec1dc9f91@mail.gmail.com> Hi Vu, i have 2 more questions, now i have 3 srp targets and use LVM to form a GFS system. after setting SRPHA_ENABLE=yes, i found that sometimes (~30%) it will miss a target during reboot. i need to manually type "srp_daemon -e -o" to discover the missing target. is there any method such that the srp_daemon will repeat to try to ensure all targets were found? also, currently there is only 1 cable connect to each dual ports client. is it normal to have the following messages? Dec 12 10:18:10 storage02 run_srp_daemon[5471]: starting srp_daemon: [HCA=mthca0] [port=2] Dec 12 10:18:13 storage02 run_srp_daemon[5483]: failed srp_daemon: [HCA=mthca0] [port=2] [exit status=0] Dec 12 10:18:43 storage02 run_srp_daemon[5489]: starting srp_daemon: [HCA=mthca0] [port=2] Dec 12 10:18:46 storage02 run_srp_daemon[5501]: failed srp_daemon: [HCA=mthca0] [port=2] [exit status=0] .....[repeat infinitely] Thanks a lot, PN Below is the log: Dec 12 10:17:18 storage02 network: Setting network parameters: succeeded Dec 12 10:17:18 storage02 network: Bringing up loopback interface: succeeded Dec 12 10:17:23 storage02 network: Bringing up interface eth0: succeeded Dec 12 10:17:23 storage02 network: Bringing up interface ib0: succeeded Dec 12 10:17:26 storage02 kernel: REJ reason 0xa Dec 12 10:17:26 storage02 kernel: ib_srp: Connection failed Dec 12 10:17:26 storage02 kernel: scsi3 : SRP.T10:00D0680000000578 Dec 12 10:17:26 storage02 kernel: Vendor: Mellanox Model: IBSRP10-TGT Rev: 1.46 Dec 12 10:17:26 
storage02 kernel: Type: Direct-Access ANSI SCSI revision: 03 Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB) Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte hdwr sectors (81964 MB) Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back Dec 12 10:17:26 storage02 rpcidmapd: rpc.idmapd startup succeeded Dec 12 10:17:26 storage02 kernel: sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7 > Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdb at scsi3, channel 0, id 0, lun 0 Dec 12 10:17:26 storage02 kernel: scsi4 : SRP.T10:00D06800000007B2 Dec 12 10:17:26 storage02 kernel: Vendor: Mellanox Model: IBSRP10-TGT hy-b Rev: 1.46 Dec 12 10:17:26 storage02 kernel: Type: Direct-Access ANSI SCSI revision: 03 Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte hdwr sectors (81964 MB) Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte hdwr sectors (81964 MB) Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back Dec 12 10:17:26 storage02 kernel: sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 > Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdc at scsi4, channel 0, id 0, lun 0 Dec 12 10:17:26 storage02 scsi.agent[3668]: disk at /devices/pci0000:00/0000:00:02.0/0000:01:00.0/host3/target3:0:0/3:0:0:0 Dec 12 10:17:26 storage02 scsi.agent[3705]: disk at /devices/pci0000:00/0000:00:02.0/0000:01:00.0/host4/target4:0:0/4:0:0:0 Dec 12 10:17:26 storage02 ccsd[3769]: Starting ccsd 1.0.7: Dec 12 10:17:26 storage02 ccsd[3769]: Built: Aug 26 2006 15:01:49 Dec 12 10:17:26 storage02 ccsd[3769]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. 
Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 10 Dec 12 10:17:26 storage02 kernel: Disabled Privacy Extensions on device ffffffff80405540(lo) Dec 12 10:17:26 storage02 kernel: IPv6 over IPv4 tunneling driver Dec 12 10:17:26 storage02 ccsd: succeeded Dec 12 10:17:26 storage02 kernel: CMAN 2.6.9-45.4.centos4 (built Aug 26 2006 14:55:55) installed Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 30 Dec 12 10:17:26 storage02 kernel: DLM 2.6.9-42.12.centos4 (built Aug 27 2006 05:25:40) installed Dec 12 10:17:27 storage02 ccsd[3769]: cluster.conf (cluster name = GFS_Cluster, version = 21) found. Dec 12 10:17:27 storage02 ccsd[3769]: Unable to perform sendto: Cannot assign requested address Dec 12 10:17:27 storage02 run_srp_daemon[3845]: failed srp_daemon: [HCA=mthca0] [port=2] [exit status=0] Dec 12 10:17:28 storage02 run_srp_daemon[3851]: starting srp_daemon: [HCA=mthca0] [port=2] Dec 12 10:17:29 storage02 ccsd[3769]: Remote copy of cluster.conf is from quorate node. Dec 12 10:17:29 storage02 ccsd[3769]: Local version # : 21 Dec 12 10:17:29 storage02 ccsd[3769]: Remote version #: 21 Dec 12 10:17:29 storage02 kernel: CMAN: Waiting to join or form a Linux-cluster Dec 12 10:17:29 storage02 kernel: CMAN: sending membership request Dec 12 10:17:29 storage02 ccsd[3769]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.1 Dec 12 10:17:29 storage02 ccsd[3769]: Initial status:: Inquorate Dec 12 10:17:30 storage02 kernel: CMAN: got node storage01 Dec 12 10:17:30 storage02 kernel: CMAN: got node storage03 Dec 12 10:17:30 storage02 kernel: CMAN: quorum regained, resuming activity Dec 12 10:17:30 storage02 ccsd[3769]: Cluster is quorate. Allowing connections. 
Dec 12 10:17:30 storage02 cman: startup succeeded Dec 12 10:17:30 storage02 lock_gulmd: no section detected in /etc/cluster/cluster.conf succeeded Dec 12 10:17:31 storage02 fenced: startup succeeded Dec 12 10:17:31 storage02 run_srp_daemon[4196]: failed srp_daemon: [HCA=mthca0] [port=2] [exit status=0] Dec 12 10:17:33 storage02 run_srp_daemon[4224]: starting srp_daemon: [HCA=mthca0] [port=2] Dec 12 10:17:36 storage02 run_srp_daemon[4236]: failed srp_daemon: [HCA=mthca0] [port=2] [exit status=0] Dec 12 10:17:40 storage02 run_srp_daemon[4242]: starting srp_daemon: [HCA=mthca0] [port=2] Dec 12 10:17:42 storage02 clvmd: Cluster LVM daemon started - connected to CMAN Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 on node storage01 Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 on node storage03 Dec 12 10:17:42 storage02 clvmd: clvmd startup succeeded Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes for volume group gfsvg. Dec 12 10:17:42 storage02 vgchange: Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes for volume group gfsvg. Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes for volume group gfsvg. Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes for volume group gfsvg. 
Dec 12 10:17:42 storage02 vgchange: Volume group "gfsvg" not found Dec 12 10:17:42 storage02 clvmd: Activating VGs: failed Dec 12 10:17:42 storage02 netfs: Mounting other filesystems: succeeded Dec 12 10:17:42 storage02 kernel: Lock_Harness 2.6.9-58.2.centos4 (built Aug 27 2006 05:27:43) installed Dec 12 10:17:42 storage02 kernel: GFS 2.6.9-58.2.centos4 (built Aug 27 2006 05:28:00) installed Dec 12 10:17:42 storage02 mount: mount: special device /dev/gfsvg/gfslv does not exist Dec 12 10:17:42 storage02 gfs: Mounting GFS filesystems: failed Dec 12 10:17:42 storage02 kernel: i2c /dev entries driver ..... 2006/12/12, Vu Pham : > > PN, > Edit file /etc/infiniband/openib.conf and set > > SRPHA_ENABLE=yes > > this will start srp_daemon by default > > -vu > > > No one can help me? :( > > > > PN > > > > > > 2006/12/7, Lai Dragonfly >: > > > > Hi all, > > > > i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in > > clients and > > IBGD-1.8.2-srpt in targets. > > i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in > > openib.conf, > > i could not found the SRP target. > > until i execute "srp_daemon -e -o", i can see all the targets appear > > in /dev/sdX. > > > > since i want to export the targets to other nodes, > > any idea so that i can connect to the targets automatically in each > > reboot. > > without typing "srp_daemon -e -o" each time? > > > > thanks in advance. > > > > PN > > > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From vishal at endace.com Mon Dec 11 20:51:49 2006
From: vishal at endace.com (vishal)
Date: Tue, 12 Dec 2006 17:51:49 +1300
Subject: [openib-general] srp initiator device discovery
In-Reply-To:
References:
Message-ID: <1165899109.14308.9.camel@julia.et.endace.com>

Hi,

I have an SRP initiator installed with OFED-1.1, and another machine with
an SRP target (IBGOLD). I started the srp daemon to discover the target
devices, and then ran fdisk -l to see the list. The list (below) shows
duplicate devices:

Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1          13      104391   83  Linux
/dev/sdd2              14       60801   488279610   8e  Linux LVM

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sde doesn't contain a valid partition table

Disk /dev/sdf: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdg: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdg1   *           1          13      104391   83  Linux
/dev/sdg2              14       60801   488279610   8e  Linux LVM

Doing some tests I found that sdb=sde, sdc=sdf, and sdd=sdg (obvious).

I also tested the device discovery after creating an md device on the
target side, and found that the initiator doesn't take into account the
presence of an md device.
Is this the expected behaviour?

Thanks for your time!
Vishal

From mst at mellanox.co.il Mon Dec 11 21:42:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 07:42:23 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212000911.GJ25052@sashak.voltaire.com>
References: <20061210233657.GB32199@sashak.voltaire.com>
 <20061211054539.GL9205@mellanox.co.il>
 <20061212000911.GJ25052@sashak.voltaire.com>
Message-ID: <20061212054223.GB11064@mellanox.co.il>

Sasha, one small request: could you please fix the description for your
trees? It should hopefully say something like "mirror of svn for ".

Thanks very much,
MST

--
MST

From mst at mellanox.co.il Mon Dec 11 21:46:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 07:46:36 +0200
Subject: [openib-general] ~bos/ipathverbs
Message-ID: <20061212054636.GC11064@mellanox.co.il>

Bryan, could you please change the description for your tree?
The gitweb summary page only shows the first 3 words, so it now says
"Userspace Infiniband verbs ...", and this does not make it clear that it's
not a generic verbs tree.

Can you make it "Qlogic ipath userspace support", or something in that
style, please?

--
MST

From mst at mellanox.co.il Mon Dec 11 21:58:41 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 07:58:41 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212000911.GJ25052@sashak.voltaire.com>
References: <20061210233657.GB32199@sashak.voltaire.com>
 <20061211054539.GL9205@mellanox.co.il>
 <20061212000911.GJ25052@sashak.voltaire.com>
Message-ID: <20061212055841.GD11064@mellanox.co.il>

> > Finally, it wastes space.
>
> 'git-clone -s' helps to save space.

BTW, be careful with that: it seems clone -s might lose your data if the
repository you clone from removes some heads and prunes history.
So it's only safe to clone in this way from Linus who knows never to do this :)

--
MST

From mst at mellanox.co.il Mon Dec 11 22:03:34 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 08:03:34 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <1165879256.19459.379.camel@localhost>
References: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>
 <1165879256.19459.379.camel@localhost>
Message-ID: <20061212060334.GE11064@mellanox.co.il>

> > Hi,
> >
> > We'd dearly like some help to understand why we seem to be having
> > performance issues with OFED. When we run a lustre network bandwidth
> > benchmark, we find significant performance degradation on OFED versus
> > Voltaire...
> >
> >               Premap (256 RDMA frags)    Map on demand (1 RDMA frag)
> >               Voltaire  OFED  Ratio      Voltaire  OFED  Ratio
> > Writes MB/s   682       567   83 %       577       436   75 %
> > Reads MB/s    658       554   84 %       555       432   77 %
>
> Were these tests run on the same hardware setup? If so was it PCI-X
> or PCIe? HCA firmware version would also be useful.

Good point, Matt, thanks!
This gives me an idea: try loading mthca with tune_pci=1.
If this helps, this is a BIOS issue.

--
MST

From ramachandra.kuchimanchi at qlogic.com Mon Dec 11 22:28:05 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi)
Date: Tue, 12 Dec 2006 00:28:05 -0600
Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover from secondary path back to primary path
In-Reply-To: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com>
References: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com>
Message-ID:

Roland,

Did you get a chance to look at these patches?
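As a reading aid for the patch quoted below, the corrected timer-expiry decisions can be sketched in standalone C. The enum and function names here are hypothetical stand-ins, not the driver's own: NP_RECONNECT stands for the call to update_path_and_reconnect() and NP_REGISTER for the call to vnic_npevent_register().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the actions the driver takes; illustration only. */
enum np_action {
	NP_NONE,      /* nothing to do */
	NP_RECONNECT, /* stands for update_path_and_reconnect() */
	NP_REGISTER   /* stands for vnic_npevent_register() */
};

/* Primary path timer expired: reconnect only when the carrier is down. */
enum np_action primary_timer_expired(bool carrier)
{
	return carrier ? NP_NONE : NP_RECONNECT;
}

/*
 * Secondary path timer expired: reconnect when the carrier is down;
 * otherwise register the path if the vnic is still uninitialized.
 */
enum np_action secondary_timer_expired(bool carrier, bool vnic_uninitialized)
{
	if (!carrier)
		return NP_RECONNECT;
	return vnic_uninitialized ? NP_REGISTER : NP_NONE;
}
```

The inverted carrier test is the heart of the change: a path whose carrier is still down when its timer expires is the one that needs a fresh path record and a reconnect attempt.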
Regards, Ram > -----Original Message----- > From: openib-general-bounces at openib.org [mailto:openib-general- > bounces at openib.org] On Behalf Of Ramachandra K > Sent: Thursday, December 07, 2006 4:33 PM > To: Roland Dreier > Cc: Openib-General > Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover from > secondary path back to primary path > > This fixes a bug due to which failover from secondary path back to primary path > was not working. > > Signed-off-by: Ramachandra K > --- > > drivers/infiniband/ulp/vnic/vnic_ib.c | 4 +++- > drivers/infiniband/ulp/vnic/vnic_main.c | 9 +++++---- > 2 files changed, 8 insertions(+), 5 deletions(-) > > diff --git a/drivers/infiniband/ulp/vnic/vnic_ib.c > b/drivers/infiniband/ulp/vnic/vnic_ib.c > index 6196e20..56ae9f7 100644 > --- a/drivers/infiniband/ulp/vnic/vnic_ib.c > +++ b/drivers/infiniband/ulp/vnic/vnic_ib.c > @@ -303,10 +303,12 @@ int vnic_ib_get_path(struct netpath *net > " path record query\n", > config->path_info.status); > > - netpath_timer(netpath, vnic->config->no_path_timeout); > ret = config->path_info.status; > } > out: > + if (ret) > + netpath_timer(netpath, vnic->config->no_path_timeout); > + > return ret; > } > > diff --git a/drivers/infiniband/ulp/vnic/vnic_main.c > b/drivers/infiniband/ulp/vnic/vnic_main.c > index fca2b90..e15d3f9 100644 > --- a/drivers/infiniband/ulp/vnic/vnic_main.c > +++ b/drivers/infiniband/ulp/vnic/vnic_main.c > @@ -710,17 +710,18 @@ static struct vnic * vnic_handle_npevent > case VNIC_PRINP_TIMEREXPIRED: > netpath = &vnic->primary_path; > netpath->timer_state = NETPATH_TS_EXPIRED; > - if (netpath->carrier) > + if (!netpath->carrier) > update_path_and_reconnect(netpath, vnic); > break; > case VNIC_SECNP_TIMEREXPIRED: > netpath = &vnic->secondary_path; > netpath->timer_state = NETPATH_TS_EXPIRED; > - if (netpath->carrier) { > + if (!netpath->carrier) > + update_path_and_reconnect(netpath, vnic); > + else { > if (vnic->state == VNIC_UNINITIALIZED) > 
vnic_npevent_register(vnic, netpath);
> -        } else
> -            update_path_and_reconnect(netpath, vnic);
> +        }
>          break;
>      case VNIC_PRINP_LINKUP:
>          vnic->primary_path.carrier = 1;
>

From mst at mellanox.co.il Mon Dec 11 22:48:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 08:48:47 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To:
References:
Message-ID: <20061212064847.GB13509@mellanox.co.il>

> I think we could probably merge it but maybe it's better to put it in
> -mm for a cycle given that it's new and not too many people have
> looked at it yet. And I still haven't gotten comfortable with the way
> CM is enabled.

Now I'm confused. Bottom line: should I try fixing up the enabling bit ASAP,
or do you not want it in 2.6.20 anyway?

--
MST

From yhkim93 at keti.re.kr Mon Dec 11 23:02:05 2006
From: yhkim93 at keti.re.kr (김영환)
Date: Tue, 12 Dec 2006 16:02:05 +0900
Subject: [openib-general] booting problem after cross compile to ppc in infiniband source of linux-2.6.19
Message-ID: <20061212070219.E733C3B0009@sentry-two.sandia.gov>

I am developing an InfiniBand storage system on an IBM PPC 440 SPe 667 MHz,
so I have cross-compiled the InfiniBand source for ppc.
But the following messages appeared on the console.
What is the problem? I think it happens at DMA allocation.

Is anybody developing the InfiniBand driver on ppc?
And is there any InfiniBand source that supports ppc?

Please help me.
Thanks, as always, for the help of the openib members.

============================================================================

Waiting for PHY auto negotiation to complete...
done ENET Speed is 1000 Mbps - FULL duplex connection Using ppc_4xx_eth0 device TFTP from server 192.168.1.1; our IP address is 192.168.1.10 Filename 'yucca/uImage'. Load address: 0x200000 Loading: T ################################################################# ################################################################# ################################################################# ################################################### done Bytes transferred = 1255776 (132960 hex) ## Booting image at 00200000 ... Image Name: Linux-2.6.19 Image Type: PowerPC Linux Kernel Image (gzip compressed) Data Size: 1255712 Bytes = 1.2 MB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK Uncompressing Kernel Image ... OK Linux version 2.6.19 (root at yhkim-devpc) (gcc version 4.0.0) #2 Fri Dec 8 11:18:08 KST 2006 PCIE:1 successfully set as rootpoint vendor-id 0xaaa1 device-id 0xbed1 Yucca port (Roland Dreier ) Zone PFN ranges: DMA 0 -> 196608 Normal 196608 -> 196608 early_node_map[1] active PFN ranges 0: 0 -> 196608 Built 1 zonelists. 
Total pages: 195072
Kernel command line: root=/dev/nfs rw nfsroot=192.168.1.1:/tftpboot/yucca/ppc_4xx ip=192.168.1.10:192.168.1.1::255.250PID hash table entries: 4096 (order: 12, 16384 bytes)
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 776704k available (1900k kernel code, 592k data, 148k init, 0k highmem)
Mount-cache hash table entries: 512
NET: Registered protocol family 16
PCI: Probing PCI hardware
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 7, 524288 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
Generic RTC Driver v1.07
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at MMIO 0x0 (irq = 0) is a 16550A
serial8250: ttyS1 at MMIO 0x0 (irq = 1) is a 16550A
serial8250: ttyS2 at MMIO 0x0 (irq = 37) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
PPC 4xx OCP EMAC driver, version 3.54
mal0: initialized, 1 TX channels, 1 RX channels
eth0: emac0, MAC 00:04:ac:01:ca:fe
eth0: found CIS8201 Gigabit Ethernet PHY (0x01)
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0001:01:01.0
kernel BUG in __dma_alloc_coherent at arch/ppc/kernel/dma-mapping.c:233!
Oops: Exception in kernel mode, sig: 5 [#1]
NIP: C0004904 LR: C00048D0 CTR: 00000000
REGS: c0981c90 TRAP: 0700 Not tainted (2.6.19)
MSR: 00029000 CR: 88FF4F82 XER: 00000000
TASK = c096db70[1] 'swapper' THREAD: c0980000
GPR00: 00000001 C0981D40 C096DB70 C0885840 00000000 0000001F EF4BAFFC 00029000
GPR08: C021E410 00000000 C097B828 00000000 28FF4F88 00000000 3FFE6500 00000001
GPR16: 007FFF93 00000000 00800000 FFFFFFFF 007FFF00 C0280000 C0220000 00000000
GPR24: EF48F3E0 C021E410 FF2FF000 C0981D9C C0885860 C09A3000 C0885840 00001000
NIP [C0004904] __dma_alloc_coherent+0x20c/0x2d8
LR [C00048D0] __dma_alloc_coherent+0x1d8/0x2d8
Call Trace:
[C0981D40] [C0004828] __dma_alloc_coherent+0x130/0x2d8 (unreliable)
[C0981D80] [C0273404] mthca_create_eq+0x338/0x438
[C0981DE0] [C0273668] mthca_init_eq_table+0x164/0x6c0
[C0981E20] [C0146A44] __mthca_init_one+0x924/0xbf4
[C0981E70] [C0272F08] mthca_init_one+0x74/0xbc
[C0981E90] [C00F6FE4] pci_device_probe+0x7c/0xa0
[C0981EB0] [C010FB58] really_probe+0x54/0x13c
[C0981ED0] [C011004C] __driver_attach+0xcc/0xf8
[C0981EF0] [C010EE7C] bus_for_each_dev+0x54/0x90
[C0981F20] [C010F958] driver_attach+0x24/0x34
[C0981F30] [C010F4B0] bus_add_driver+0x84/0x168
[C0981F50] [C011034C] driver_register+0x68/0xb0
[C0981F60] [C00F6C64] __pci_register_driver+0x98/0xa8
[C0981F70] [C02720D0] mthca_init+0x60/0x8c
[C0981F80] [C0001124] init+0x98/0x2a4
[C0981FF0] [C0003DA0] kernel_thread+0x44/0x60
Instruction dump:
3d20c028 8169d0e0 7c00f050 54003826 7c005a14 901b0000 815d0004 39200000
7d205379 38000000 41820008 38000001 <0f000000> 38000400 7d60f028 7d6b0378
Kernel panic - not syncing: Attempted to kill init!
<0>Rebooting in 1 seconds..

U-Boot 1.1.6 (Dec 7 2006 - 16:36:13)

CPU: AMCC PowerPC 440SPe Rev.
B at 533.328 MHz (PLB=133, OPB=66, EBC=66 MHz) I2C boot EEPROM enabled Bootstrap Option D - Boot ROM Location I2C (Addr 0x50) Internal PCI arbiter enabled 32 kB I-Cache 32 kB D-Cache Board: Yucca - AMCC 440SPe Evaluation Board I2C: ready DRAM: 1024 MB FLASH: 5 MB PCI: Bus Dev VenId DevId Class Int PCIE:1 successfully set as rootpoint 01 01 15b3 6282 0c06 00 In: serial Out: serial Err: serial Net: ppc_4xx_eth0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From eitan at mellanox.co.il Mon Dec 11 23:10:15 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 12 Dec 2006 09:10:15 +0200 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <1165870759.21606.18477.camel@hal.voltaire.com> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <1165617195.26559.4435.camel@hal.voltaire.com> <457AC99E.8050402@mellanox.co.il> <1165870759.21606.18477.camel@hal.voltaire.com> Message-ID: <457E55D7.5070603@mellanox.co.il> Hal Rosenstock wrote: > On Sat, 2006-12-09 at 09:35, Eitan Zahavi wrote: > >> Hal Rosenstock wrote: >> >>> On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: >>> >>> >>>> Hal Rosenstock wrote: >>>> >>>> >>>>> Hi Eitan, >>>>> >>>>> Just wanted to close the loop on the OpenSM issues of the last couple >>>>> days. >>>>> >>>>> 1. When can you supply an OpenSM verbose log for the InformInfo >>>>> subscribe problem you reported earlier today ? Failing that, I don't >>>>> know how to reproduce this. >>>>> >>>>> >>>>> >>>> Attached >>>> >>>> >> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure. >> > > Any idea on when you will have a chance to look into this ? > Maybe by the weekend. > >>>>> 4. I encourage you to look at and comment on the OpenSM patches rather >>>>> than waiting for them to be in the tree. 
>>>>>
>>>> I am sure you did not mean to, but now I have to admit my limited skills
>>>> in catching bugs by reading patches :-( .
>>>>
>>> Not just read, but they are there to try out as well.
>>>
>> I will need an automatic flow for that sake. I can not keep up with the
>> amount of patches manually.
>> But I do not know how to automatically convert the mails into patches
>> into a tree.
>>
>>> You could try out the patches and do the same thing before they are
>>> committed.
>>>
>> I have automation based on the committed tree that pull it (git trem) ,
>> compile and run regression.
>> Actually this is how all other code is handled too.
>>
>
> Are you referring to OFED ?
>
No, the current GIT tree under git://staging.openfabrics.org/~halr/management.git

> In the case of OFED, where do those "special" trees/branches come from ?
>
No. I think we have a misunderstanding: I am not proposing to use a
pre-commit branch. But if there is no such branch, I cannot do pre-commit
testing. I think it is fine to have post-commit bug reports. No big deal.

We branch when we go to an OFED release. Then I have two regressions that
run every night: one on the trunk and one on the OFED branch.
This is how things were for OFED1.1 and OFED1.0.

It is your call whether we need a "stable" trunk and an experimental branch
so that I can test pre-trunk patches.
What I will not be able to do is build an automatic system that selects
which patches to include in the regression, etc.
Eitan > -- Hal > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ogerlitz at voltaire.com Tue Dec 12 00:51:58 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 12 Dec 2006 10:51:58 +0200 Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: References: <1165517253.14800.283.camel@brick.pathscale.com> <457BD18D.7000403@voltaire.com> <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> Message-ID: <457E6DAE.3040206@voltaire.com> Roland Dreier wrote: > > I would like to see this last set of patches integrated as is. > > I would like to get more experience with the current implementation > > before extending it to support other configurations. > > Yeah, let's go with that. Since ipath depends on 64BIT in Kconfig > anyway I think this is OK for now. This design of ib_dma_map_single, ib_sg_dma_address etc returning u64 instead of dma_addr_t causes the resulted patch to the IB ULPs to be quite big. Have you tested any dma_map single (eg IPoIB) and sg (eg SRP or iSER) consumer with this code? Or. From ogerlitz at voltaire.com Tue Dec 12 00:57:32 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 12 Dec 2006 10:57:32 +0200 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <457D9B4A.6010507@ichips.intel.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <457BDF15.6090608@voltaire.com> <457D9B4A.6010507@ichips.intel.com> Message-ID: <457E6EFC.6030601@voltaire.com> Sean Hefty wrote: > Can you just send a signed-off-by line? I'll add the patch to the > librdmacm multicast branch. 
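The return-code convention behind the rdma_leave_multicast fix in this message can be sketched in standalone C. This is only an illustration, not librdmacm code: normalize_write_result() is a hypothetical helper name, and it assumes the caller wants 0 on success rather than the byte count that write() returns.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of librdmacm. It applies the same mapping
 * as the patched code to the result of a write() of `size` bytes:
 *   - full write            -> 0
 *   - short positive write  -> -ENODATA
 *   - zero or negative ret  -> passed through unchanged
 */
int normalize_write_result(ssize_t ret, size_t size)
{
	if (ret < 0 || (size_t)ret != size)
		return (ret > 0) ? -ENODATA : (int)ret;
	return 0;
}
```

With this shape a caller can check the result with a plain `if (ret)`. Judging by the patch title, the unpatched success path apparently left the positive byte count in ret instead.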
> fix rdma_leave_multicast return code on the success path > > Signed-off-by: Or Gerlitz > > --- librdmacm/src/cma.c 2006-12-10 12:55:03.000000000 +0200 > +++ librdmacm-multicast/src/cma.c 2006-12-10 13:15:12.000000000 +0200 > @@ -1015,6 +1015,8 @@ int rdma_leave_multicast(struct rdma_cm_ > ret = write(id->channel->fd, msg, size); > if (ret != size) > ret = (ret > 0) ? -ENODATA : ret; > + else > + ret = 0; > > pthread_mutex_lock(&id_priv->mut); > while (mc->events_completed < resp->events_reported) From vuhuong at mellanox.com Tue Dec 12 00:58:01 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 12 Dec 2006 00:58:01 -0800 Subject: [openib-general] Automatically connect to SRP target In-Reply-To: <92daa7bf0612111841l70e4a653ked1d93ec1dc9f91@mail.gmail.com> References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com> <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com> <457E065C.6030104@mellanox.com> <92daa7bf0612111841l70e4a653ked1d93ec1dc9f91@mail.gmail.com> Message-ID: <457E6F19.90103@mellanox.com> PN wrote: > Hi Vu, > > i have 2 more questions, > now i have 3 srp targets and use LVM to form a GFS system. > > after setting SRPHA_ENABLE=yes, i found that sometimes (~30%) it will > miss a target during reboot. > i need to manually type "srp_daemon -e -o" to discover the missing target. > is there any method such that the srp_daemon will repeat to try to > ensure all targets were found? > Probably you didn't have a clean shutdown and the srp target still had the previous connection around (it does not have self clean up dead connection mechanism) then the next login the srp target reject the login request However srp_daemon will scan the fabric every 60 sec and should pick up the missing target from previous scan > also, currently there is only 1 cable connect to each dual ports client. > is it normal to have the following messages? 
> Dec 12 10:18:10 storage02 run_srp_daemon[5471]: starting srp_daemon: > [HCA=mthca0] [port=2] > Dec 12 10:18:13 storage02 run_srp_daemon[5483]: failed srp_daemon: > [HCA=mthca0] [port=2] [exit status=0] > Dec 12 10:18:43 storage02 run_srp_daemon[5489]: starting srp_daemon: > [HCA=mthca0] [port=2] > Dec 12 10:18:46 storage02 run_srp_daemon[5501]: failed srp_daemon: > [HCA=mthca0] [port=2] [exit status=0] > .....[repeat infinitely] This is fine. The srp_daemon for port 2 keep running and it will detect any target on the fabric if you plug the cable in; otherwise, there's no ill effect except these annoying error messages -vu > > > Thanks a lot, > PN > > > Below is the log: > > Dec 12 10:17:18 storage02 network: Setting network parameters: succeeded > Dec 12 10:17:18 storage02 network: Bringing up loopback interface: > succeeded > Dec 12 10:17:23 storage02 network: Bringing up interface eth0: succeeded > Dec 12 10:17:23 storage02 network: Bringing up interface ib0: succeeded > Dec 12 10:17:26 storage02 kernel: REJ reason 0xa > Dec 12 10:17:26 storage02 kernel: ib_srp: Connection failed > Dec 12 10:17:26 storage02 kernel: scsi3 : SRP.T10:00D0680000000578 > Dec 12 10:17:26 storage02 kernel: Vendor: Mellanox Model: > IBSRP10-TGT Rev: 1.46 > Dec 12 10:17:26 storage02 kernel: Type: > Direct-Access ANSI SCSI revision: 03 > Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte > hdwr sectors (81964 MB) > Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back > Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte > hdwr sectors (81964 MB) > Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back > Dec 12 10:17:26 storage02 rpcidmapd: rpc.idmapd startup succeeded > Dec 12 10:17:26 storage02 kernel: sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 > sdb7 > > Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdb at scsi3, > channel 0, id 0, lun 0 > Dec 12 10:17:26 storage02 kernel: scsi4 : SRP.T10:00D06800000007B2 > 
Dec 12 10:17:26 storage02 kernel: Vendor: Mellanox Model: IBSRP10-TGT > hy-b Rev: 1.46 > Dec 12 10:17:26 storage02 kernel: Type: > Direct-Access ANSI SCSI revision: 03 > Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte > hdwr sectors (81964 MB) > Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back > Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte > hdwr sectors (81964 MB) > Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back > Dec 12 10:17:26 storage02 kernel: sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 > > Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdc at scsi4, > channel 0, id 0, lun 0 > Dec 12 10:17:26 storage02 scsi.agent[3668]: disk at > /devices/pci0000:00/0000:00:02.0/0000:01:00.0/host3/target3:0:0/3:0:0:0 > Dec 12 10:17:26 storage02 scsi.agent[3705]: disk at > /devices/pci0000:00/0000:00:02.0/0000:01:00.0/host4/target4:0:0/4:0:0:0 > Dec 12 10:17:26 storage02 ccsd[3769]: Starting ccsd 1.0.7: > Dec 12 10:17:26 storage02 ccsd[3769]: Built: Aug 26 2006 15:01:49 > Dec 12 10:17:26 storage02 ccsd[3769]: Copyright (C) Red Hat, Inc. > 2004 All rights reserved. > Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 10 > Dec 12 10:17:26 storage02 kernel: Disabled Privacy Extensions on device > ffffffff80405540(lo) > Dec 12 10:17:26 storage02 kernel: IPv6 over IPv4 tunneling driver > Dec 12 10:17:26 storage02 ccsd: succeeded > Dec 12 10:17:26 storage02 kernel: CMAN 2.6.9-45.4.centos4 (built Aug 26 > 2006 14:55:55) installed > Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 30 > Dec 12 10:17:26 storage02 kernel: DLM 2.6.9-42.12.centos4 (built Aug 27 > 2006 05:25:40) installed > Dec 12 10:17:27 storage02 ccsd[3769]: cluster.conf (cluster name = > GFS_Cluster, version = 21) found. 
> Dec 12 10:17:27 storage02 ccsd[3769]: Unable to perform sendto: Cannot > assign requested address > Dec 12 10:17:27 storage02 run_srp_daemon[3845]: failed srp_daemon: > [HCA=mthca0] [port=2] [exit status=0] > Dec 12 10:17:28 storage02 run_srp_daemon[3851]: starting srp_daemon: > [HCA=mthca0] [port=2] > Dec 12 10:17:29 storage02 ccsd[3769]: Remote copy of cluster.conf is > from quorate node. > Dec 12 10:17:29 storage02 ccsd[3769]: Local version # : 21 > Dec 12 10:17:29 storage02 ccsd[3769]: Remote version #: 21 > Dec 12 10:17:29 storage02 kernel: CMAN: Waiting to join or form a > Linux-cluster > Dec 12 10:17:29 storage02 kernel: CMAN: sending membership request > Dec 12 10:17:29 storage02 ccsd[3769]: Connected to cluster infrastruture > via: CMAN/SM Plugin v1.1.7.1 > Dec 12 10:17:29 storage02 ccsd[3769]: Initial status:: Inquorate > Dec 12 10:17:30 storage02 kernel: CMAN: got node storage01 > Dec 12 10:17:30 storage02 kernel: CMAN: got node storage03 > Dec 12 10:17:30 storage02 kernel: CMAN: quorum regained, resuming activity > Dec 12 10:17:30 storage02 ccsd[3769]: Cluster is quorate. Allowing > connections. 
> Dec 12 10:17:30 storage02 cman: startup succeeded > Dec 12 10:17:30 storage02 lock_gulmd: no section detected in > /etc/cluster/cluster.conf succeeded > Dec 12 10:17:31 storage02 fenced: startup succeeded > Dec 12 10:17:31 storage02 run_srp_daemon[4196]: failed srp_daemon: > [HCA=mthca0] [port=2] [exit status=0] > Dec 12 10:17:33 storage02 run_srp_daemon[4224]: starting srp_daemon: > [HCA=mthca0] [port=2] > Dec 12 10:17:36 storage02 run_srp_daemon[4236]: failed srp_daemon: > [HCA=mthca0] [port=2] [exit status=0] > Dec 12 10:17:40 storage02 run_srp_daemon[4242]: starting srp_daemon: > [HCA=mthca0] [port=2] > Dec 12 10:17:42 storage02 clvmd: Cluster LVM daemon started - connected > to CMAN > Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 > on node storage01 > Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 > on node storage03 > Dec 12 10:17:42 storage02 clvmd: clvmd startup succeeded > Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid > 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. > Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes > for volume group gfsvg. > Dec 12 10:17:42 storage02 vgchange: > Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid > 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. > Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes > for volume group gfsvg. > Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid > 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. > Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes > for volume group gfsvg. > Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid > 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'. > Dec 12 10:17:42 storage02 vgchange: Couldn't find all physical volumes > for volume group gfsvg. 
> Dec 12 10:17:42 storage02 vgchange: Volume group "gfsvg" not found > Dec 12 10:17:42 storage02 clvmd: Activating VGs: failed > Dec 12 10:17:42 storage02 netfs: Mounting other filesystems: succeeded > Dec 12 10:17:42 storage02 kernel: Lock_Harness 2.6.9-58.2.centos4 (built > Aug 27 2006 05:27:43) installed > Dec 12 10:17:42 storage02 kernel: GFS 2.6.9-58.2.centos4 (built Aug 27 > 2006 05:28:00) installed > Dec 12 10:17:42 storage02 mount: mount: special device /dev/gfsvg/gfslv > does not exist > Dec 12 10:17:42 storage02 gfs: Mounting GFS filesystems: failed > Dec 12 10:17:42 storage02 kernel: i2c /dev entries driver > ..... > > > > > > > 2006/12/12, Vu Pham >: > > PN, > Edit file /etc/infiniband/openib.conf and set > > SRPHA_ENABLE=yes > > this will start srp_daemon by default > > -vu > > > No one can help me? :( > > > > PN > > > > > > 2006/12/7, Lai Dragonfly >>: > > > > Hi all, > > > > i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in > > clients and > > IBGD-1.8.2-srpt in targets. > > i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in > > openib.conf, > > i could not found the SRP target. > > until i execute "srp_daemon -e -o", i can see all the targets > appear > > in /dev/sdX. > > > > since i want to export the targets to other nodes, > > any idea so that i can connect to the targets automatically > in each > > reboot. > > without typing "srp_daemon -e -o" each time? > > > > thanks in advance. 
> > > > PN

From vuhuong at mellanox.com Tue Dec 12 01:03:54 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 01:03:54 -0800
Subject: [openib-general] srp initiator device discovery
In-Reply-To: <1165899109.14308.9.camel@julia.et.endace.com>
References: <1165899109.14308.9.camel@julia.et.endace.com>
Message-ID: <457E707A.4040802@mellanox.com>

How many cables did you connect from your host to the fabric? If you have two cables connected (two ports of the same HCA, or one port on each of two HCAs), then you have two paths to the same SRP target, and each path sees the same number of LUNs on that target.

You can use dm-multipath/multipath and access the LUNs/devices through /dev/mapper; this gives you fail-over/fail-back capability.

IBGD's SRP target only works with SCSI devices. It does not work with block devices (hdX, md, LVM volumes ...).

-vu

> Hi,
>
> I have srp initiator installed with OFED-1.1, and another machine
> with SRP target (IBGOLD). I started the srp daemon to discover the
> target devices, and then ran fdisk -l to see the list.
The list (below) > shows duplicate devices :- > > Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes > 255 heads, 63 sectors/track, 267349 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Disk /dev/sdb doesn't contain a valid partition table > > Disk /dev/sdc: 2199.0 GB, 2199023255552 bytes > 255 heads, 63 sectors/track, 267349 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Device Boot Start End Blocks Id System > > Disk /dev/sdd: 500.1 GB, 500107862016 bytes > 255 heads, 63 sectors/track, 60801 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Device Boot Start End Blocks Id System > /dev/sdd1 * 1 13 104391 83 Linux > /dev/sdd2 14 60801 488279610 8e Linux LVM > > Disk /dev/sde: 2199.0 GB, 2199023255552 bytes > 255 heads, 63 sectors/track, 267349 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Disk /dev/sde doesn't contain a valid partition table > > Disk /dev/sdf: 2199.0 GB, 2199023255552 bytes > 255 heads, 63 sectors/track, 267349 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Device Boot Start End Blocks Id System > > Disk /dev/sdg: 500.1 GB, 500107862016 bytes > 255 heads, 63 sectors/track, 60801 cylinders > Units = cylinders of 16065 * 512 = 8225280 bytes > > Device Boot Start End Blocks Id System > /dev/sdg1 * 1 13 104391 83 Linux > /dev/sdg2 14 60801 488279610 8e Linux LVM > > > > Doing some tests I found that sdb=sde, sdc=sdf, and sdd=sdg (obvious). > > I also tested the device discovery after creating an md device on the > target side, and found that the initiator doesn't take into account the > presence of an md device. Is this the expected behaviour ? > > Thanks for your time! 
> > Vishal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vuhuong at mellanox.com Tue Dec 12 01:19:16 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 12 Dec 2006 01:19:16 -0800 Subject: [openib-general] nfsrdma server stop responding, In-Reply-To: <457E069A.4020807@mellanox.com> References: <4579C6C3.5090207@mellanox.com> <457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com> Message-ID: <457E7414.6040802@mellanox.com> James, I hit another variation of put_page problem. I just ran iozone with 9 GB file size (both client and server machines have 8 GB of memory, dual woodcrest xeon cpus, 2.6.18.5 kernel, nfsrdma release 7) After this happened other nfsrdma clients can still do I/O to the server -vu > Hit *send* too soon - here is the objdump of swap.o > > -vu > > >> James Lentini wrote: >>> A couple of questions Vu: >>> >>> What NFS-RDMA release are you using? This looks like release 7. >>> >> >> Yes. I'm using release 7 >> >>> Is this reproducible? >> >> I ran into it twice - I think that it may co-relate to openSM restart >> incident. I'll double check it and confirm >> >> >>> What kernel version are you using? >> >> 2.6.18.5 >> >>> What hardware is this on? It looks like x86-64 to me, which is fine. >>> I just want to be sure I know what I'm looking at. As many specifics >>> as possible is good (number of CPUs, hyperthreading, etc.) >>> >> >> Dual woodcrest xeon based CPUs >> >>> Could you send the output of >>> objdump -Slr /path/to/kernel/mm/swap.o >>> >> >> I attached the objdump output here >> >>> Actually, just the put_page disassembly is all I want to see. >>> >>> Is there any more text available? Usually there is an explanation >>> given for an oops message (e.g. "Unable to handle kernel paging >>> request.."). 
>>> >> >> I did not see any oops text message. System was still responsive with >> ipoib ping or login >> >> >>> I opened a bug at the NFS-RDMA SourceForge project to track this: >>> >>> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 >>> >> >> thanks for your help, >> >> -vu >> >>> Thanks for reporting this. >>> james >>> >>> On Fri, 8 Dec 2006, Vu Pham wrote: >>> >>>> Hi James, >>>> I got these errors in server's /var/log/messages and then the >>>> server stop >>>> responding to login, I/O...; however, the server is still up, ipoib >>>> is still >>>> working >>>> >>>> >>>> Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] >>>> [] put_page+0x17/0x40 >>>> Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: >>>> 00010246 >>>> Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: >>>> 0000000000000001 >>>> RCX: 000000000003ffff >>>> Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: >>>> 0000000000000001 >>>> RDI: ffff8102274e92f8 >>>> Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: >>>> 0000000000000034 >>>> R09: 0000000000000000 >>>> Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: >>>> 0000000000000000 >>>> R12: ffff81020ef96800 >>>> Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: >>>> 0000000000000000 >>>> R15: ffff8102053ee890 >>>> Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) >>>> GS:ffff81022066eb40(0000) knlGS:0000000000000000 >>>> Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>>> 000000008005003b >>>> Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: >>>> 000000021c22b000 >>>> CR4: 00000000000006e0 >>>> Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo >>>> ffff810219dde000, task ffff81020d87f0c0) >>>> Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 >>>> ffff81020ef96968 >>>> ffff81020ef96800 ffff81020ef96958 >>>> Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 >>>> ffffffff80424e05 0000000000000000 
>>>> Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 >>>> ffffffff80239b90 ffff81020d87f0c0 >>>> Dec 8 06:38:21 ibd201 kernel: Call Trace: >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :sunrpc:svc_rdma_put_context+0x37/0xd0 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> schedule_timeout+0x95/0xb0 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> process_timeout+0x0/0x10 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> wait_for_completion_timeout+0xcd/0x150 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> default_wake_function+0x0/0x10 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :ib_mthca:mthca_cmd_post+0x232/0x260 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> default_wake_function+0x0/0x10 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> __next_cpu+0x19/0x30 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> find_busiest_group+0x24e/0x6d0 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> thread_return+0x0/0xde >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> _spin_unlock_irqrestore+0x8/0x10 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> try_to_del_timer_sync+0x51/0x60 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> del_timer_sync+0xc/0x20 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> schedule_timeout+0x95/0xb0 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :sunrpc:svc_recv+0x416/0x510 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> default_wake_function+0x0/0x10 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> default_wake_function+0x0/0x10 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :nfsd:nfsd+0x0/0x380 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :nfsd:nfsd+0x111/0x380 >>>> Dec 8 06:38:21 ibd201 kernel: [] child_rip+0xa/0x12 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :nfsd:nfsd+0x0/0x380 >>>> Dec 8 06:38:21 ibd201 kernel: [] >>>> :nfsd:nfsd+0x0/0x380 >>>> Dec 8 06:38:21 ibd201 kernel: [] child_rip+0x0/0x12 >>>> Dec 8 06:38:21 ibd201 kernel: >>>> Dec 8 06:38:21 ibd201 kernel: >>>> Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 >>>> f0 
ff 4f 08 >>>> 0f 94 c0 84 c0 74 >>>> Dec 8 06:38:21 ibd201 kernel: RIP [] >>>> put_page+0x17/0x40 >>>> Dec 8 06:38:21 ibd201 kernel: RSP >>>> >>>> -vu >>>> >> >> -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: messages.202 URL: From eeb at bartonsoftware.com Tue Dec 12 01:37:47 2006 From: eeb at bartonsoftware.com (Eric Barton) Date: Tue, 12 Dec 2006 09:37:47 -0000 Subject: [openib-general] version #defines for the kernel In-Reply-To: Message-ID: <099601c71dd1$2ed415d0$0281a8c0@ebpc> > > > No other kernel subsystem has one, so I don't think it's > > > realistic to expect one for IB. > > > Don't you think it would be useful? Even if only to make > > API changes explicit? > > Sure, I admit it would be useful for out-of-tree code. But it would > also be an unmaintainable mess to actually try and have a set of > feature flags, so I don't think we can do it. At the risk of flogging a dead horse - I was only thinking of a very simple version number that incremented on change - something like LINUX_VERSION_CODE? Cheers, Eric From eitan at sw053.yok.mtl.com Tue Dec 12 01:45:01 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Tue, 12 Dec 2006 11:45:01 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-12:normal completion Message-ID: <200612120945.kBC9j1RK024188@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Mon_Dec_11_12:18:47_2006 a12f32 ibutils rev = Mon_Dec_11_12:42:28_2006 2ba86a Total=242 Pass=241 Fail=1 Pass: 33 Stability IS1-16.topo 33 Pkey IS1-16.topo 33 OsmStress IS1-16.topo 33 Multicast IS1-16.topo 33 LidMgr IS1-16.topo 11 Stability IS3-loop.topo 11 Stability IS3-128.topo 11 Pkey IS3-128.topo 11 Multicast IS3-loop.topo 11 Multicast IS3-128.topo 11 LidMgr IS3-128.topo 10 OsmStress IS3-128.topo Failures: 1 OsmStress IS3-128.topo From mst at mellanox.co.il Tue Dec 12 04:29:57 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 12 Dec 2006 14:29:57 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061210225613.GF21155@sashak.voltaire.com> References: <20061210225613.GF21155@sashak.voltaire.com> Message-ID: <20061212122957.GC14622@mellanox.co.il> > For me it is unclear yet how long we may need this - 1.1 still be in > SVN yet, and 1.1 git branch is updated there. By the way, one can't actually build OFED 1.1 userspace from git because OFED also applies some patches after checking things out from svn. They are here: https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes -- MST From mst at mellanox.co.il Tue Dec 12 05:42:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 12 Dec 2006 15:42:35 +0200 Subject: [openib-general] openib-commits and git In-Reply-To: References: Message-ID: <20061212134235.GB26613@mellanox.co.il> > > -----Original Message----- > > From: openib-general-bounces at openib.org > > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock > > Sent: Monday, December 11, 2006 2:07 PM > > To: openib-general at openib.org > > Cc: OpenFabricsEWG > > Subject: [openib-general] openib-commits and git > > > > Hi, > > > > Some have requested the equivalent of what we had with svn with > > openib-commits. > > > > The first question is what capabilities in this are desired. We don't > > want to spend a lot of engineering time on this but it would > > be good to > > know. Is a notification of the commit/push with the log sufficient or > > does it need to look more what svn provided (and include the changes > > too) ? > > > > The other question is a policy one: Is it a reasonable > > default to enable > > this for all the developers ? Do any of the developers object to this > > policy ? > > Quoting r. Scott Weitzenkamp (sweitzen) : > Subject: Re: openib-commits and git > > I would like to see diffs, either inline in the commit email or via a > URL I can click on. In that case, why bother with email at all? 
gitweb already has RSS support, which Sasha has activated. Look at any git tree in gitweb (e.g. http://staging.openfabrics.org/git/) and you'll see an RSS feed URL. This can be fed to any RSS aggregator, including the firefox live bookmarks one. -- MST From halr at voltaire.com Tue Dec 12 05:53:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 12 Dec 2006 08:53:08 -0500 Subject: [openib-general] OpenSM Issues of the last couple days In-Reply-To: <457E55D7.5070603@mellanox.co.il> References: <1165531651.25587.204056.camel@hal.voltaire.com> <457995E5.40303@mellanox.co.il> <1165617195.26559.4435.camel@hal.voltaire.com> <457AC99E.8050402@mellanox.co.il> <1165870759.21606.18477.camel@hal.voltaire.com> <457E55D7.5070603@mellanox.co.il> Message-ID: <1165931584.28709.4614.camel@hal.voltaire.com> On Tue, 2006-12-12 at 02:10, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Sat, 2006-12-09 at 09:35, Eitan Zahavi wrote: > > > >> Hal Rosenstock wrote: > >> > >>> On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: > >>> > >>> > >>>> Hal Rosenstock wrote: > >>>> > >>>> > >>>>> Hi Eitan, > >>>>> > >>>>> Just wanted to close the loop on the OpenSM issues of the last couple > >>>>> days. > >>>>> > >>>>> 1. When can you supply an OpenSM verbose log for the InformInfo > >>>>> subscribe problem you reported earlier today ? Failing that, I don't > >>>>> know how to reproduce this. > >>>>> > >>>>> > >>>>> > >>>> Attached > >>>> > >>>> > >> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure. > >> > > > > Any idea on when you will have a chance to look into this ? > > > Maybe by the weekend. > > > >>>>> 4. I encourage you to look at and comment on the OpenSM patches rather > >>>>> than waiting for them to be in the tree. > >>>>> > >>>>> > >>>>> > >>>> I am sure you did not mean to, but now I have to admit my limited skills > >>>> in catching bugs by reading patches :-( . 
> >>>> > >>>> > >>> Not just read, but they are there to try out as well. > >>> > >>> > >> I will need an automatic flow for that sake. I can not keep up with the > >> amount of patches manually. > >> But I do not know how to automatically convert the mails into patches > >> into a tree. > >> > >>> You could try out the patches and do the same thing before they are > >>> committed. > >>> > >>> > >>> > >> I have automation based on the committed tree that pull it (git trem) , > >> compile and run regression. > >> Actually this is how all other code is handled too. > >> > > > > Are you referring to OFED ? > > > No the current GIT tree under > git://staging.openfabrics.org/~halr/management.git OK but I was commenting on what you said about "all other code" being handled this way. > > In the case of OFED, where do those "special" trees/branches come from ? > > > No. I think we are having some miss-understanding: > I am not proposing using a pre-commit branch. > But if there is no such branch I can not do pre-commit testing. Understood. > I think it is fine to have post-commit bug reports. No big deal. Right; rather than "pre trunk commit" ones. If it breaks, we try to fix it as fast as possible or perhaps even back out the change if there is some critical reason to do so. > We branch when we go to an OFED release. Yes. > Then I have two regressions run every night. One on the trunk and one on > the OFED branch. > This is how things were for OFED1.1 and OFED1.0. That would be great. > It is your call if we need to have a "stable" trunk and experimental > branch such that I will be able to test pre-trunk patches. I'll consider this based on how stable or unstable the trunk is as we go forward but still prefer to not have to maintain another branch (for obvious reasons). > What I will not be able to do is to have an automatic system to select > which patches to include in the regression, etc etc. OK. 
-- Hal

> Eitan
> > -- Hal

From sashak at voltaire.com Tue Dec 12 06:50:31 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 Dec 2006 16:50:31 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212054223.GB11064@mellanox.co.il>
References: <20061210233657.GB32199@sashak.voltaire.com> <20061211054539.GL9205@mellanox.co.il> <20061212000911.GJ25052@sashak.voltaire.com> <20061212054223.GB11064@mellanox.co.il>
Message-ID: <20061212145031.GE10901@sashak.voltaire.com>

On 07:42 Tue 12 Dec , Michael S. Tsirkin wrote:
> Sasha, one small request: could you please fix description for your trees?
> It should hopefully say something like "mirror of svn for ".

Yes, sure.

Sasha

From sashak at voltaire.com Tue Dec 12 07:07:03 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 Dec 2006 17:07:03 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212055841.GD11064@mellanox.co.il>
References: <20061210233657.GB32199@sashak.voltaire.com> <20061211054539.GL9205@mellanox.co.il> <20061212000911.GJ25052@sashak.voltaire.com> <20061212055841.GD11064@mellanox.co.il>
Message-ID: <20061212150703.GF10901@sashak.voltaire.com>

On 07:58 Tue 12 Dec , Michael S. Tsirkin wrote:
> > > Finally, it wastes space.
> >
> > 'git-clone -s' helps to save space.
>
> BTW, be careful with that: it seems clone -s might lose your data if the repository
> you clone from removes some heads and prunes history.

It is hard to lose data fatally this way. It happens only if the origin repo is removed completely (then you can lose the old part of the history). Use 'git-clone -l' if unsure.

And this is still a theoretical discussion - the largest userspace tree on OFA takes 10MB of disk space.
Sasha From mst at mellanox.co.il Tue Dec 12 07:10:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 12 Dec 2006 17:10:16 +0200 Subject: [openib-general] [PATCH] mthca: make all MRs accessible for FMR mapping on 64 bit kernels Message-ID: <20061212151016.GI26613@mellanox.co.il> For Tavor, we currently reserve separate MPT and MTT space for FMRs so avoid abusing the vmalloc space on 32 bit kernels. No such problem exists on 64 bit kernels so let's not do it there. This way we have a shared pool for MR and FMR resources, used on demand. This will also make it possible to write MTTs for regular regions directly from driver. Signed-off-by: Michael S. Tsirkin --- This patch passed verbs and SRP testing here. Please consider this for 2.6.20. Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c @@ -762,7 +762,7 @@ void mthca_arbel_fmr_unmap(struct mthca_ int __devinit mthca_init_mr_table(struct mthca_dev *dev) { unsigned long addr; - int err, i; + int mpts, mtts, err, i; err = mthca_alloc_init(&dev->mr_table.mpt_alloc, dev->limits.num_mpts, @@ -796,13 +796,21 @@ int __devinit mthca_init_mr_table(struct err = -EINVAL; goto err_fmr_mpt; } + mpts = mtts = 1 << i; + } else { + mpts = dev->limits.num_mtt_segs; + mtts = dev->limits.num_mpts; + } + + if (!mthca_is_memfree(dev) && + (dev->mthca_flags & MTHCA_FLAG_FMR)) { addr = pci_resource_start(dev->pdev, 4) + ((pci_resource_len(dev->pdev, 4) - 1) & dev->mr_table.mpt_base); dev->mr_table.tavor_fmr.mpt_base = - ioremap(addr, (1 << i) * sizeof(struct mthca_mpt_entry)); + ioremap(addr, mpts * sizeof(struct mthca_mpt_entry)); if (!dev->mr_table.tavor_fmr.mpt_base) { mthca_warn(dev, "MPT ioremap for FMR failed.\n"); @@ -815,19 +823,21 @@ int __devinit mthca_init_mr_table(struct dev->mr_table.mtt_base); dev->mr_table.tavor_fmr.mtt_base = 
- ioremap(addr, (1 << i) * MTHCA_MTT_SEG_SIZE); + ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE); if (!dev->mr_table.tavor_fmr.mtt_base) { mthca_warn(dev, "MTT ioremap for FMR failed.\n"); err = -ENOMEM; goto err_fmr_mtt; } + } - err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i); + if (dev->limits.fmr_reserved_mtts) { + err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, mtts); if (err) goto err_fmr_mtt_buddy; /* Prevent regular MRs from using FMR keys */ - err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i); + err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, mtts); if (err) goto err_reserve_fmr; Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_profile.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_profile.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_profile.c @@ -277,7 +277,7 @@ u64 mthca_make_profile(struct mthca_dev * out of the MR pool. They don't use additional memory, but * we assign them as part of the HCA profile anyway. */ - if (mthca_is_memfree(dev)) + if (mthca_is_memfree(dev) || BITS_PER_LONG == 64) dev->limits.fmr_reserved_mtts = 0; else dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts; -- MST From mst at mellanox.co.il Tue Dec 12 07:10:39 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 12 Dec 2006 17:10:39 +0200 Subject: [openib-general] [PATCH] mthca: speed up memory registration by filling MTTs directly Message-ID: <20061212151039.GJ26613@mellanox.co.il> Speed up memory registration by filling in MTTs directly. This reduces the number of FW commands needed to register an MR by at least a factor of 2. This applies to all memfree cards, and to tavor mode on 64 bit systems with the patch I posted earlier. Signed-off-by: Michael S. Tsirkin --- This passed verbs testing here, please consider for 2.6.20. Note that this *not* FMR - this is regular IB memory registration since MPTs are still updated using FW command. 
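As a rough illustration of the chunked write that the patch below adds in mthca_write_mtt (this is not the driver code itself), the loop can be sketched in user-space C. Everything here is an assumption made for the example: write_segment, write_chunked and ENTRY_PRESENT are hypothetical stand-ins, a plain array plays the role of the MTT table, and max_chunk plays the role of the per-call limit. The point it tries to show is that each per-segment call should receive the chunk length rather than the full remaining length.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit, standing in for a "present" marker on each entry. */
#define ENTRY_PRESENT 1ULL

/*
 * Hypothetical per-segment writer: copy 'len' page addresses into 'table'
 * starting at 'index', marking each entry present.
 */
static void write_segment(uint64_t *table, size_t index,
                          const uint64_t *pages, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        table[index + i] = pages[i] | ENTRY_PRESENT;
}

/*
 * Write 'len' entries in pieces of at most 'max_chunk' entries each.
 * Assumes max_chunk > 0. Returns the number of segment writes issued.
 */
static size_t write_chunked(uint64_t *table, size_t start,
                            const uint64_t *pages, size_t len,
                            size_t max_chunk)
{
    size_t calls = 0;

    while (len > 0) {
        size_t chunk = len < max_chunk ? len : max_chunk;

        /* Hand over 'chunk' entries, not the whole remaining 'len'. */
        write_segment(table, start, pages, chunk);

        len   -= chunk;
        start += chunk;
        pages += chunk;
        ++calls;
    }

    return calls;
}
```

With seven entries and a limit of three per call, this sketch issues three segment writes (3 + 3 + 1) and leaves entries outside the written range untouched.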
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h @@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); +int mthca_write_mtt_size(struct mthca_dev *dev); + struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size); void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt); int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c @@ -244,8 +244,8 @@ void mthca_free_mtt(struct mthca_dev *de kfree(mtt); } -int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, - int start_index, u64 *buffer_list, int list_len) +static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) { struct mthca_mailbox *mailbox; __be64 *mtt_entry; @@ -296,6 +296,84 @@ out: return err; } +void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + u64 __iomem *mtts; + u32 mtt_seg; + int i; + + mtt_seg = mtt->first_seg * MTHCA_MTT_SEG_SIZE; + mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64); + for (i = 0; i < list_len; ++i) { + __be64 mtt_entry = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + mthca_write64_raw(mtt_entry, mtts + i); + } +} + +void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + __be64 *mtts; + int i; + int s = start_index * 
sizeof (u64); + + /* For Arbel, all MTTs must fit in the same page. */ + BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE); + /* Require full segments */ + BUG_ON(s % MTHCA_MTT_SEG_SIZE); + + mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg + + s / MTHCA_MTT_SEG_SIZE); + + BUG_ON(!mtts); + + for (i = 0; i < list_len; ++i) + mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT); +} + +int mthca_write_mtt_size(struct mthca_dev *dev) +{ + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + /* + * Be friendly to WRITE_MTT command + * and leave two empty slots for the + * index and reserved fields of the + * mailbox. + */ + return PAGE_SIZE / sizeof (u64) - 2; + + /* For Arbel, all MTTs must fit in the same page. */ + return mthca_is_memfree(dev) ? (PAGE_SIZE / sizeof (u64)) : 0x7ffffff; +} + +int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + int size = mthca_write_mtt_size(dev); + int chunk; + + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); + + while (list_len > 0) { + chunk = min(size, list_len); + if (mthca_is_memfree(dev)) + mthca_arbel_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + else + mthca_tavor_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + + list_len -= chunk; + start_index += chunk; + buffer_list += chunk; + } + + return 0; +} + static inline u32 tavor_hw_index_to_key(u32 ind) { return ind; Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s int shift, n, len; int i, j, k; int err = 0; + int write_mtt_size; shift = ffs(region->page_size) - 1; @@ -1040,6 
+1041,8 @@ static struct ib_mr *mthca_reg_user_mr(s i = n = 0; + write_mtt_size = min(mthca_write_mtt_size(dev), PAGE_SIZE / sizeof *pages); + list_for_each_entry(chunk, &region->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; @@ -1047,14 +1050,11 @@ static struct ib_mr *mthca_reg_user_mr(s pages[i++] = sg_dma_address(&chunk->page_list[j]) + region->page_size * k; /* - * Be friendly to WRITE_MTT command - * and leave two empty slots for the - * index and reserved fields of the - * mailbox. + * Be friendly to write_mtt and pass it chunks + * of appropriate size. */ - if (i == PAGE_SIZE / sizeof (u64) - 2) { - err = mthca_write_mtt(dev, mr->mtt, - n, pages, i); + if (i == write_mtt_size) { + err = mthca_write_mtt(dev, mr->mtt, n, pages, i); if (err) goto mtt_done; n += i; -- MST From vlad at dev.mellanox.co.il Tue Dec 12 07:53:58 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 12 Dec 2006 17:53:58 +0200 Subject: [openib-general] Daily build of userspace and kernel packages for OFED-1.2 Message-ID: <457ED096.1020703@dev.mellanox.co.il> Hi, The userspace and kernel space packages for OFED-1.2 developers can be downloaded from: http://staging.openfabrics.org/builds. 
User: http://staging.openfabrics.org/builds/ofa_1_2_user/
Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/

last_stable.tgz link points to the latest package that passed compilation on the build machine (staging.openfabrics.org OS Ubuntu 6.06.1 with kernel 2.6.15-23-server)

To install user/kernel:
Download and open tgz file
Run
./configure PARAMETERS (see configure --help)
make
make install

User space packages from git:
libibverbs_git="git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git"
libmthca_git="git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git"
libehca_git="git://staging.openfabrics.org/~hnguyen/libehca.git"
libipathverbs_git="git://staging.openfabrics.org/~bos/libipathverbs.git"
tvflash_git="git://staging.openfabrics.org/~rdreier/tvflash.git"
libibcm_git="git://staging.openfabrics.org/~shefty/libibcm.git"
libsdp_git="git://staging.openfabrics.org/~eitan/libsdp.git"
mstflint_git="git://staging.openfabrics.org/~mst/mstflint.git"
perftest_git="git://staging.openfabrics.org/~mst/perftest.git"
srptools_git="git://staging.openfabrics.org/~ishai/srptools.git"
ipoibtools_git="git://staging.openfabrics.org/~vlad/ipoibtools.git"
librdmacm_git="git://staging.openfabrics.org/~shefty/librdmacm.git"
dapl_git="git://staging.openfabrics.org/~ardavis/dapl.git"
imgen_git="git://staging.openfabrics.org/~mst/imgen.git"
management_git="git://staging.openfabrics.org/~halr/management.git"
scripts_git="git://staging.openfabrics.org/~vlad/ofascripts.git"

Kernel space:
git://staging.openfabrics.org/~vlad/ofed_1_2

I'd be glad to get comments.
Regards, Vladimir From rdreier at cisco.com Tue Dec 12 08:42:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 12 Dec 2006 08:42:58 -0800 Subject: [openib-general] version #defines for the kernel References: <099601c71dd1$2ed415d0$0281a8c0@ebpc> Message-ID: > At the risk of flogging a dead horse - I was only thinking of a very simple > version number that incremented on change - something like > LINUX_VERSION_CODE? In that case what do you expect to see in a kernel with backported drivers, that has backported some changes but not others? - R. From rdreier at cisco.com Tue Dec 12 09:07:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 12 Dec 2006 09:07:58 -0800 Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover from secondary path back to primary path References: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com> Message-ID: > Did you get a chance to look at these patches ? Not yet ... I will just apply them to the vex branch though. - R. From vuhuong at mellanox.com Tue Dec 12 09:46:47 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 12 Dec 2006 09:46:47 -0800 Subject: [openib-general] nfsrdma server stop responding, In-Reply-To: <457E7414.6040802@mellanox.com> References: <4579C6C3.5090207@mellanox.com> <457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com> <457E7414.6040802@mellanox.com> Message-ID: <457EEB07.8040904@mellanox.com> James, Another variation of put_page problem. I have stopped doing I/O or accessing the mounted directory since last night. This morning I just try to do *ls* the mounted directory and get this error -vu > James, > I hit another variation of put_page problem. 
I just ran iozone with 9 > GB file size (both client and server machines have 8 GB of memory, dual > woodcrest xeon cpus, 2.6.18.5 kernel, nfsrdma release 7) > > After this happened other nfsrdma clients can still do I/O to the server > > -vu > >> Hit *send* too soon - here is the objdump of swap.o >> >> -vu >> >> >>> James Lentini wrote: >>>> A couple of questions Vu: >>>> >>>> What NFS-RDMA release are you using? This looks like release 7. >>>> >>> >>> Yes. I'm using release 7 >>> >>>> Is this reproducible? >>> >>> I ran into it twice - I think that it may co-relate to openSM restart >>> incident. I'll double check it and confirm >>> >>> >>>> What kernel version are you using? >>> >>> 2.6.18.5 >>> >>>> What hardware is this on? It looks like x86-64 to me, which is fine. >>>> I just want to be sure I know what I'm looking at. As many specifics >>>> as possible is good (number of CPUs, hyperthreading, etc.) >>>> >>> >>> Dual woodcrest xeon based CPUs >>> >>>> Could you send the output of >>>> objdump -Slr /path/to/kernel/mm/swap.o >>>> >>> >>> I attached the objdump output here >>> >>>> Actually, just the put_page disassembly is all I want to see. >>>> >>>> Is there any more text available? Usually there is an explanation >>>> given for an oops message (e.g. "Unable to handle kernel paging >>>> request.."). >>>> >>> >>> I did not see any oops text message. System was still responsive with >>> ipoib ping or login >>> >>> >>>> I opened a bug at the NFS-RDMA SourceForge project to track this: >>>> >>>> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 >>>> >>> >>> thanks for your help, >>> >>> -vu >>> >>>> Thanks for reporting this. 
>>>> james >>>> >>>> On Fri, 8 Dec 2006, Vu Pham wrote: >>>> >>>>> Hi James, >>>>> I got these errors in server's /var/log/messages and then the >>>>> server stop >>>>> responding to login, I/O...; however, the server is still up, ipoib >>>>> is still >>>>> working >>>>> >>>>> >>>>> Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] >>>>> [] put_page+0x17/0x40 >>>>> Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: >>>>> 00010246 >>>>> Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: >>>>> 0000000000000001 >>>>> RCX: 000000000003ffff >>>>> Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: >>>>> 0000000000000001 >>>>> RDI: ffff8102274e92f8 >>>>> Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: >>>>> 0000000000000034 >>>>> R09: 0000000000000000 >>>>> Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: >>>>> 0000000000000000 >>>>> R12: ffff81020ef96800 >>>>> Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: >>>>> 0000000000000000 >>>>> R15: ffff8102053ee890 >>>>> Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) >>>>> GS:ffff81022066eb40(0000) knlGS:0000000000000000 >>>>> Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>>>> 000000008005003b >>>>> Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: >>>>> 000000021c22b000 >>>>> CR4: 00000000000006e0 >>>>> Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo >>>>> ffff810219dde000, task ffff81020d87f0c0) >>>>> Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 >>>>> ffff81020ef96968 >>>>> ffff81020ef96800 ffff81020ef96958 >>>>> Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 >>>>> ffffffff80424e05 0000000000000000 >>>>> Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 >>>>> ffffffff80239b90 ffff81020d87f0c0 >>>>> Dec 8 06:38:21 ibd201 kernel: Call Trace: >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :sunrpc:svc_rdma_put_context+0x37/0xd0 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> 
:sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> schedule_timeout+0x95/0xb0 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> process_timeout+0x0/0x10 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> wait_for_completion_timeout+0xcd/0x150 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> default_wake_function+0x0/0x10 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :ib_mthca:mthca_cmd_post+0x232/0x260 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> default_wake_function+0x0/0x10 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> __next_cpu+0x19/0x30 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> find_busiest_group+0x24e/0x6d0 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> thread_return+0x0/0xde >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> _spin_unlock_irqrestore+0x8/0x10 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> try_to_del_timer_sync+0x51/0x60 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> del_timer_sync+0xc/0x20 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> schedule_timeout+0x95/0xb0 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :sunrpc:svc_recv+0x416/0x510 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> default_wake_function+0x0/0x10 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> default_wake_function+0x0/0x10 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :nfsd:nfsd+0x0/0x380 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :nfsd:nfsd+0x111/0x380 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> child_rip+0xa/0x12 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :nfsd:nfsd+0x0/0x380 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> :nfsd:nfsd+0x0/0x380 >>>>> Dec 8 06:38:21 ibd201 kernel: [] >>>>> child_rip+0x0/0x12 >>>>> Dec 8 06:38:21 ibd201 kernel: >>>>> Dec 8 06:38:21 ibd201 kernel: >>>>> Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 >>>>> f0 ff 4f 08 >>>>> 0f 94 c0 84 c0 74 >>>>> Dec 8 06:38:21 ibd201 kernel: RIP [] >>>>> put_page+0x17/0x40 >>>>> Dec 8 06:38:21 ibd201 kernel: RSP >>>>> >>>>> -vu >>>>> >>> >>> > > 
------------------------------------------------------------------------ > > > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596b800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596bc00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17c00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7dec00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 
kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39c00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please bite here ] --------- > Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300 > Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP > Dec 12 01:09:30 ibd202 kernel: CPU 1 > Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd exportfs lockd nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca shpchp ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000 floppy ext3 jbd megaraid_sas sd_mod scsi_mod > Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1 > Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[] [] put_page+0x13/0x2e > 
Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08 EFLAGS: 00010246 > Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000006a53 > Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001 RDI: ffff81024fc3dec0 > Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001 R09: 0000000000000000 > Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 R11: ffffffff88185ac8 R12: ffff810240fb3800 > Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400 R15: 00000000000dbba0 > Dec 12 01:09:30 ibd202 kernel: FS: 00002ad030296b00(0000) GS:ffff81024688eac0(0000) knlGS:0000000000000000 > Dec 12 01:09:30 ibd202 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000 CR4: 00000000000006e0 > Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo ffff81023fd10000, task ffff810246562840) > Dec 12 01:09:30 ibd202 kernel: Stack: ffffffff8817b2fb ffff810240fb39b8 0000000000000000 ffff81024172c5b0 > Dec 12 01:09:30 ibd202 kernel: ffffffff8817ec67 ffff81023cda7000 ffffffff8817b2a8 0000000000000000 > Dec 12 01:09:30 ibd202 kernel: ffff81023fd11ca0 ffff81023fd11b80 0000000000000001 ffff81023cda7000 > Dec 12 01:09:30 ibd202 kernel: Call Trace: > Dec 12 01:09:30 ibd202 kernel: [] :sunrpc:svc_rdma_put_context+0x37/0xb5 > Dec 12 01:09:30 ibd202 kernel: [] :sunrpc:svc_rdma_recvfrom+0x58f/0x1150 > Dec 12 01:09:30 ibd202 kernel: [] :sunrpc:svc_rdma_get_context+0x10c/0x128 > Dec 12 01:09:30 ibd202 kernel: [] :sunrpc:send_write+0x200/0x22c > Dec 12 01:09:30 ibd202 kernel: [] generic_file_readv+0x8e/0xa7 > Dec 12 01:09:30 ibd202 kernel: [] zone_statistics+0x40/0x70 > Dec 12 01:09:30 ibd202 kernel: [] find_busiest_group+0x21f/0x66f > Dec 12 01:09:30 ibd202 kernel: [] _spin_unlock_irq+0x6/0xa > Dec 12 01:09:30 ibd202 kernel: [] thread_return+0x64/0xec > Dec 12 01:09:30 ibd202 kernel: [] _spin_lock_irqsave+0x9/0xe > 
Dec 12 01:09:30 ibd202 kernel: [] lock_timer_base+0x1b/0x3c > Dec 12 01:09:30 ibd202 kernel: [] try_to_del_timer_sync+0x4a/0x51 > Dec 12 01:09:30 ibd202 kernel: [] del_timer_sync+0xc/0x16 > Dec 12 01:09:30 ibd202 kernel: [] schedule_timeout+0x92/0xad > Dec 12 01:09:30 ibd202 kernel: [] :sunrpc:svc_recv+0x3c5/0x4be > Dec 12 01:09:30 ibd202 kernel: [] default_wake_function+0x0/0xe > Dec 12 01:09:30 ibd202 kernel: [] default_wake_function+0x0/0xe > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x10d/0x359 > Dec 12 01:09:30 ibd202 kernel: [] child_rip+0xa/0x12 > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 > Dec 12 01:09:30 ibd202 kernel: [] child_rip+0x0/0x12 > Dec 12 01:09:30 ibd202 kernel: > Dec 12 01:09:30 ibd202 kernel: > Dec 12 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f 94 c0 84 c0 74 > Dec 12 01:09:30 ibd202 kernel: RIP [] put_page+0x13/0x2e > Dec 12 01:09:30 ibd202 kernel: RSP > Dec 12 01:09:30 ibd202 kernel: <4>nfsd: terminating on error 22 > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: messages.202.1 URL: From ralph.campbell at qlogic.com Tue Dec 12 10:22:58 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 12 Dec 2006 10:22:58 -0800 Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <457E6DAE.3040206@voltaire.com> References: <1165517253.14800.283.camel@brick.pathscale.com> <457BD18D.7000403@voltaire.com> <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> <457E6DAE.3040206@voltaire.com> Message-ID: <1165947778.14800.315.camel@brick.pathscale.com> On Tue, 2006-12-12 at 10:51 +0200, Or Gerlitz wrote: > Roland Dreier wrote: > > > I would like to see this last set of patches integrated as is. > > > I would like to get more experience with the current implementation > > > before extending it to support other configurations. > > > > Yeah, let's go with that. Since ipath depends on 64BIT in Kconfig > > anyway I think this is OK for now. > > This design of ib_dma_map_single, ib_sg_dma_address etc returning u64 > instead of dma_addr_t causes the resulted patch to the IB ULPs to be > quite big. I think it was you who pointed out that dma_addr_t is 32 bits on sparc64. Did you have a different solution in mind? > Have you tested any dma_map single (eg IPoIB) and sg (eg SRP or iSER) > consumer with this code? Yes. 
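To illustrate the concern about handle width discussed above: if a mapping function can hand back a full 64-bit value, storing it in a 32-bit handle type silently drops the upper bits. The sketch below is a user-space illustration only, not kernel code; narrow_dma_addr_t, fits_in_narrow() and truncate_to_narrow() are hypothetical names, and any address passed in is just a sample value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical 32-bit handle type, standing in for a platform
 * where the DMA handle type is only 32 bits wide.
 */
typedef uint32_t narrow_dma_addr_t;

/* Value recovered after storing addr in the narrow handle type. */
static uint64_t truncate_to_narrow(uint64_t addr)
{
	narrow_dma_addr_t narrow = (narrow_dma_addr_t)addr; /* upper 32 bits dropped */

	return (uint64_t)narrow;
}

/* Return 1 if addr survives a round trip through the narrow type. */
static int fits_in_narrow(uint64_t addr)
{
	return truncate_to_narrow(addr) == addr;
}
```

An address below 4 GB round-trips unchanged, while anything with bits set above bit 31 does not, which is why an interface that may return such values needs a 64-bit return type.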
From michael.arndt at informatik.tu-chemnitz.de Tue Dec 12 10:21:00 2006 From: michael.arndt at informatik.tu-chemnitz.de (Michael Arndt) Date: Tue, 12 Dec 2006 19:21:00 +0100 Subject: [openib-general] mad_agents Message-ID: <000e01c71e1a$46939ad0$21606d86@one7> Hi, the following statements about functions and modules refer to the mad.c, agent.c and user_mad.c files. During the initialisation of the mad module the function ib_agent_port_open is called (ib_mad_init_device -> ib_mad_port_open). At this point an agent is registered (ib_register_mad_agent) without a MAD registration request supplied. So my question is, what is this agent for? And is it right that the agent registered by the umad module (ib_umad_ioctl -> ib_umad_reg_agent -> ib_register_mad_agent) gets all the SMP packets from the device and passes them to the SM (read on the file descriptor)? What about the SMA? Where are the SMPs filtered between SMA and SM? I would also like to say that it would be really nice if there were some papers, diagrams, graphics or anything else that explain how the whole openib system works. The source code as the only reference isn't really helpful for new developers. Thanks Michael From rdreier at cisco.com Tue Dec 12 10:30:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 12 Dec 2006 10:30:49 -0800 Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <457E6DAE.3040206@voltaire.com> (Or Gerlitz's message of "Tue, 12 Dec 2006 10:51:58 +0200") References: <1165517253.14800.283.camel@brick.pathscale.com> <457BD18D.7000403@voltaire.com> <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> <457E6DAE.3040206@voltaire.com> Message-ID: > This design of ib_dma_map_single, ib_sg_dma_address etc returning u64 > instead of dma_addr_t causes the resulted patch to the IB ULPs to be > quite big. Yes, there are actually some bugs introduced (basically pci_unmap_addr et al can no longer be used). 
I'll fix it up and test before merging. - R. From adit.262 at gmail.com Tue Dec 12 10:45:31 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Tue, 12 Dec 2006 13:45:31 -0500 Subject: [openib-general] QoS configuration using opensm Message-ID: Hi, I'm trying to establish some QoS parameters to allow apps to communicate using different service levels. Currently my opensm.opts looks like this:

# QoS default options
qos_max_vls 15
qos_high_limit 0
qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:255,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_vlarb_low 0:4,1:100,2:100,3:100,4:100,5:100,6:100,7:100,8:100,9:100,10:100,11:100,12:100,13:4,14:4
qos_sl2vl 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7

# QoS CA options
qos_ca_max_vls 15
qos_ca_high_limit 0
qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_ca_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

I'm not sure which options to modify: QoS default or QoS CA? Should both tables have the same values? My setup has no switch, just two machines connected back to back with an IB cable. Since I'm mapping service level 7 to all VLs, should all apps using sl=7 receive equal bandwidth? Also, since the weight of SL=7 in the VLarb_high table is 255? Thanks, Regards, Adit From jlentini at netapp.com Tue Dec 12 10:51:23 2006 From: jlentini at netapp.com (James Lentini) Date: Tue, 12 Dec 2006 13:51:23 -0500 (EST) Subject: [openib-general] nfsrdma server stop responding, In-Reply-To: <457EEB07.8040904@mellanox.com> References: <4579C6C3.5090207@mellanox.com> <457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com> <457E7414.6040802@mellanox.com> <457EEB07.8040904@mellanox.com> Message-ID: It appears that one or more of the receive work requests is completing in error. The crash occurs when the server attempts to clean up the buffer associated with the work request. I'd like to know why receives are failing. What is the error? 
Do your logs contain the printk on net/sunrpc/svc_rdma_recvfrom.c:522 "svcrdma: bad WR completion..."? If they do not, you can turn on SVCRDMA_DEBUG (echo 4096 > /proc/sys/sunrpc/rpc_debug).

james

On Tue, 12 Dec 2006, Vu Pham wrote:

> James,
> Another variation of put_page problem. I have stopped doing I/O or
> accessing the mounted directory since last night. This morning I
> just try to do *ls* the mounted directory and get this error
>
> -vu
>
> > James,
> > I hit another variation of put_page problem. I just ran iozone with 9 GB
> > file size (both client and server machines have 8 GB of memory, dual
> > woodcrest xeon cpus, 2.6.18.5 kernel, nfsrdma release 7)
> >
> > After this happened other nfsrdma clients can still do I/O to the server
> >
> > -vu
> >
> > > Hit *send* too soon - here is the objdump of swap.o
> > >
> > > -vu
> > >
> > >
> > > > James Lentini wrote:
> > > > > A couple of questions Vu:
> > > > >
> > > > > What NFS-RDMA release are you using? This looks like release 7.
> > > > >
> > > >
> > > > Yes. I'm using release 7
> > > >
> > > > > Is this reproducible?
> > > >
> > > > I ran into it twice - I think that it may co-relate to openSM restart
> > > > incident. I'll double check it and confirm
> > > >
> > > >
> > > > > What kernel version are you using?
> > > >
> > > > 2.6.18.5
> > > >
> > > > > What hardware is this on? It looks like x86-64 to me, which is fine. I
> > > > > just want to be sure I know what I'm looking at. As many specifics as
> > > > > possible is good (number of CPUs, hyperthreading, etc.)
> > > > >
> > > >
> > > > Dual woodcrest xeon based CPUs
> > > >
> > > > > Could you send the output of
> > > > > objdump -Slr /path/to/kernel/mm/swap.o
> > > > >
> > > >
> > > > I attached the objdump output here
> > > >
> > > > > Actually, just the put_page disassembly is all I want to see.
> > > > >
> > > > > Is there any more text available? Usually there is an explanation
> > > > > given for an oops message (e.g.
"Unable to handle kernel paging > > > > > request.."). > > > > > > > > > > > > > I did not see any oops text message. System was still responsive with > > > > ipoib ping or login > > > > > > > > > > > > > I opened a bug at the NFS-RDMA SourceForge project to track this: > > > > > > > > > > http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 > > > > > > > > thanks for your help, > > > > > > > > -vu > > > > > > > > > Thanks for reporting this. > > > > > james > > > > > > > > > > On Fri, 8 Dec 2006, Vu Pham wrote: > > > > > > > > > > > Hi James, > > > > > > I got these errors in server's /var/log/messages and then the > > > > > > server stop > > > > > > responding to login, I/O...; however, the server is still up, ipoib > > > > > > is still > > > > > > working > > > > > > > > > > > > > > > > > > Dec 8 06:38:21 ibd201 kernel: RIP: 0010:[] > > > > > > [] put_page+0x17/0x40 > > > > > > Dec 8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08 EFLAGS: > > > > > > 00010246 > > > > > > Dec 8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: > > > > > > 0000000000000001 > > > > > > RCX: 000000000003ffff > > > > > > Dec 8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: > > > > > > 0000000000000001 > > > > > > RDI: ffff8102274e92f8 > > > > > > Dec 8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: > > > > > > 0000000000000034 > > > > > > R09: 0000000000000000 > > > > > > Dec 8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: > > > > > > 0000000000000000 > > > > > > R12: ffff81020ef96800 > > > > > > Dec 8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: > > > > > > 0000000000000000 > > > > > > R15: ffff8102053ee890 > > > > > > Dec 8 06:38:21 ibd201 kernel: FS: 00002ad76b8acb00(0000) > > > > > > GS:ffff81022066eb40(0000) knlGS:0000000000000000 > > > > > > Dec 8 06:38:21 ibd201 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > > > > > > 000000008005003b > > > > > > Dec 8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: > > > > > > 
000000021c22b000 > > > > > > CR4: 00000000000006e0 > > > > > > Dec 8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo > > > > > > ffff810219dde000, task ffff81020d87f0c0) > > > > > > Dec 8 06:38:21 ibd201 kernel: Stack: ffffffff8835e547 > > > > > > ffff81020ef96968 > > > > > > ffff81020ef96800 ffff81020ef96958 > > > > > > Dec 8 06:38:21 ibd201 kernel: ffffffff88360c72 000000010395dc90 > > > > > > ffffffff80424e05 0000000000000000 > > > > > > Dec 8 06:38:21 ibd201 kernel: 0000000000200200 000000010395dc90 > > > > > > ffffffff80239b90 ffff81020d87f0c0 > > > > > > Dec 8 06:38:21 ibd201 kernel: Call Trace: > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :sunrpc:svc_rdma_put_context+0x37/0xd0 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > schedule_timeout+0x95/0xb0 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > process_timeout+0x0/0x10 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > wait_for_completion_timeout+0xcd/0x150 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > default_wake_function+0x0/0x10 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :ib_mthca:mthca_cmd_post+0x232/0x260 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > default_wake_function+0x0/0x10 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > __next_cpu+0x19/0x30 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > find_busiest_group+0x24e/0x6d0 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > thread_return+0x0/0xde > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > _spin_unlock_irqrestore+0x8/0x10 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > try_to_del_timer_sync+0x51/0x60 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > del_timer_sync+0xc/0x20 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > schedule_timeout+0x95/0xb0 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > 
:sunrpc:svc_recv+0x416/0x510 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > default_wake_function+0x0/0x10 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > default_wake_function+0x0/0x10 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :nfsd:nfsd+0x0/0x380 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :nfsd:nfsd+0x111/0x380 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > child_rip+0xa/0x12 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :nfsd:nfsd+0x0/0x380 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > :nfsd:nfsd+0x0/0x380 > > > > > > Dec 8 06:38:21 ibd201 kernel: [] > > > > > > child_rip+0x0/0x12 > > > > > > Dec 8 06:38:21 ibd201 kernel: > > > > > > Dec 8 06:38:21 ibd201 kernel: > > > > > > Dec 8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 > > > > > > f0 ff 4f 08 > > > > > > 0f 94 c0 84 c0 74 > > > > > > Dec 8 06:38:21 ibd201 kernel: RIP [] > > > > > > put_page+0x17/0x40 > > > > > > Dec 8 06:38:21 ibd201 kernel: RSP > > > > > > > > > > > > -vu > > > > > > > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596b800, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596bc00, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17000, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17400, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 
01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17800, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17c00, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de000, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de400, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de800, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7dec00, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39000, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39400, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39800, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39c00, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: 
ctxt=ffff81023e4cf000, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion > > Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on > > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > > Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please > > bite here ] --------- > > Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300 > > Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP Dec 12 01:09:30 > > ibd202 kernel: CPU 1 Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd > > exportfs lockd nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod > > button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca > > shpchp ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000 > > floppy ext3 jbd megaraid_sas sd_mod scsi_mod > > Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1 > > Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[] > > [] put_page+0x13/0x2e > > Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08 EFLAGS: 00010246 > > Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001 > > RCX: 0000000000006a53 > > Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001 > > RDI: ffff81024fc3dec0 > > Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001 > > R09: 0000000000000000 > > Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 R11: ffffffff88185ac8 > > R12: ffff810240fb3800 > > Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400 > > R15: 00000000000dbba0 > > Dec 12 01:09:30 ibd202 kernel: FS: 00002ad030296b00(0000) > > GS:ffff81024688eac0(0000) knlGS:0000000000000000 > > Dec 12 01:09:30 ibd202 
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > > 000000008005003b > > Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000 > > CR4: 00000000000006e0 > > Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo > > ffff81023fd10000, task ffff810246562840) > > Dec 12 01:09:30 ibd202 kernel: Stack: ffffffff8817b2fb ffff810240fb39b8 > > 0000000000000000 ffff81024172c5b0 > > Dec 12 01:09:30 ibd202 kernel: ffffffff8817ec67 ffff81023cda7000 > > ffffffff8817b2a8 0000000000000000 > > Dec 12 01:09:30 ibd202 kernel: ffff81023fd11ca0 ffff81023fd11b80 > > 0000000000000001 ffff81023cda7000 > > Dec 12 01:09:30 ibd202 kernel: Call Trace: > > Dec 12 01:09:30 ibd202 kernel: [] > > :sunrpc:svc_rdma_put_context+0x37/0xb5 > > Dec 12 01:09:30 ibd202 kernel: [] > > :sunrpc:svc_rdma_recvfrom+0x58f/0x1150 > > Dec 12 01:09:30 ibd202 kernel: [] > > :sunrpc:svc_rdma_get_context+0x10c/0x128 > > Dec 12 01:09:30 ibd202 kernel: [] > > :sunrpc:send_write+0x200/0x22c > > Dec 12 01:09:30 ibd202 kernel: [] > > generic_file_readv+0x8e/0xa7 > > Dec 12 01:09:30 ibd202 kernel: [] > > zone_statistics+0x40/0x70 > > Dec 12 01:09:30 ibd202 kernel: [] > > find_busiest_group+0x21f/0x66f > > Dec 12 01:09:30 ibd202 kernel: [] > > _spin_unlock_irq+0x6/0xa > > Dec 12 01:09:30 ibd202 kernel: [] thread_return+0x64/0xec > > Dec 12 01:09:30 ibd202 kernel: [] > > _spin_lock_irqsave+0x9/0xe > > Dec 12 01:09:30 ibd202 kernel: [] > > lock_timer_base+0x1b/0x3c > > Dec 12 01:09:30 ibd202 kernel: [] > > try_to_del_timer_sync+0x4a/0x51 > > Dec 12 01:09:30 ibd202 kernel: [] del_timer_sync+0xc/0x16 > > Dec 12 01:09:30 ibd202 kernel: [] > > schedule_timeout+0x92/0xad > > Dec 12 01:09:30 ibd202 kernel: [] > > :sunrpc:svc_recv+0x3c5/0x4be > > Dec 12 01:09:30 ibd202 kernel: [] > > default_wake_function+0x0/0xe > > Dec 12 01:09:30 ibd202 kernel: [] > > default_wake_function+0x0/0xe > > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 > > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x10d/0x359 > > 
Dec 12 01:09:30 ibd202 kernel: [] child_rip+0xa/0x12 > > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 > > Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 > > Dec 12 01:09:30 ibd202 kernel: [] child_rip+0x0/0x12 > > Dec 12 01:09:30 ibd202 kernel: Dec 12 01:09:30 ibd202 kernel: Dec 12 > > 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f > > 94 c0 84 c0 74 Dec 12 01:09:30 ibd202 kernel: RIP [] > > put_page+0x13/0x2e > > Dec 12 01:09:30 ibd202 kernel: RSP > > Dec 12 01:09:30 ibd202 kernel: <4>nfsd: terminating on error 22 > > > > > > ------------------------------------------------------------------------ > > > >
_______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> >

From eeb at bartonsoftware.com Tue Dec 12 10:53:09 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Tue, 12 Dec 2006 18:53:09 -0000
Subject: [openib-general] version #defines for the kernel
In-Reply-To:
Message-ID: <0a3901c71e1e$c431f910$0281a8c0@ebpc>

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com]
> Sent: 12 December 2006 4:43 PM
> To: Eric Barton
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] version #defines for the kernel
>
> > At the risk of flogging a dead horse - I was only thinking of a very simple
> > version number that incremented on change - something like
> > LINUX_VERSION_CODE?
>
> In that case what do you expect to see in a kernel with backported
> drivers, that has backported some changes but not others?

Blood on the floor somewhere I'd hope :)

Or maybe just no #define for the version, since the person doing the backport clearly isn't worried about compatibility with out-of-tree code.

Cheers,
Eric

From mst at mellanox.co.il Tue Dec 12 11:02:01 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 21:02:01 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061212064847.GB13509@mellanox.co.il>
References: <20061212064847.GB13509@mellanox.co.il>
Message-ID: <20061212190200.GE382@mellanox.co.il>

> Quoting r. Michael S. Tsirkin :
> Subject: Re: [PATCHv2] IPoIB CM Experimental support
>
> > I think we could probably merge it but maybe it's better to put it in
> > -mm for a cycle given that it's new and not too many people have
> > looked at it yet. And I still haven't gotten comfortable with the way
> > CM is enabled.
>
> Now I'm confused.
Bottom line, should I try fixing up the enabling bit ASAP,
> or you don't want it in 2.6.20 anyway?

Roland, could you clarify your opinion pls?

--
MST

From rdreier at cisco.com Tue Dec 12 11:12:00 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:12:00 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061212190200.GE382@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 12 Dec 2006 21:02:01 +0200")
References: <20061212064847.GB13509@mellanox.co.il> <20061212190200.GE382@mellanox.co.il>
Message-ID:

> > Now I'm confused. Bottom line, should I try fixing up the enabling bit ASAP,
> > or you don't want it in 2.6.20 anyway?
>
> Roland, could you clarify your opinion pls?

Sorry, I thought about this a fair amount. I think I finally ended up feeling that the code is just too new. I don't think anyone other than you has had a chance to really look at it (I certainly haven't) so I think we're better off not merging it.

I know that you said -mm has limited value but I actually think just the build coverage is worth it. And it is surprising how many people are auditing the new code that shows up in -mm so I think it will help a fair bit.

- R.

From halr at voltaire.com Tue Dec 12 11:12:21 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Dec 2006 14:12:21 -0500
Subject: [openib-general] QoS configuration using opensm
In-Reply-To:
References:
Message-ID: <1165950734.28709.17589.camel@hal.voltaire.com>

Hi Adit,

On Tue, 2006-12-12 at 13:45, Adit Ranadive wrote:
> Hi,
>
> Im trying to establish some QoS parameters for allowing apps to
> communicate using different service levels.
>
> Currently my opensm.opts looks like this:
>
> # QoS default options
> qos_max_vls 15
> qos_high_limit 0
> qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:255,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> qos_vlarb_low 0:4,1:100,2:100,3:100,4:100,5:100,6:100,7:100,8:100,9:100,10:100,11:100,12:100,13:4,14:4
> qos_sl2vl 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
>
> # QoS CA options
> qos_ca_max_vls 15
> qos_ca_high_limit 0
> qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> qos_ca_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
>
> I'm not sure which options to modify: QoS default or QoS CA?

Depends what you want to do. The defaults are used by all unless overridden by the specific configuration by target port type (ca, rtr, sw0, ext).

> Should both tables have the same values?

They can but if they do, you don't need one of them (likely the ca_ one).

> My setup includes no switch, just two machines connected back to back
> with an IB cable.
>
> Since I'm mapping service level 7 to all VLs

No, it's the other way around: You are mapping all SLs to VL 7.

> all apps using sl=7 should receive equal bandwidth?

These tables only deal with arbitration amongst the VLs (and the mapping of the SLs to VLs). They do not deal with fairness amongst applications sharing the same SL.

> Also, since in the VLarb_high table the weight of SL=7 is 255?

That setting means that the high priority limit can be unbounded and low priority will only be scheduled if there is no high priority work to do.
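
The direction of the mapping Hal points out can be checked with a small sketch. This is only an illustration, not opensm code: parse_sl2vl and parse_vlarb are hypothetical helper names, and the sketch assumes that the position of each entry in a qos_sl2vl list is the SL and the value is the VL, and that each VL arbitration entry has the form vl:weight. The option strings are the ones quoted above.

```python
# Illustrative only: parse_sl2vl and parse_vlarb are hypothetical helpers,
# not part of opensm. They just decode the option strings quoted above.

def parse_sl2vl(value):
    """Return a dict mapping each SL (its position in the list) to a VL."""
    return {sl: int(vl) for sl, vl in enumerate(value.split(","))}


def parse_vlarb(value):
    """Return a list of (vl, weight) pairs from a 'vl:weight,...' string."""
    pairs = []
    for entry in value.split(","):
        vl, weight = entry.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


default_sl2vl = parse_sl2vl("7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7")
ca_sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
vlarb_high = parse_vlarb(
    "0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:255,8:0,9:0,10:0,11:0,12:0,13:0,14:0"
)

# With the default table, every SL lands on the same VL.
print(sorted(set(default_sl2vl.values())))        # [7]
# The CA table maps SL n to VL n for 0..14 and sends SL 15 to VL 7.
print(ca_sl2vl[15])                               # 7
# VLs that carry a non-zero weight in the high-priority table.
print([vl for vl, weight in vlarb_high if weight > 0])  # [0, 7]
```

Under these assumptions the default table collapses all 16 SLs onto VL 7, so VL arbitration cannot separate traffic by SL; the CA table is the one that keeps SLs 0 to 14 on distinct VLs.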
--
Hal

> Thanks,
> Regards,
> Adit

From rdreier at cisco.com Tue Dec 12 11:21:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:21:18 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061212191805.GK382@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 12 Dec 2006 21:18:05 +0200")
References: <20061212191805.GK382@mellanox.co.il>
Message-ID:

> OK, thanks for the suggestion. I'll send something to Andrew after -rc1's out.

Actually I can get stuff into -mm with no work at all -- Andrew just grabs my for-mm branch. So I'll drop CM into there and we'll be all set.

> The memory registration speedup patches can be 2.6.20 material though, can't they?

Yes, they look OK. I am merging stuff up right now and they are on my list.

- R.

From mst at mellanox.co.il Tue Dec 12 11:18:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 21:18:05 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To:
References:
Message-ID: <20061212191805.GK382@mellanox.co.il>

> > > Now I'm confused. Bottom line, should I try fixing up the enabling bit ASAP,
> > > or you don't want it in 2.6.20 anyway?
> >
> > Roland, could you clarify your opinion pls?
>
> Sorry, I thought about this a fair amount. I think I finally ended up
> feeling that the code is just too new. I don't think anyone other
> than you has had a chance to really look at it (I certainly haven't)
> so I think we're better off not merging it.
>
> I know that you said -mm has limited value but I actually think just
> the build coverage is worth it.
And it is surprising how many people
> are auditing the new code that shows up in -mm so I think it will help
> a fair bit.

OK, thanks for the suggestion. I'll send something to Andrew after -rc1's out.

The memory registration speedup patches can be 2.6.20 material though, can't they?

--
MST

From vuhuong at mellanox.com Tue Dec 12 11:25:01 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 11:25:01 -0800
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To:
References: <4579C6C3.5090207@mellanox.com> <457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com> <457E7414.6040802@mellanox.com> <457EEB07.8040904@mellanox.com>
Message-ID: <457F020D.7010500@mellanox.com>

James Lentini wrote:
> It appears that one or more of the receive work requests is completing
> in error. The crash occurs when the server attempts to cleanup the
> buffer associated with the work request.
>
> I'd like to know why receives are failing. What is the error? Do your
> logs contain the printk on net/sunrpc/svc_rdma_recvfrom.c:522
> "svcrdma: bad WR completion..."? If they do not, you can turn on
> SVCRDMA_DEBUG (echo 4096 > /proc/sys/sunrpc/rpc_debug).
> Yes, this error message is original in my log messages.202 and messages.202.1 see below -vu >>> ------------------------------------------------------------------------ >>> >>> >>> >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596b800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596bc00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17c00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: 
svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7dec00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39c00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please >>> bite here ] --------- >>> Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300 >>> Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP Dec 12 01:09:30 >>> ibd202 kernel: CPU 1 Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd >>> exportfs lockd nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod >>> button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca >>> shpchp 
ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000 >>> floppy ext3 jbd megaraid_sas sd_mod scsi_mod >>> Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1 >>> Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[] >>> [] put_page+0x13/0x2e >>> Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08 EFLAGS: 00010246 >>> Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001 >>> RCX: 0000000000006a53 >>> Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001 >>> RDI: ffff81024fc3dec0 >>> Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001 >>> R09: 0000000000000000 >>> Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 R11: ffffffff88185ac8 >>> R12: ffff810240fb3800 >>> Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400 >>> R15: 00000000000dbba0 >>> Dec 12 01:09:30 ibd202 kernel: FS: 00002ad030296b00(0000) >>> GS:ffff81024688eac0(0000) knlGS:0000000000000000 >>> Dec 12 01:09:30 ibd202 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>> 000000008005003b >>> Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000 >>> CR4: 00000000000006e0 >>> Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo >>> ffff81023fd10000, task ffff810246562840) >>> Dec 12 01:09:30 ibd202 kernel: Stack: ffffffff8817b2fb ffff810240fb39b8 >>> 0000000000000000 ffff81024172c5b0 >>> Dec 12 01:09:30 ibd202 kernel: ffffffff8817ec67 ffff81023cda7000 >>> ffffffff8817b2a8 0000000000000000 >>> Dec 12 01:09:30 ibd202 kernel: ffff81023fd11ca0 ffff81023fd11b80 >>> 0000000000000001 ffff81023cda7000 >>> Dec 12 01:09:30 ibd202 kernel: Call Trace: >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_rdma_put_context+0x37/0xb5 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_rdma_recvfrom+0x58f/0x1150 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_rdma_get_context+0x10c/0x128 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:send_write+0x200/0x22c >>> Dec 
12 01:09:30 ibd202 kernel: [] >>> generic_file_readv+0x8e/0xa7 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> zone_statistics+0x40/0x70 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> find_busiest_group+0x21f/0x66f >>> Dec 12 01:09:30 ibd202 kernel: [] >>> _spin_unlock_irq+0x6/0xa >>> Dec 12 01:09:30 ibd202 kernel: [] thread_return+0x64/0xec >>> Dec 12 01:09:30 ibd202 kernel: [] >>> _spin_lock_irqsave+0x9/0xe >>> Dec 12 01:09:30 ibd202 kernel: [] >>> lock_timer_base+0x1b/0x3c >>> Dec 12 01:09:30 ibd202 kernel: [] >>> try_to_del_timer_sync+0x4a/0x51 >>> Dec 12 01:09:30 ibd202 kernel: [] del_timer_sync+0xc/0x16 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> schedule_timeout+0x92/0xad >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_recv+0x3c5/0x4be >>> Dec 12 01:09:30 ibd202 kernel: [] >>> default_wake_function+0x0/0xe >>> Dec 12 01:09:30 ibd202 kernel: [] >>> default_wake_function+0x0/0xe >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x10d/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] child_rip+0xa/0x12 >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] child_rip+0x0/0x12 >>> Dec 12 01:09:30 ibd202 kernel: Dec 12 01:09:30 ibd202 kernel: Dec 12 >>> 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f >>> 94 c0 84 c0 74 Dec 12 01:09:30 ibd202 kernel: RIP [] >>> put_page+0x13/0x2e >>> Dec 12 01:09:30 ibd202 kernel: RSP >>> Dec 12 01:09:30 ibd202 kernel: <4>nfsd: terminating on error 22 >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596b800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81012596bc00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: 
ctxt=ffff810144c17000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff810144c17c00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7de800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e7dec00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39800, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, 
status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023dd39c00, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf000, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 >>> Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please >>> bite here ] --------- >>> Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300 >>> Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP Dec 12 01:09:30 >>> ibd202 kernel: CPU 1 Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd >>> exportfs lockd nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod >>> button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca >>> shpchp ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000 >>> floppy ext3 jbd megaraid_sas sd_mod scsi_mod >>> Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1 >>> Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[] >>> [] put_page+0x13/0x2e >>> Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08 EFLAGS: 00010246 >>> Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001 >>> RCX: 0000000000006a53 >>> Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001 >>> RDI: ffff81024fc3dec0 >>> Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001 >>> R09: 0000000000000000 >>> Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 
R11: ffffffff88185ac8 >>> R12: ffff810240fb3800 >>> Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400 >>> R15: 00000000000dbba0 >>> Dec 12 01:09:30 ibd202 kernel: FS: 00002ad030296b00(0000) >>> GS:ffff81024688eac0(0000) knlGS:0000000000000000 >>> Dec 12 01:09:30 ibd202 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: >>> 000000008005003b >>> Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000 >>> CR4: 00000000000006e0 >>> Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo >>> ffff81023fd10000, task ffff810246562840) >>> Dec 12 01:09:30 ibd202 kernel: Stack: ffffffff8817b2fb ffff810240fb39b8 >>> 0000000000000000 ffff81024172c5b0 >>> Dec 12 01:09:30 ibd202 kernel: ffffffff8817ec67 ffff81023cda7000 >>> ffffffff8817b2a8 0000000000000000 >>> Dec 12 01:09:30 ibd202 kernel: ffff81023fd11ca0 ffff81023fd11b80 >>> 0000000000000001 ffff81023cda7000 >>> Dec 12 01:09:30 ibd202 kernel: Call Trace: >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_rdma_put_context+0x37/0xb5 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_rdma_recvfrom+0x58f/0x1150 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:svc_rdma_get_context+0x10c/0x128 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> :sunrpc:send_write+0x200/0x22c >>> Dec 12 01:09:30 ibd202 kernel: [] >>> generic_file_readv+0x8e/0xa7 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> zone_statistics+0x40/0x70 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> find_busiest_group+0x21f/0x66f >>> Dec 12 01:09:30 ibd202 kernel: [] >>> _spin_unlock_irq+0x6/0xa >>> Dec 12 01:09:30 ibd202 kernel: [] thread_return+0x64/0xec >>> Dec 12 01:09:30 ibd202 kernel: [] >>> _spin_lock_irqsave+0x9/0xe >>> Dec 12 01:09:30 ibd202 kernel: [] >>> lock_timer_base+0x1b/0x3c >>> Dec 12 01:09:30 ibd202 kernel: [] >>> try_to_del_timer_sync+0x4a/0x51 >>> Dec 12 01:09:30 ibd202 kernel: [] del_timer_sync+0xc/0x16 >>> Dec 12 01:09:30 ibd202 kernel: [] >>> schedule_timeout+0x92/0xad >>> Dec 12 01:09:30 ibd202 kernel: [] >>> 
:sunrpc:svc_recv+0x3c5/0x4be >>> Dec 12 01:09:30 ibd202 kernel: [] >>> default_wake_function+0x0/0xe >>> Dec 12 01:09:30 ibd202 kernel: [] >>> default_wake_function+0x0/0xe >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x10d/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] child_rip+0xa/0x12 >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] :nfsd:nfsd+0x0/0x359 >>> Dec 12 01:09:30 ibd202 kernel: [] child_rip+0x0/0x12 >>> Dec 12 01:09:30 ibd202 kernel: Dec 12 01:09:30 ibd202 kernel: Dec 12 >>> 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f >>> 94 c0 84 c0 74 Dec 12 01:09:30 ibd202 kernel: RIP [] >>> put_page+0x13/0x2e >>> Dec 12 01:09:30 ibd202 kernel: RSP >>> Dec 12 01:09:30 ibd202 kernel: <4>nfsd: terminating on error 22 >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >> From rdreier at cisco.com Tue Dec 12 11:30:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 12 Dec 2006 11:30:38 -0800 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 30 Nov 2006 16:53:41 -0800") References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> Message-ID: OK, I merged 1..5 up for 2.6.20. I had to fix a few conflicts with cma.c cleanups already upstream, and also with the miscdevice conversion from class device to real device (basically what steve wise posted a few days ago). I just pushed the result out in my for-2.6.20 branch if anyone wants to check, and I'll ask Linus to pull soon. 
I did have one question, but we can clean it up later:

> +		if (signal_pending(current)) {
> +			ret = -ERESTARTSYS;
> +			break;
> +		}
> +
> +		prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE);
> +		mutex_unlock(&file->mut);
> +		schedule();
> +		mutex_lock(&file->mut);
> +		finish_wait(&file->poll_wait, &wait);

Is there any reason why this can't just be written with
wait_event_interruptible() instead of this more complex way?

 - R.

From rdreier at cisco.com Tue Dec 12 11:50:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:50:30 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165864386.6867.2.camel@stevo-desktop> (Steve Wise's message of "Mon, 11 Dec 2006 13:13:06 -0600")
References: <1165864386.6867.2.camel@stevo-desktop>
Message-ID:

> Hey Roland, is there a preferred way to handle this? I.e., what's the best
> way of keeping a 2.6.19-based patch set while also trying to merge your
> patches into the latest from Linus's tree?
>
> I guess I can create a branch with a HEAD at 2.6.19 and back-port my
> latest patch set. Is that the best way? Maybe a for-ofed branch?

I don't know if there's a great way to handle it. Basically you need
a branch based at the v2.6.19 tag. There's probably a smart way to
keep the merge between your latest stuff and the backports, but I'm
not sure what it would be exactly.

 - R.

From vishal at endace.com Tue Dec 12 12:21:00 2006
From: vishal at endace.com (vishal)
Date: Wed, 13 Dec 2006 09:21:00 +1300
Subject: [openib-general] srp initiator device discovery
In-Reply-To: <457E707A.4040802@mellanox.com>
References: <1165899109.14308.9.camel@julia.et.endace.com> <457E707A.4040802@mellanox.com>
Message-ID: <1165954860.14308.12.camel@julia.et.endace.com>

Hi,

I have only a single cable connecting the initiator and the target
machine...

Thanks!

Vishal

On Tue, 2006-12-12 at 01:03 -0800, Vu Pham wrote:
> How many cables did you connect from your host to the fabric?
> > If you have two cables (2 ports of same hca or each port of > 2 hcas) connected then you have two paths to same srp > target. Each path will see the same number of luns of srp > target. You can work with dm-multipath/multipath and access > the luns/devices thru /dev/mapper - this will provide you > capability of fail-over/fail-back functionality > > IBGD's srp target only works with scsi devices. It does not > work with block devices (hdX, md, lvm volules ...) > > -vu > > > Hi, > > > > I have srp initiator installed with OFED-1.1, and another machine > > with SRP target (IBGOLD). I started the srp daemon to discover the > > target devices, and then ran fdisk -l to see the list. The list (below) > > shows duplicate devices :- > > > > Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes > > 255 heads, 63 sectors/track, 267349 cylinders > > Units = cylinders of 16065 * 512 = 8225280 bytes > > > > Disk /dev/sdb doesn't contain a valid partition table > > > > Disk /dev/sdc: 2199.0 GB, 2199023255552 bytes > > 255 heads, 63 sectors/track, 267349 cylinders > > Units = cylinders of 16065 * 512 = 8225280 bytes > > > > Device Boot Start End Blocks Id System > > > > Disk /dev/sdd: 500.1 GB, 500107862016 bytes > > 255 heads, 63 sectors/track, 60801 cylinders > > Units = cylinders of 16065 * 512 = 8225280 bytes > > > > Device Boot Start End Blocks Id System > > /dev/sdd1 * 1 13 104391 83 Linux > > /dev/sdd2 14 60801 488279610 8e Linux LVM > > > > Disk /dev/sde: 2199.0 GB, 2199023255552 bytes > > 255 heads, 63 sectors/track, 267349 cylinders > > Units = cylinders of 16065 * 512 = 8225280 bytes > > > > Disk /dev/sde doesn't contain a valid partition table > > > > Disk /dev/sdf: 2199.0 GB, 2199023255552 bytes > > 255 heads, 63 sectors/track, 267349 cylinders > > Units = cylinders of 16065 * 512 = 8225280 bytes > > > > Device Boot Start End Blocks Id System > > > > Disk /dev/sdg: 500.1 GB, 500107862016 bytes > > 255 heads, 63 sectors/track, 60801 cylinders > > Units = cylinders of 
16065 * 512 = 8225280 bytes
> >
> > Device Boot Start End Blocks Id System
> > /dev/sdg1 * 1 13 104391 83 Linux
> > /dev/sdg2 14 60801 488279610 8e Linux LVM
> >
> > Doing some tests I found that sdb=sde, sdc=sdf, and sdd=sdg (obvious).
> >
> > I also tested the device discovery after creating an md device on the
> > target side, and found that the initiator doesn't take into account the
> > presence of an md device. Is this the expected behaviour?
> >
> > Thanks for your time!
> >
> > Vishal
>

From tom at opengridcomputing.com Tue Dec 12 12:23:34 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 12 Dec 2006 14:23:34 -0600
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <457F020D.7010500@mellanox.com>
References: <4579C6C3.5090207@mellanox.com> <457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com> <457E7414.6040802@mellanox.com> <457EEB07.8040904@mellanox.com> <457F020D.7010500@mellanox.com>
Message-ID: <1165955014.8722.82.camel@trinity.ogc.int>

This is just the normal shutdown path. The WR completions are flushes
(see status==5).

On Tue, 2006-12-12 at 11:25 -0800, Vu Pham wrote:

[...snip...]

> >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on
...........................................................^

The bug is that the rdma ctxt is the same for these two WRs, and that
will cause the same pages to be freed twice.

> >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 > >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion ............................................................v > >>> Dec 12 01:09:30 ibd202 kernel: ctxt=ffff81023e4cf400, count=1 on > >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5 From mshefty at ichips.intel.com Tue Dec 12 11:50:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 12 Dec 2006 11:50:36 -0800 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> Message-ID: <457F080C.2090202@ichips.intel.com> > > + if (signal_pending(current)) { > > + ret = -ERESTARTSYS; > > + break; > > + } > > + > > + prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE); > > + mutex_unlock(&file->mut); > > + schedule(); > > + mutex_lock(&file->mut); > > + finish_wait(&file->poll_wait, &wait); > > is there any reason why this can't just be written with > wait_event_interruptible() instead of this more-complex way? I don't think so. The code followed the ucm, which is likely whatever Libor had done. Did umad or uverbs follow this same format at some point? In any case, this and the ucm could probably both be cleaned up. - Sean From tom at opengridcomputing.com Tue Dec 12 12:36:05 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 12 Dec 2006 14:36:05 -0600 Subject: [openib-general] RNFS double page free fix Message-ID: <1165955765.8722.88.camel@trinity.ogc.int> Vu: Thanks for finding this bug. I think I have a fix. Can you please apply it to your server and see if it fixes the problem for you too? 

Thanks,
Tom

Double page free on session shutdown

From: Tom Tucker

---

 net/sunrpc/svc_rdma_recvfrom.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c
index ec62000..059f5ff 100644
--- a/net/sunrpc/svc_rdma_recvfrom.c
+++ b/net/sunrpc/svc_rdma_recvfrom.c
@@ -527,6 +527,7 @@ int svc_rdma_recvfrom(struct svc_rqst *r
 			/* Close the transport */
 			set_bit(SK_CLOSE, &xprt->sk_flags);
 			svc_rdma_put_context(ctxt, 1);
+			ctxt = NULL;
 			goto poll_dto_q;
 		}

From rdreier at cisco.com Tue Dec 12 12:37:01 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 12:37:01 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061210134137.GL29174@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 10 Dec 2006 15:41:37 +0200")
References: <20061129140016.GO5061@mellanox.co.il> <20061205161944.GD30209@mellanox.co.il> <20061210134137.GL29174@mellanox.co.il>
Message-ID:

OK, I merged this up into an ipoib-cm branch and merged it into for-mm
as well. I had to fix some work-struct related stuff and a few other
conflicts, so please look at what I did. Testing wouldn't hurt either
(I didn't have a chance to do more than build it yet).

 - R.

From rdreier at cisco.com Tue Dec 12 12:42:17 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 12:42:17 -0800
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <1165879256.19459.379.camel@localhost> (Matt Leininger's message of "Mon, 11 Dec 2006 15:20:56 -0800")
References: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com> <1165879256.19459.379.camel@localhost>
Message-ID:

> Roland may be able to comment on whether there are performance differences
> for interrupt-driven CQs between the old VAPI stacks and OFED.

I think OFED is probably faster than any other stack I know of...
I think MST's idea of PCI tuning issues is probably right.
Can you send the output of

    lspci -vxxx -d15b3:

with the two stacks?

 - R.

From mst at mellanox.co.il Tue Dec 12 12:49:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 22:49:04 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To:
References:
Message-ID: <20061212204904.GM382@mellanox.co.il>

> OK, I merged this up into an ipoib-cm branch and merged it into for-mm
> as well. I had to fix some work-struct related stuff and a few other
> conflicts, so please look at what I did. Testing wouldn't hurt either
> (I didn't have a chance to do more than build it yet).

OK, thanks. I'll look at it tomorrow.

--
MST

From vu at mellanox.com Tue Dec 12 14:46:03 2006
From: vu at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 14:46:03 -0800
Subject: [openib-general] RNFS double page free fix
In-Reply-To: <1165955765.8722.88.camel@trinity.ogc.int>
References: <1165955765.8722.88.camel@trinity.ogc.int>
Message-ID: <457F312B.8060706@mellanox.com>

Tom,

Thanks a lot. This patch seems to fix the double page free problem.

-vu

> Vu:
>
> Thanks for finding this bug. I think I have a fix.
> Can you please apply it to your server and see if it
> fixes the problem for you too?
> > Thanks, > Tom > > Double page free on session shutdown > > From: Tom Tucker > > --- > > net/sunrpc/svc_rdma_recvfrom.c | 1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c > index ec62000..059f5ff 100644 > --- a/net/sunrpc/svc_rdma_recvfrom.c > +++ b/net/sunrpc/svc_rdma_recvfrom.c > @@ -527,6 +527,7 @@ int svc_rdma_recvfrom(struct svc_rqst *r > /* Close the transport */ > set_bit(SK_CLOSE, &xprt->sk_flags); > svc_rdma_put_context(ctxt, 1); > + ctxt = NULL; > goto poll_dto_q; > } > > > From vu at mellanox.com Tue Dec 12 15:01:07 2006 From: vu at mellanox.com (Vu Pham) Date: Tue, 12 Dec 2006 15:01:07 -0800 Subject: [openib-general] nfsrdma release 7 issues, Message-ID: <457F34B3.9060402@mellanox.com> James, Beside the double page free issue that Tom already fixed, I see the following issues: 1. simultaneous nfsrdmamount from multiple host issue. I see the following error messages ... Dec 12 13:31:40 ibd202 kernel: svcrdma: QP event 4 received for QP=ffff810240f5fa00 Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for QP=ffff810240f5f000 Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for QP=ffff810242cfa400 2. While some clients run I/Os, one idle client try to access the mount point ie. *ls* and get I/O input error. I see these error messages on server log Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22 Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion Dec 12 13:58:29 ibd202 kernel: ctxt=ffff810242130800, count=1 on xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5 ... Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for unknown QP 2e0408 Then the mount point is inaccessible from all clients 3. 
performance issue - I got max 450 MB/s read from server cache (comparing to 800 MB/s with release 6, using the same hw configuration for both client/server) thanks, -vu From tom at opengridcomputing.com Tue Dec 12 15:36:14 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Tue, 12 Dec 2006 17:36:14 -0600 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: <457F34B3.9060402@mellanox.com> References: <457F34B3.9060402@mellanox.com> Message-ID: <1165966574.8722.110.camel@trinity.ogc.int> Vu: See below... On Tue, 2006-12-12 at 15:01 -0800, Vu Pham wrote: > James, > Beside the double page free issue that Tom already fixed, I see the > following issues: > 1. simultaneous nfsrdmamount from multiple host issue. I see the > following error messages > ... > Dec 12 13:31:40 ibd202 kernel: svcrdma: QP event 4 received for > QP=ffff810240f5fa00 > Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for > QP=ffff810240f5f000 > Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for > QP=ffff810242cfa400 This is the known race in the ib cm that resulted in the addition of the rdma_establish interface. For RNFS it is a benign message, but I do need to add the call ...I'm not fond of the rdma_establish solution so I've dragged my feet...Thanks for reminding me ;-) > > 2. While some clients run I/Os, one idle client try to access the mount > point ie. *ls* and get I/O input error. I see these error messages on > server log > > Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22 > Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion > Dec 12 13:58:29 ibd202 kernel: ctxt=ffff810242130800, count=1 on > xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5 > ... > Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for > unknown QP 2e0408 > > Then the mount point is inaccessible from all clients Ooh. This looks bad. This isn't concurrent with issue 1. above is it? Was the "idle" client idle for more than 6 minutes? > > 3. 
performance issue - I got max 450 MB/s read from server cache > (comparing to 800 MB/s with release 6, using the same hw configuration > for both client/server) > Oof... 1. I get much better than this on my MTD1000 hardware with SDR. Can you send me your .config? 2. Can you please send me the iozone test parameters your using? Thanks, Tom > thanks, > -vu From vuhuong at mellanox.com Tue Dec 12 15:59:39 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 12 Dec 2006 15:59:39 -0800 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: <1165966574.8722.110.camel@trinity.ogc.int> References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> Message-ID: <457F426B.7020104@mellanox.com> Tom, > Vu: > > See below... > > On Tue, 2006-12-12 at 15:01 -0800, Vu Pham wrote: >> James, >> Beside the double page free issue that Tom already fixed, I see the >> following issues: >> 1. simultaneous nfsrdmamount from multiple host issue. I see the >> following error messages >> ... >> Dec 12 13:31:40 ibd202 kernel: svcrdma: QP event 4 received for >> QP=ffff810240f5fa00 >> Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for >> QP=ffff810240f5f000 >> Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for >> QP=ffff810242cfa400 > > This is the known race in the ib cm that resulted in the addition of the > rdma_establish interface. For RNFS it is a benign message, but I do need > to add the call ...I'm not fond of the rdma_establish solution so I've > dragged my feet...Thanks for reminding me ;-) > You will hear it from me from release to release ;-) >> 2. While some clients run I/Os, one idle client try to access the mount >> point ie. *ls* and get I/O input error. 
I see these error messages on >> server log >> >> Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22 >> Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion >> Dec 12 13:58:29 ibd202 kernel: ctxt=ffff810242130800, count=1 on >> xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5 >> ... >> Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for >> unknown QP 2e0408 >> >> Then the mount point is inaccessible from all clients > > Ooh. This looks bad. This isn't concurrent with issue 1. above is it? No, I don't think 1,2 are related > Was the "idle" client idle for more than 6 minutes? Yes > >> 3. performance issue - I got max 450 MB/s read from server cache >> (comparing to 800 MB/s with release 6, using the same hw configuration >> for both client/server) >> > > Oof... > > 1. I get much better than this on my MTD1000 hardware with SDR. Can you > send me your .config? Please find it in the attachment > > 2. Can you please send me the iozone test parameters your using? > server has 8GB of mem, client has 2GB of mem iozone -r 64KB -s 5g -i 0 -i 1 and iozone -r 64KB -s 2g -i 0 -i 1 -t 3 thanks, -vu > Thanks, > Tom >> thanks, >> -vu > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: .config URL: From rdreier at cisco.com Tue Dec 12 16:16:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 12 Dec 2006 16:16:29 -0800 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus Finishing up the major 2.6.20 merges, plus some fixes: Krishna Kumar (1): RDMA/amso1100: Fix memory leak in c2_qp_modify() Ralph Campbell (6): IB: Add DMA mapping functions to allow device drivers to interpose IB/ipath: Implement new verbs DMA mapping functions IB/core: Use the new verbs DMA mapping functions IPoIB: Use the new verbs DMA mapping functions IB/srp: Use new verbs IB DMA mapping functions IB/iser: Use the new verbs DMA mapping functions Roland Dreier (5): IB/fmr: ib_flush_fmr_pool() may wait too long IB/ipath: Remove unused "write-only" variables IB/iser: Remove unused "write-only" variables IB/ipath: Fix IRQ for PCI Express HCAs IPoIB: Make sure struct ipoib_neigh.queue is always initialized Sean Hefty (5): RDMA/cma: Remove unneeded qp_type parameter from rdma_cm RDMA/cma: Report connect info with connect events RDMA/cma: Allow early transition to RTS to handle lost CM messages RDMA/cma: Add support for RDMA_PS_UDP RDMA/cma: Export rdma cm interface to userspace drivers/infiniband/core/Makefile | 6 +- drivers/infiniband/core/cm.c | 4 + drivers/infiniband/core/cma.c | 416 +++++++++--- drivers/infiniband/core/fmr_pool.c | 12 +- drivers/infiniband/core/mad.c | 90 ++-- drivers/infiniband/core/mad_priv.h | 6 +- drivers/infiniband/core/ucma.c | 874 +++++++++++++++++++++++++ drivers/infiniband/core/uverbs_marshall.c | 5 +- drivers/infiniband/core/uverbs_mem.c | 12 +- drivers/infiniband/hw/amso1100/c2_qp.c | 13 +- drivers/infiniband/hw/ipath/Makefile | 1 + drivers/infiniband/hw/ipath/ipath_dma.c | 189 
++++++ drivers/infiniband/hw/ipath/ipath_driver.c | 4 +- drivers/infiniband/hw/ipath/ipath_file_ops.c | 5 +- drivers/infiniband/hw/ipath/ipath_iba6110.c | 3 +- drivers/infiniband/hw/ipath/ipath_iba6120.c | 8 +- drivers/infiniband/hw/ipath/ipath_init_chip.c | 3 +- drivers/infiniband/hw/ipath/ipath_intr.c | 3 +- drivers/infiniband/hw/ipath/ipath_keys.c | 8 +- drivers/infiniband/hw/ipath/ipath_mr.c | 7 +- drivers/infiniband/hw/ipath/ipath_sysfs.c | 3 - drivers/infiniband/hw/ipath/ipath_verbs.c | 1 + drivers/infiniband/hw/ipath/ipath_verbs.h | 2 + drivers/infiniband/ulp/ipoib/ipoib.h | 4 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 75 +-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 3 +- drivers/infiniband/ulp/iser/iscsi_iser.h | 2 +- drivers/infiniband/ulp/iser/iser_initiator.c | 4 - drivers/infiniband/ulp/iser/iser_memory.c | 125 ++-- drivers/infiniband/ulp/srp/ib_srp.c | 81 ++- drivers/infiniband/ulp/srp/ib_srp.h | 2 +- include/rdma/ib_marshall.h | 5 +- include/rdma/ib_verbs.h | 253 +++++++ include/rdma/rdma_cm.h | 62 ++- include/rdma/rdma_cm_ib.h | 3 + include/rdma/rdma_user_cm.h | 206 ++++++ 36 files changed, 2146 insertions(+), 354 deletions(-) create mode 100644 drivers/infiniband/core/ucma.c create mode 100644 drivers/infiniband/hw/ipath/ipath_dma.c create mode 100644 include/rdma/rdma_user_cm.h From eitan at sw053.yok.mtl.com Tue Dec 12 21:13:01 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Wed, 13 Dec 2006 07:13:01 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-13:normal completion Message-ID: <200612130513.kBD5D1se025785@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = ____ ibutils rev = ____ Total=572 Pass=571 Fail=1 Pass: 78 Stability IS1-16.topo 78 Pkey IS1-16.topo 78 Multicast IS1-16.topo 78 LidMgr IS1-16.topo 77 OsmStress IS1-16.topo 26 Stability IS3-loop.topo 26 Stability IS3-128.topo 26 Pkey IS3-128.topo 26 OsmStress IS3-128.topo 26 Multicast IS3-loop.topo 26 Multicast IS3-128.topo 26 LidMgr IS3-128.topo 
Failures:
1 OsmStress IS1-16.topo

From k_mahesh85 at yahoo.co.in Tue Dec 12 22:55:13 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Wed, 13 Dec 2006 06:55:13 +0000 (GMT)
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver
Message-ID: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>

Hello all,

I want to know whether it is necessary to implement process_mad for an HCA.

After looking into the implementations of process_mad in the ipath and
mthca drivers, I have found that they are used to reply to MADs with
port_info, gid_info, sm_info etc.

But isn't that handled by the SMA in the host? I am a little bit confused
now. Please just tell me whether it is required to implement process_mad
for (suppose) a new HCA driver. If it is required, why?

Please CC your replies to me.

regards,
K.Mahesh.

From ogerlitz at voltaire.com Tue Dec 12 23:54:28 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 13 Dec 2006 09:54:28 +0200
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA mapping functions
In-Reply-To: <1165947778.14800.315.camel@brick.pathscale.com>
References: <1165517253.14800.283.camel@brick.pathscale.com> <457BD18D.7000403@voltaire.com> <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com> <457E6DAE.3040206@voltaire.com> <1165947778.14800.315.camel@brick.pathscale.com>
Message-ID: <457FB1B4.2010809@voltaire.com>

Ralph Campbell wrote:
> On Tue, 2006-12-12 at 10:51 +0200, Or Gerlitz wrote:
>> Roland Dreier wrote:
>>> > I would like to see this last set of patches integrated as is.
>>> > I would like to get more experience with the current implementation
>>> > before extending it to support other configurations.
>>>
>>> Yeah, let's go with that. Since ipath depends on 64BIT in Kconfig
>>> anyway I think this is OK for now.
>> This design of ib_dma_map_single, ib_sg_dma_address etc returning u64
>> instead of dma_addr_t causes the resulted patch to the IB ULPs to be
>> quite big.
>
> I think it was you who pointed out that dma_addr_t is
> 32 bits on sparc64. Did you have a different solution
> in mind?

To be precise, I pointed out a problem and you came up with the solution
of having ib_dma_map_xxx work with u64 instead of dma_addr_t. As Roland
suggested, you could have implemented a SW IOTLB that works with
dma_addr_t, and you chose not to.

>> Have you tested any dma_map single (eg IPoIB) and sg (eg SRP or iSER)
>> consumer with this code?
> Yes.

The new API (eg ib_dma_map_xxx, ib_sg_dma_address and ib_sg_dma_len) adds
some branching on each call. I wonder if you have seen any performance
difference before/after the change, specifically with IPoIB running a test
with many PPS (ie iperf udp) or an SRP IOPS test?

Or.

From ogerlitz at voltaire.com Wed Dec 13 00:22:03 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 13 Dec 2006 10:22:03 +0200
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To:
References:
Message-ID: <457FB82B.4090902@voltaire.com>

Roland Dreier wrote:
> Linus, please pull from
>
> master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
>
> This tree is also available from kernel.org mirrors at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
>
> Finishing up the major 2.6.20 merges, plus some fixes:

Roland, you have CC-ed lkml at cisco.com on this email. Is there a chance
you wanted to CC linux-kernel at vger.kernel.org instead?

May I ask what prevented v3 of the mthca profile patch (see
http://article.gmane.org/gmane.linux.drivers.openib/34005) from getting in?

Or.
> > Krishna Kumar (1): > RDMA/amso1100: Fix memory leak in c2_qp_modify() > > Ralph Campbell (6): > IB: Add DMA mapping functions to allow device drivers to interpose > IB/ipath: Implement new verbs DMA mapping functions > IB/core: Use the new verbs DMA mapping functions > IPoIB: Use the new verbs DMA mapping functions > IB/srp: Use new verbs IB DMA mapping functions > IB/iser: Use the new verbs DMA mapping functions > > Roland Dreier (5): > IB/fmr: ib_flush_fmr_pool() may wait too long > IB/ipath: Remove unused "write-only" variables > IB/iser: Remove unused "write-only" variables > IB/ipath: Fix IRQ for PCI Express HCAs > IPoIB: Make sure struct ipoib_neigh.queue is always initialized > > Sean Hefty (5): > RDMA/cma: Remove unneeded qp_type parameter from rdma_cm > RDMA/cma: Report connect info with connect events > RDMA/cma: Allow early transition to RTS to handle lost CM messages > RDMA/cma: Add support for RDMA_PS_UDP > RDMA/cma: Export rdma cm interface to userspace > > drivers/infiniband/core/Makefile | 6 +- > drivers/infiniband/core/cm.c | 4 + > drivers/infiniband/core/cma.c | 416 +++++++++--- > drivers/infiniband/core/fmr_pool.c | 12 +- > drivers/infiniband/core/mad.c | 90 ++-- > drivers/infiniband/core/mad_priv.h | 6 +- > drivers/infiniband/core/ucma.c | 874 +++++++++++++++++++++++++ > drivers/infiniband/core/uverbs_marshall.c | 5 +- > drivers/infiniband/core/uverbs_mem.c | 12 +- > drivers/infiniband/hw/amso1100/c2_qp.c | 13 +- > drivers/infiniband/hw/ipath/Makefile | 1 + > drivers/infiniband/hw/ipath/ipath_dma.c | 189 ++++++ > drivers/infiniband/hw/ipath/ipath_driver.c | 4 +- > drivers/infiniband/hw/ipath/ipath_file_ops.c | 5 +- > drivers/infiniband/hw/ipath/ipath_iba6110.c | 3 +- > drivers/infiniband/hw/ipath/ipath_iba6120.c | 8 +- > drivers/infiniband/hw/ipath/ipath_init_chip.c | 3 +- > drivers/infiniband/hw/ipath/ipath_intr.c | 3 +- > drivers/infiniband/hw/ipath/ipath_keys.c | 8 +- > drivers/infiniband/hw/ipath/ipath_mr.c | 7 +- > 
drivers/infiniband/hw/ipath/ipath_sysfs.c       |    3 -
> drivers/infiniband/hw/ipath/ipath_verbs.c     |    1 +
> drivers/infiniband/hw/ipath/ipath_verbs.h     |    2 +
> drivers/infiniband/ulp/ipoib/ipoib.h          |    4 +-
> drivers/infiniband/ulp/ipoib/ipoib_ib.c       |   75 +--
> drivers/infiniband/ulp/ipoib/ipoib_main.c     |    3 +-
> drivers/infiniband/ulp/iser/iscsi_iser.h      |    2 +-
> drivers/infiniband/ulp/iser/iser_initiator.c  |    4 -
> drivers/infiniband/ulp/iser/iser_memory.c     |  125 ++--
> drivers/infiniband/ulp/srp/ib_srp.c           |   81 ++-
> drivers/infiniband/ulp/srp/ib_srp.h           |    2 +-
> include/rdma/ib_marshall.h                    |    5 +-
> include/rdma/ib_verbs.h                       |  253 +++++++
> include/rdma/rdma_cm.h                        |   62 ++-
> include/rdma/rdma_cm_ib.h                     |    3 +
> include/rdma/rdma_user_cm.h                   |  206 ++++++
> 36 files changed, 2146 insertions(+), 354 deletions(-)
> create mode 100644 drivers/infiniband/core/ucma.c
> create mode 100644 drivers/infiniband/hw/ipath/ipath_dma.c
> create mode 100644 include/rdma/rdma_user_cm.h

From halr at voltaire.com Wed Dec 13 03:43:39 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Dec 2006 06:43:39 -0500
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver
In-Reply-To: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>
References: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>
Message-ID: <1166010208.28709.59772.camel@hal.voltaire.com>

On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote:
> Hello all,
>
> I want to know from u people that isi it necessary to implement the
> process_mad for a HCA.
>
> After looking into the implementations of process_mad in ipath and
> mthca drivers i have fount that they are used to reply the MADs with
> port_info,gid_info,sm_info etc..
>
> But isn't it handled by SMA in the host......

The SMA can be either in the host or in firmware (as is typical with the
Mellanox silicon).

> i am little bit confused now .
> please just whether it is required to implement process_mad (suppose)
> for new HCA driver....

It is.
For an example of a host-based (software) SMA, see
drivers/infiniband/hw/ipath/ipath_mad.c

> if it is required why?

The driver is needed to obtain the information for the IB node to fill in
the MADs sent in response to SMA queries. It may also issue some traps.
The same applies to the PMA.

-- Hal

> Please CC your replies to me.
>
> regards,
> K.Mahesh.
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From mst at mellanox.co.il Wed Dec 13 03:49:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 13 Dec 2006 13:49:16 +0200
Subject: [openib-general] [PATCH] mthca: move code from post send to post receive
Message-ID: <20061213114916.GA23726@mellanox.co.il>

Place SQ wrid's first in wrid buffer, to eliminate an add operation in the
send datapath. This keeps binary size constant, moving code from post send
to post receive: post send is a latency-sensitive operation, while post
receive is done beforehand, so it's not. Additionally, a generic ULP mixing
send and RDMA does more post sends than post receives (RDMA does not have a
matching post receive).

Signed-off-by: Jack Morgenstein
Signed-off-by: Michael S. Tsirkin
---

While unlikely to give a large gain, this makes sense to me.
Please consider for 2.6.20.
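
A minimal sketch of the indexing change described above, assuming one shared
wrid array with send entries at [0, sq_max) and receive entries at
[sq_max, sq_max + rq_max). The struct and the toy_* helpers are hypothetical
stand-ins for illustration, not the mthca structures or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the QP bookkeeping; not the mthca structs. */
struct toy_qp {
	unsigned sq_max;   /* number of send queue entries */
	unsigned rq_max;   /* number of receive queue entries */
	uint64_t wrid[8];  /* sq_max + rq_max slots: sends first, receives after */
};

/* Send path: with sends stored first, the slot is just the WQE index. */
static void toy_post_send(struct toy_qp *qp, unsigned ind, uint64_t wr_id)
{
	assert(ind < qp->sq_max);
	qp->wrid[ind] = wr_id;
}

/* Receive path: the offset addition moves here, off the send datapath. */
static void toy_post_recv(struct toy_qp *qp, unsigned ind, uint64_t wr_id)
{
	assert(ind < qp->rq_max);
	qp->wrid[ind + qp->sq_max] = wr_id;
}

/* Completion lookup has to use the same layout as the post routines. */
static uint64_t toy_poll(const struct toy_qp *qp, int is_send, unsigned ind)
{
	return is_send ? qp->wrid[ind] : qp->wrid[ind + qp->sq_max];
}
```

The point of the sketch is only that the post routines and the completion
lookup must agree on which queue gets the offset; which queue that is does
not change the total amount of work, only where it is done.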
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 149b369..433f9a8 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -537,8 +537,7 @@ static inline int mthca_poll_one(struct wq = &(*cur_qp)->sq; wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset) >> wq->wqe_shift); - entry->wr_id = (*cur_qp)->wrid[wqe_index + - (*cur_qp)->rq.max]; + entry->wr_id = (*cur_qp)->wrid[wqe_index]; } else if ((*cur_qp)->ibqp.srq) { struct mthca_srq *srq = to_msrq((*cur_qp)->ibqp.srq); u32 wqe = be32_to_cpu(cqe->wqe); @@ -558,7 +557,7 @@ static inline int mthca_poll_one(struct */ if (unlikely(wqe_index < 0)) wqe_index = wq->max - 1; - entry->wr_id = (*cur_qp)->wrid[wqe_index]; + entry->wr_id = (*cur_qp)->wrid[wqe_index + (*cur_qp)->sq.max]; } if (wq) { diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c index 6a7822e..9e6f715 100644 --- a/drivers/infiniband/hw/mthca/mthca_qp.c +++ b/drivers/infiniband/hw/mthca/mthca_qp.c @@ -1690,7 +1690,7 @@ int mthca_tavor_post_send(struct ib_qp * size += sizeof (struct mthca_data_seg) / 16; } - qp->wrid[ind + qp->rq.max] = wr->wr_id; + qp->wrid[ind] = wr->wr_id; if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) { mthca_err(dev, "opcode invalid\n"); @@ -1810,7 +1810,7 @@ int mthca_tavor_post_receive(struct ib_q size += sizeof (struct mthca_data_seg) / 16; } - qp->wrid[ind] = wr->wr_id; + qp->wrid[ind + qp->sq.max] = wr->wr_id; ((struct mthca_next_seg *) prev_wqe)->nda_op = cpu_to_be32((ind << qp->rq.wqe_shift) | 1); @@ -2068,7 +2068,7 @@ int mthca_arbel_post_send(struct ib_qp * size += sizeof (struct mthca_data_seg) / 16; } - qp->wrid[ind + qp->rq.max] = wr->wr_id; + qp->wrid[ind] = wr->wr_id; if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) { mthca_err(dev, "opcode invalid\n"); @@ -2192,7 +2192,7 @@ int mthca_arbel_post_receive(struct ib_q ((struct mthca_data_seg *) wqe)->addr = 0; } - qp->wrid[ind] = wr->wr_id; 
+ qp->wrid[ind + qp->sq.max] = wr->wr_id; ++ind; if (unlikely(ind >= qp->rq.max)) -- MST From halr at voltaire.com Wed Dec 13 03:52:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 13 Dec 2006 06:52:43 -0500 Subject: [openib-general] mad_agents In-Reply-To: <000e01c71e1a$46939ad0$21606d86@one7> References: <000e01c71e1a$46939ad0$21606d86@one7> Message-ID: <1166010756.28709.60158.camel@hal.voltaire.com> Hi Michael, On Tue, 2006-12-12 at 13:21, Michael Arndt wrote: > Hi, > > the following statements about functions and modules refer to the mad.c, > agent.c and user_mad.c file. > > during the initialisation of the mad module a funktion ib_agent_port_open is > called(ib_mad_init_device -> ib_mad_port_open). At this point an agent is > registered (ib_register_mad_agent), without a MAD registration request > applied. So my question is, what is this agent for? When there is no registration, that means those agents are "send only" agents. "Send only" means the agent will only receive solicited responses and will not receive any unsolicited MADs. Those agents that are started are for SMI (QP0) and GSI (QP1). The SMA sits on QP0 (shared with SM). Many GS agents (including the PMA, also SA) sit on top of QP1. > And is it right that the agent registered by the umad module > (ib_umad_ioctl -> ib_umad_reg_agent -> ib_register_mad_agent) gets all the > SMP packets from the device and passes them to the SM (read and > FileDescriptior). user_mad registrations occur via the ioctl. It only gets those packets it registers for. These can include SMPs as well as GMPs depending on user agents registered. The diagnostics use these (DR SMPs, LR SMPs, and GMPs). The agent only gets those MADs that the SMA does not handle. This is done via the status passed back to process_mad (IB_MAD_RESULT_XXXXX). The SM registers for request/response matching on both SM and SA classes with different method masks (as different methods apply). There are also some unsolicited receives (e.g. 
traps) to be handled.

When request/response matching is used, the agent is determined by the
high 32 bits of the transaction ID, which is overwritten in the (send of
the) request. Those 32 bits are the agent ID and are used to demux to the
proper agent when the response (or timeout) occurs.

> What is about the SMA? Where are the SMPs filtered between SMA and SM?

process_mad in the MAD layer passes them to the driver (mthca_mad.c for
one example) for filtering. This filtering is based on the status returned
(IB_MAD_RESULT_XXXX in ib_mad.h).

> I also would like to say that it would be really nice if there would be some
> papers, diagrams, grafics or anything else which explain how the whole
> openib system works. The source code as only reference isn't really helping
> for new developer.

Yes, that would be nice. Perhaps you can help here.

-- Hal

> Thanks Michael

From yangdong at ncic.ac.cn Wed Dec 13 04:29:29 2006
From: yangdong at ncic.ac.cn (yangdong)
Date: Wed, 13 Dec 2006 20:29:29 +0800
Subject: [openib-general] non-transparent integration with SDP in ofed
Message-ID: <457FF229.9020306@ncic.ac.cn>

Hello all:

I have run into a problem. For non-transparent integration with SDP, no
special environment variable is necessary. It is only required to
recompile the application replacing AF_INET with AF_INET_SDP. The
constant AF_INET_SDP is defined in sdp_inet.h.

When I want to make use of non-transparent integration with SDP in the
kernel, I can replace PF_INET with PF_INET_SDP and AF_INET with
AF_INET_SDP, then recompile the kernel module.

e.g.
    sock_create_kern (PF_INET_SDP, SOCK_STREAM, IPPROTO_TCP, &new_sock);
    sin.sin_family = AF_INET;

That worked well when I used IBGD, but it does not work with OFED. I want
to know what I should do.
Some info :
./ibstat
CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.3.2
        Hardware version: a1
        Node GUID: 0x0002c90200004c68
        System image GUID: 0x0002c90200004c6b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 7
                LMC: 0
                SM lid: 33
                Capability mask: 0x00510a68
                Port GUID: 0x0002c90200004c69
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00510a68
                Port GUID: 0x0002c90200004c6a

From tziporet at mellanox.co.il Wed Dec 13 08:06:45 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 13 Dec 2006 18:06:45 +0200
Subject: [openib-general] OFED 1.2 howto update
Message-ID: <45802515.20605@mellanox.co.il>

Hi All,

I added documentation on the Wiki on how to add/change a component for
the OFED 1.2 development package:
https://openib.org/tiki/tiki-index.php?page=OFED+1.2+HowTo

These instructions should be used to add the new components that we agreed
on in the previous meeting (VNIC, iWARP).

Please take a look and send me any questions you have, so that I can
improve this page.

Tziporet

From bos at pathscale.com Wed Dec 13 09:56:25 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 13 Dec 2006 09:56:25 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <0a3901c71e1e$c431f910$0281a8c0@ebpc>
References: <0a3901c71e1e$c431f910$0281a8c0@ebpc>
Message-ID: <45803EC9.7020004@pathscale.com>

Eric Barton wrote:
> Blood one the floor somewhere I'd hope :)
>
> Or maybe just no #define for the version, since the person doing the
> backport clearly isn't worried about compatibility with out-of-tree
> code.

You're better off planning for the backport mess than hoping for API
version definitions that will not be reliably present. Getting driver
code to compile will be the least of your worries.

Hi, Andrew -

Here's a suitably renamed uncached-read memcpy. I hope the name is now
self-explanatory.
Message-ID: In cases where a large incoming RDMA is being received, we have to copy data inside the interrupt handler before we can ACK each packet. The source is DMAed to by the hardware, which means that the CPU won't have it cached. We only read the source this one time; using normal load instructions pollutes the dcache with useless data, reducing performance to the point where we can lose a significant number of packets. We use memcpy_uncached_read to try to not fill the dcache with useless data. Avoiding the cache refill penalty lets us keep up better with the sender, resulting in many fewer dropped packets. Signed-off-by: Bryan O'Sullivan diff -r e7c3b265254b -r f25d77f76998 drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 13 09:51:09 2006 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Wed Dec 13 09:51:09 2006 -0800 @@ -167,7 +167,7 @@ void ipath_copy_sge(struct ipath_sge_sta BUG_ON(len == 0); if (len > length) len = length; - memcpy(sge->vaddr, data, len); + memcpy_uncached_read(sge->vaddr, data, len); sge->vaddr += len; sge->length -= len; sge->sge_length -= len; From bos at pathscale.com Wed Dec 13 08:57:20 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 13 Dec 2006 09:57:20 -0700 Subject: [openib-general] [PATCH 1 of 2] Add memcpy_uncached_read, a memcpy that tries to reduce cache pressure In-Reply-To: Message-ID: This copy routine is memcpy-compatible, but on some architectures will use cache-bypassing loads to avoid bringing the source data into the cache. One case where this is useful is when a device issues a DMA to a memory region, and the CPU must copy the DMAed data elsewhere before doing any work with it. Since the source data is read-once, write-never from the CPU's perspective, caching the data at those addresses can only evict potentially useful data. We provide an x86_64 implementation that uses SSE non-temporal loads, and a generic version that falls back to plain memcpy. 
Implementors for other arches should not use cache-bypassing stores to the destination, as in most cases, the destination is accessed almost immediately after a copy finishes. Signed-off-by: Bryan O'Sullivan diff -r 4a0c3ede5076 -r e7c3b265254b arch/x86_64/lib/Makefile --- a/arch/x86_64/lib/Makefile Tue Dec 12 10:43:21 2006 -0800 +++ b/arch/x86_64/lib/Makefile Wed Dec 13 09:51:09 2006 -0800 @@ -9,4 +9,5 @@ lib-y := csum-partial.o csum-copy.o csum lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \ usercopy.o getuser.o putuser.o \ thunk.o clear_page.o copy_page.o bitstr.o bitops.o -lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o +lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o \ + memcpy_uncached_read.o diff -r 4a0c3ede5076 -r e7c3b265254b arch/x86_64/lib/memcpy_uncached_read.S --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/arch/x86_64/lib/memcpy_uncached_read.S Wed Dec 13 09:51:09 2006 -0800 @@ -0,0 +1,142 @@ +/* + * Copyright (c) 2006 QLogic Corporation. All Rights Reserved. + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA. 
+ */ + +/* + * memcpy_uncached_read - memcpy-compatible copy routine, using streaming loads + * @dest: destination address + * @src: source address (will not be cached) + * @count: number of bytes to copy + * + * Use streaming loads and normal stores for a special-case copy where + * we know we won't be reading the source again, but will be reading the + * destination again soon. + */ + .text + .p2align 4,,15 + /* rdi destination, rsi source, rdx count */ + .globl memcpy_uncached_read + .type memcpy_uncached_read, @function +memcpy_uncached_read: + movq %rdi, %rax +.L5: + cmpq $15, %rdx + ja .L34 +.L3: + cmpl $8, %edx /* rdx is 0..15 */ + jbe .L9 +.L6: + testb $8, %dxl /* rdx is 3,5,6,7,9..15 */ + je .L13 + movq (%rsi), %rcx + addq $8, %rsi + movq %rcx, (%rdi) + addq $8, %rdi +.L13: + testb $4, %dxl + je .L15 + movl (%rsi), %ecx + addq $4, %rsi + movl %ecx, (%rdi) + addq $4, %rdi +.L15: + testb $2, %dxl + je .L17 + movzwl (%rsi), %ecx + addq $2, %rsi + movw %cx, (%rdi) + addq $2, %rdi +.L17: + testb $1, %dxl + je .L33 +.L1: + movzbl (%rsi), %ecx + movb %cl, (%rdi) +.L33: + ret +.L34: + cmpq $63, %rdx /* rdx is > 15 */ + ja .L64 + movl $16, %ecx /* rdx is 16..63 */ +.L25: + movq 8(%rsi), %r8 + movq (%rsi), %r9 + addq %rcx, %rsi + movq %r8, 8(%rdi) + movq %r9, (%rdi) + addq %rcx, %rdi + subq %rcx, %rdx + cmpl %edx, %ecx /* is rdx >= 16? */ + jbe .L25 + jmp .L3 /* rdx is 0..15 */ + .p2align 4,,7 +.L64: + movl $64, %ecx +.L42: + prefetchnta 128(%rsi) + movq (%rsi), %r8 + movq 8(%rsi), %r9 + movq 16(%rsi), %r10 + movq 24(%rsi), %r11 + subq %rcx, %rdx + movq %r8, (%rdi) + movq 32(%rsi), %r8 + movq %r9, 8(%rdi) + movq 40(%rsi), %r9 + movq %r10, 16(%rdi) + movq 48(%rsi), %r10 + movq %r11, 24(%rdi) + movq 56(%rsi), %r11 + addq %rcx, %rsi + movq %r8, 32(%rdi) + movq %r9, 40(%rdi) + movq %r10, 48(%rdi) + movq %r11, 56(%rdi) + addq %rcx, %rdi + cmpq %rdx, %rcx /* is rdx >= 64? 
*/ + jbe .L42 + sfence + orl %edx, %edx + je .L33 + jmp .L5 +.L9: + jmp *.L12(,%rdx,8) /* rdx is 0..8 */ + .section .rodata + .align 8 + .align 4 +.L12: + .quad .L33 + .quad .L1 + .quad .L2 + .quad .L6 + .quad .L4 + .quad .L6 + .quad .L6 + .quad .L6 + .quad .L8 + .text +.L2: + movzwl (%rsi), %ecx + movw %cx, (%rdi) + ret +.L4: + movl (%rsi), %ecx + movl %ecx, (%rdi) + ret +.L8: + movq (%rsi), %rcx + movq %rcx, (%rdi) + ret diff -r 4a0c3ede5076 -r e7c3b265254b include/asm-x86_64/string.h --- a/include/asm-x86_64/string.h Tue Dec 12 10:43:21 2006 -0800 +++ b/include/asm-x86_64/string.h Wed Dec 13 09:51:09 2006 -0800 @@ -39,6 +39,8 @@ extern void *__memcpy(void *to, const vo __ret = __builtin_memcpy((dst),(src),__len); \ __ret; }) +#define __HAVE_ARCH_MEMCPY_UNCACHED_READ +extern void *memcpy_uncached_read(void *to, const void *from, size_t len); #define __HAVE_ARCH_MEMSET void *memset(void *s, int c, size_t n); diff -r 4a0c3ede5076 -r e7c3b265254b include/linux/string.h --- a/include/linux/string.h Tue Dec 12 10:43:21 2006 -0800 +++ b/include/linux/string.h Wed Dec 13 09:51:09 2006 -0800 @@ -85,6 +85,9 @@ extern void * memset(void *,int,__kernel #ifndef __HAVE_ARCH_MEMCPY extern void * memcpy(void *,const void *,__kernel_size_t); #endif +#ifndef __HAVE_ARCH_MEMCPY_UNCACHED_READ +#define memcpy_uncached_read(dest, src, count) memcpy((dest), (src), (count)) +#endif #ifndef __HAVE_ARCH_MEMMOVE extern void * memmove(void *,const void *,__kernel_size_t); #endif From philippe.bernadat at hp.com Wed Dec 13 10:02:09 2006 From: philippe.bernadat at hp.com (Philippe Bernadat) Date: Wed, 13 Dec 2006 19:02:09 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <0b8901c71ed3$e9b9f740$0281a8c0@ebpc> References: <0b8901c71ed3$e9b9f740$0281a8c0@ebpc> Message-ID: <45804021.9050209@hp.com> Roland, Attached are the two lspci outputs. 
The only differences I see are:

[philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
1d0
< pcilib: Resource 5 in /sys/bus/pci/devices/0000:00:1f.1/resource has a 64-bit address, ignoring
40c39
< 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
---
> 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
[philippe at hamish o2ib]$

> > Roland may be able to comment on if their are performance difference
> > for interrupt-drive CQ between the old VAPI stacks and OFED.
>
> I think OFED is probably faster than any other stack I know of...
>
> I think MST's idea of PCI tuning issues is probably right. Can you
> send the output of
>
> lspci -vxxx -d15b3:
>
> with the two stacks?
>
> - R.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lspci.ofed
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lspci.vib
URL:

From mst at mellanox.co.il Wed Dec 13 10:09:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 13 Dec 2006 20:09:16 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly
In-Reply-To: <20061212151039.GJ26613@mellanox.co.il>
References: <20061212151039.GJ26613@mellanox.co.il>
Message-ID: <20061213180916.GA1689@mellanox.co.il>

Speed up memory registration by filling in MTTs directly. This reduces the
number of FW commands needed to register an MR by at least a factor of 2.
This applies to all memfree cards, and to tavor mode on 64 bit systems with
the patch I posted earlier.

Signed-off-by: Michael S. Tsirkin
---

Roland, the previous version of this patch had a bug on memfree.
I noticed you didn't push these patches out to Linus yet so I did a
re-spin. Let me know if you prefer an incremental patch.

This applies on top of "make all MRs accessible for FMR mapping".
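
A rough sketch of the chunked-write idea in this patch, assuming a flat
in-memory table and a caller-supplied per-call limit. The toy_* names, the
64-entry table, and the value of TOY_FLAG_PRESENT are assumptions made for
illustration; this is not the driver code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MTT_ENTRIES  64
#define TOY_FLAG_PRESENT 1ULL  /* assumed "present" bit, for illustration only */

/* Hypothetical stand-in for the MTT table; not the mthca structures. */
struct toy_mtt {
	uint64_t entry[TOY_MTT_ENTRIES];
	int calls;  /* number of segment writes issued so far */
};

/* Write one contiguous run of entries directly, marking each as present. */
static void toy_write_seg(struct toy_mtt *t, int start,
			  const uint64_t *buf, int len)
{
	assert(start >= 0 && start + len <= TOY_MTT_ENTRIES);
	for (int i = 0; i < len; ++i)
		t->entry[start + i] = buf[i] | TOY_FLAG_PRESENT;
	t->calls++;
}

/* Split a long page list into runs of at most max_chunk entries. */
static int toy_write_mtt(struct toy_mtt *t, int start,
			 const uint64_t *buf, int len, int max_chunk)
{
	assert(max_chunk > 0);
	while (len > 0) {
		int chunk = len < max_chunk ? len : max_chunk;

		toy_write_seg(t, start, buf, chunk);
		len   -= chunk;
		start += chunk;
		buf   += chunk;
	}
	return 0;
}
```

The loop advances the table index and the buffer pointer together, so each
segment write sees a self-contained run no longer than the limit.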
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h @@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd); void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd); +int mthca_write_mtt_size(struct mthca_dev *dev); + struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size); void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt); int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c @@ -244,8 +244,8 @@ void mthca_free_mtt(struct mthca_dev *de kfree(mtt); } -int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, - int start_index, u64 *buffer_list, int list_len) +static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) { struct mthca_mailbox *mailbox; __be64 *mtt_entry; @@ -296,6 +296,84 @@ out: return err; } +void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + u64 __iomem *mtts; + u32 mtt_seg; + int i; + + mtt_seg = mtt->first_seg * MTHCA_MTT_SEG_SIZE; + mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64); + for (i = 0; i < list_len; ++i) { + __be64 mtt_entry = cpu_to_be64(buffer_list[i] | + MTHCA_MTT_FLAG_PRESENT); + mthca_write64_raw(mtt_entry, mtts + i); + } +} + +void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + __be64 *mtts; + int i; + int s = start_index * 
sizeof (u64); + + /* For Arbel, all MTTs must fit in the same page. */ + BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE); + /* Require full segments */ + BUG_ON(s % MTHCA_MTT_SEG_SIZE); + + mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg + + s / MTHCA_MTT_SEG_SIZE); + + BUG_ON(!mtts); + + for (i = 0; i < list_len; ++i) + mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT); +} + +int mthca_write_mtt_size(struct mthca_dev *dev) +{ + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + /* + * Be friendly to WRITE_MTT command + * and leave two empty slots for the + * index and reserved fields of the + * mailbox. + */ + return PAGE_SIZE / sizeof (u64) - 2; + + /* For Arbel, all MTTs must fit in the same page. */ + return mthca_is_memfree(dev) ? (PAGE_SIZE / sizeof (u64)) : 0x7ffffff; +} + +int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt, + int start_index, u64 *buffer_list, int list_len) +{ + int size = mthca_write_mtt_size(dev); + int chunk; + + if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy) + return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len); + + while (list_len > 0) { + chunk = min(size, list_len); + if (mthca_is_memfree(dev)) + mthca_arbel_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + else + mthca_tavor_write_mtt_seg(dev, mtt, start_index, + buffer_list, chunk); + + list_len -= chunk; + start_index += chunk; + buffer_list += chunk; + } + + return 0; +} + static inline u32 tavor_hw_index_to_key(u32 ind) { return ind; Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s int shift, n, len; int i, j, k; int err = 0; + int write_mtt_size; shift = ffs(region->page_size) - 1; @@ -1040,6 
+1041,8 @@ static struct ib_mr *mthca_reg_user_mr(s i = n = 0; + write_mtt_size = min(mthca_write_mtt_size(dev), (int)(PAGE_SIZE / sizeof *pages)); + list_for_each_entry(chunk, &region->chunk_list, list) for (j = 0; j < chunk->nmap; ++j) { len = sg_dma_len(&chunk->page_list[j]) >> shift; @@ -1047,14 +1050,11 @@ static struct ib_mr *mthca_reg_user_mr(s pages[i++] = sg_dma_address(&chunk->page_list[j]) + region->page_size * k; /* - * Be friendly to WRITE_MTT command - * and leave two empty slots for the - * index and reserved fields of the - * mailbox. + * Be friendly to write_mtt and pass it chunks + * of appropriate size. */ - if (i == PAGE_SIZE / sizeof (u64) - 2) { - err = mthca_write_mtt(dev, mr->mtt, - n, pages, i); + if (i == write_mtt_size) { + err = mthca_write_mtt(dev, mr->mtt, n, pages, i); if (err) goto mtt_done; n += i; -- MST From eitan at mellanox.co.il Wed Dec 13 07:52:50 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 13 Dec 2006 17:52:50 +0200 Subject: [openib-general] non-transparent integration with SDP in ofed In-Reply-To: <457FF229.9020306@ncic.ac.cn> References: <457FF229.9020306@ncic.ac.cn> Message-ID: <458021D2.1070205@mellanox.co.il> Hi A change in SDP was done between IBGD and OFED: You only need to replace the address family to AF_INET_SDP when opening the socket. You should not use AF_INET_SDP when providing a struct sockaddr_in. These should stay AF_INET. Hope this helps. EZ yangdong wrote: > Hello all: > > Some problems disturbed me. For non-transparent integration with SDP, no > special environment variable is necessary. It is only required to > recompile the application replacing AF_INET with AF_INET_SDP. The > constant AF_INET_SDP is defined in sdp_inet.h. > > When i want to make use of non-transparent integration with SDP in > kernel, i can replace PF_INET with PF_INET_SDP and AF_INET with > AF_INET_SDP, then recompile the kernel module. > > e.g. 
sock_create_kern (PF_INET_SDP, SOCK_STREAM, IPPROTO_TCP, &new_sock); > sin.sin_family = AF_INET; > > That did well when i use IBGD, but it can't work with OFED. I want to > know what should I do. > > > > Some info : > ./ibstat > CA 'mthca0' > CA type: MT23108 > Number of ports: 2 > Firmware version: 3.3.2 > Hardware version: a1 > Node GUID: 0x0002c90200004c68 > System image GUID: 0x0002c90200004c6b > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 7 > LMC: 0 > SM lid: 33 > Capability mask: 0x00510a68 > Port GUID: 0x0002c90200004c69 > Port 2: > State: Down > Physical state: Polling > Rate: 2 > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x00510a68 > Port GUID: 0x0002c90200004c6a > > > > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Brian.Cain at ge.com Wed Dec 13 14:09:27 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Wed, 13 Dec 2006 17:09:27 -0500 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEF19@CINMLVEM11.e2k.ad.ge.com> There's gotta be a good way to let people know they're going down the wrong path on this one. Signed-off-by: Brian Cain --- ofed/openib/scripts/install.sh 2006-12-13 14:48:51.747995000 -0700 +++ ofed_fix/openib/scripts/install.sh 2006-12-13 14:59:00.586574000 -0700 @@ -1070,6 +1070,14 @@ echo "# Load SDP module" >> ${IB_CONF_DIR}/openib.conf echo "# SDP_LOAD=no" >> ${IB_CONF_DIR}/openib.conf fi + + + if [[ "$srp" == "y" || "$srp_target" == "y" ]] && + [[ $(egrep 'flags.*lm' /proc/cpuinfo | wc -l) > 0 ]] && + [[ $(uname -p | egrep 'i[3-9]86' | wc -l) > 0 ]]; then + echo '!!WARNING!! 
SRP is not supported for 32-bit OS running on 64-bit capable hardware' + fi + if [ "$srp" == "y" ]; then echo >> ${IB_CONF_DIR}/openib.conf -- -Brian From jlentini at netapp.com Wed Dec 13 14:09:45 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 13 Dec 2006 17:09:45 -0500 (EST) Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: <457F426B.7020104@mellanox.com> References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> <457F426B.7020104@mellanox.com> Message-ID: On Tue, 12 Dec 2006, Vu Pham wrote: > > > 2. While some clients run I/Os, one idle client try to access the mount > > > point ie. *ls* and get I/O input error. I see these error messages on > > > server log Was there anything in the log before this point? I'd expect to see a message started with "svcrdma: failed to post SQ..." > > > Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22 > > > Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion > > > Dec 12 13:58:29 ibd202 kernel: ctxt=ffff810242130800, count=1 on > > > xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5 > > > ... > > > Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for unknown > > > QP 2e0408 From rdreier at cisco.com Wed Dec 13 14:21:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 14:21:29 -0800 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEF19@CINMLVEM11.e2k.ad.ge.com> (Brian Cain's message of "Wed, 13 Dec 2006 17:09:27 -0500") References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEF19@CINMLVEM11.e2k.ad.ge.com> Message-ID: > + echo '!!WARNING!! SRP is not supported for 32-bit OS running on 64-bit capable hardware' Did I miss something? Why doesn't SRP work with 32-bit userspace on a 64-bit capable hardware? In fact why doesn't it work with 32-bit userspace on a 64-bit kernel? - R. 
From rdreier at cisco.com Wed Dec 13 14:27:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 14:27:52 -0800 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: <20061213180916.GA1689@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 13 Dec 2006 20:09:16 +0200") References: <20061212151039.GJ26613@mellanox.co.il> <20061213180916.GA1689@mellanox.co.il> Message-ID: I was going to apply this, but then I realized that mthca is screwed up on non-cache-coherent CPUs with memfree HCAs, and this patch makes things much worse. The problem is that we allocate the MTT table with alloc_pages() and then do pci_map_sg(). But there are no pci_dma_sync_sg calls when the CPU tries to write directly to the MTT table, and in fact not even that would work: since a non-cache-coherent CPU can only work on cacheline-sized chunks, there's no safe way to touch the MTT table. What all that means is that FMRs are currently broken for memfree on non-coherent CPUs. And this patch would break all memory registration. I think the fix has to be to use dma_alloc_coherent() to allocate the pages for the MTT table (and any other table allocated in lowmem -- but I don't think there are any others). Unfortunately my PowerPC 440 system is being reworked right now so I can't test this for a few days. I think this still can go into 2.6.20 after -rc1 if we can get this fixed up. - R. From rdreier at cisco.com Wed Dec 13 14:29:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 14:29:45 -0800 Subject: [openib-general] [PATCH] mthca: move code from post send to post receive In-Reply-To: <20061213114916.GA23726@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 13 Dec 2006 13:49:16 +0200") References: <20061213114916.GA23726@mellanox.co.il> Message-ID: > While unlikely to give a large gain, this makes sense to me. Out of curiosity -- can you measure any difference at all with this? 
I would have guessed that the addition can be scheduled so that it costs nothing at all on any common CPU. I guess it doesn't hurt though. Want to make a similar patch for libmthca? - R. From rdreier at cisco.com Wed Dec 13 14:30:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 14:30:54 -0800 Subject: [openib-general] [GIT PULL] please pull infiniband.git In-Reply-To: <457FB82B.4090902@voltaire.com> (Or Gerlitz's message of "Wed, 13 Dec 2006 10:22:03 +0200") References: <457FB82B.4090902@voltaire.com> Message-ID: > you have CC-ed lkml at cisco.com on this email, is there a chance you > wanted to CC linux-kernel at vger.kernel.org instead ... Yep, a typo caused by my auto-expand not triggering. No big deal though... > May I ask what prevented the v3 of the mthca profile patch (see > http://article.gmane.org/gmane.linux.drivers.openib/34005) from getting in? The patch as posted is both ugly and wrong. I still plan to fix it up and merge it for 2.6.20, but I haven't had a chance yet. - R. From rdreier at cisco.com Wed Dec 13 14:32:28 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 14:32:28 -0800 Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver In-Reply-To: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com> ( keshetti mahesh's message of "Wed, 13 Dec 2006 06:55:13 +0000 (GMT)") References: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com> Message-ID: > But isn't it handled by the SMA in the host? I am a little bit confused now. > Please just tell me whether it is required to implement process_mad for, say, a new HCA driver, and if it is required, why? You can think of the process_mad() method as the interface from the SMA to the hardware. For example, when a Set of PortInfo occurs, the hardware has to know what the local LID is, etc. 
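The point made in the reply above, that process_mad() is the path by which the SMA's view of a port reaches the hardware, can be pictured with a small userspace toy model. This is only a sketch under stated assumptions: every toy_* name below is hypothetical and is not the kernel's ib_device, ib_mad, or process_mad() signature, and the "hardware" is just a struct field. Only the method codes (Get = 0x01, Set = 0x02) and the PortInfo attribute ID (0x0015) follow the InfiniBand management definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not the kernel's MAD structures. */
enum { TOY_METHOD_GET = 0x01, TOY_METHOD_SET = 0x02 };	/* IB method codes */
enum { TOY_ATTR_PORT_INFO = 0x0015 };			/* PortInfo attribute ID */
enum { TOY_MAD_IGNORED = 0, TOY_MAD_HANDLED = 1 };

struct toy_mad {
	uint8_t  method;
	uint16_t attr_id;
	uint16_t lid;	/* the one PortInfo field this sketch models */
};

struct toy_port {
	uint16_t lid;	/* stands in for port state owned by the HCA */
};

/*
 * Hypothetical driver hook: the SMA layer hands every incoming MAD to the
 * driver. A Set(PortInfo) updates the port state, a Get(PortInfo) reads it
 * back, and anything else is left for someone else to handle.
 */
static int toy_process_mad(struct toy_port *port,
			   const struct toy_mad *in, struct toy_mad *out)
{
	assert(port && in && out);

	if (in->attr_id != TOY_ATTR_PORT_INFO)
		return TOY_MAD_IGNORED;

	if (in->method == TOY_METHOD_SET)
		port->lid = in->lid;	/* a real driver would program the HCA here */
	else if (in->method != TOY_METHOD_GET)
		return TOY_MAD_IGNORED;

	/* Build the reply from the state the "hardware" now holds. */
	*out = *in;
	out->lid = port->lid;
	return TOY_MAD_HANDLED;
}
```

In this model a Set(PortInfo) carrying LID 7 leaves port.lid at 7, and a later Get(PortInfo) reports the same value. Without such a hook the SMA could answer the subnet manager, but the port itself would never learn its LID.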
From rdreier at cisco.com Wed Dec 13 14:37:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 14:37:06 -0800 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <457F080C.2090202@ichips.intel.com> (Sean Hefty's message of "Tue, 12 Dec 2006 11:50:36 -0800") References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <457F080C.2090202@ichips.intel.com> Message-ID: > I don't think so. The code followed the ucm, which is likely whatever > Libor had done. Did umad or uverbs follow this same format at some > point? In any case, this and the ucm could probably both be cleaned > up. I don't think umad/uverbs ever looked like that. You picked the wrong code to copy ;) Anyway I'll cook up a patch to clean it up at some point... From tom at opengridcomputing.com Wed Dec 13 14:38:18 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 13 Dec 2006 16:38:18 -0600 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> <457F426B.7020104@mellanox.com> Message-ID: <1166049498.10873.7.camel@trinity.ogc.int> 22 is EINVAL. I believe the only way to get this on an RDMA connection is when there is an error in the RPCRDMA header. The completing WR is just a flush that resulted from shutting the connection down. On Wed, 2006-12-13 at 17:09 -0500, James Lentini wrote: > > On Tue, 12 Dec 2006, Vu Pham wrote: > > > > > 2. While some clients run I/Os, one idle client try to access the mount > > > > point ie. *ls* and get I/O input error. I see these error messages on > > > > server log > > Was there anything in the log before this point? I'd expect to see a > message started with "svcrdma: failed to post SQ..." 
> > > > > Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22 > > > > Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion > > > > Dec 12 13:58:29 ibd202 kernel: ctxt=ffff810242130800, count=1 on > > > > xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5 > > > > ... > > > > Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for unknown > > > > QP 2e0408 From tom at opengridcomputing.com Wed Dec 13 14:40:50 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Wed, 13 Dec 2006 16:40:50 -0600 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: <457F426B.7020104@mellanox.com> References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> <457F426B.7020104@mellanox.com> Message-ID: <1166049650.10873.9.camel@trinity.ogc.int> Vu: [...snip...] > > > > 2. Can you please send me the iozone test parameters your using? > > > > server has 8GB of mem, client has 2GB of mem > > iozone -r 64KB -s 5g -i 0 -i 1 > and > iozone -r 64KB -s 2g -i 0 -i 1 -t 3 > Can you please send me the iozone output you get from these commands? 
Thanks, > thanks, > -vu > > > Thanks, > > Tom > >> thanks, > >> -vu > > > > plain text document attachment (.config) > # > # Automatically generated make config: don't edit > # Linux kernel version: 2.6.18.5 > # Tue Dec 12 10:08:55 2006 > # > CONFIG_X86_64=y > CONFIG_64BIT=y > CONFIG_X86=y > CONFIG_LOCKDEP_SUPPORT=y > CONFIG_STACKTRACE_SUPPORT=y > CONFIG_SEMAPHORE_SLEEPERS=y > CONFIG_MMU=y > CONFIG_RWSEM_GENERIC_SPINLOCK=y > CONFIG_GENERIC_HWEIGHT=y > CONFIG_GENERIC_CALIBRATE_DELAY=y > CONFIG_X86_CMPXCHG=y > CONFIG_EARLY_PRINTK=y > CONFIG_GENERIC_ISA_DMA=y > CONFIG_GENERIC_IOMAP=y > CONFIG_ARCH_MAY_HAVE_PC_FDC=y > CONFIG_DMI=y > CONFIG_AUDIT_ARCH=y > CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" > > # > # Code maturity level options > # > CONFIG_EXPERIMENTAL=y > CONFIG_LOCK_KERNEL=y > CONFIG_INIT_ENV_ARG_LIMIT=32 > > # > # General setup > # > CONFIG_LOCALVERSION="" > CONFIG_LOCALVERSION_AUTO=y > CONFIG_SWAP=y > CONFIG_SYSVIPC=y > CONFIG_POSIX_MQUEUE=y > CONFIG_BSD_PROCESS_ACCT=y > # CONFIG_BSD_PROCESS_ACCT_V3 is not set > # CONFIG_TASKSTATS is not set > CONFIG_AUDIT=y > CONFIG_AUDITSYSCALL=y > # CONFIG_IKCONFIG is not set > # CONFIG_CPUSETS is not set > # CONFIG_RELAY is not set > CONFIG_INITRAMFS_SOURCE="" > CONFIG_CC_OPTIMIZE_FOR_SIZE=y > # CONFIG_EMBEDDED is not set > CONFIG_UID16=y > CONFIG_SYSCTL=y > CONFIG_KALLSYMS=y > # CONFIG_KALLSYMS_ALL is not set > CONFIG_KALLSYMS_EXTRA_PASS=y > CONFIG_HOTPLUG=y > CONFIG_PRINTK=y > CONFIG_BUG=y > CONFIG_ELF_CORE=y > CONFIG_BASE_FULL=y > CONFIG_FUTEX=y > CONFIG_EPOLL=y > CONFIG_SHMEM=y > CONFIG_SLAB=y > CONFIG_VM_EVENT_COUNTERS=y > CONFIG_RT_MUTEXES=y > # CONFIG_TINY_SHMEM is not set > CONFIG_BASE_SMALL=0 > # CONFIG_SLOB is not set > > # > # Loadable module support > # > CONFIG_MODULES=y > CONFIG_MODULE_UNLOAD=y > # CONFIG_MODULE_FORCE_UNLOAD is not set > CONFIG_MODVERSIONS=y > # CONFIG_MODULE_SRCVERSION_ALL is not set > CONFIG_KMOD=y > CONFIG_STOP_MACHINE=y > > # > # Block layer > # > CONFIG_LBD=y > # 
CONFIG_BLK_DEV_IO_TRACE is not set > CONFIG_LSF=y > > # > # IO Schedulers > # > CONFIG_IOSCHED_NOOP=y > CONFIG_IOSCHED_AS=y > CONFIG_IOSCHED_DEADLINE=y > CONFIG_IOSCHED_CFQ=y > # CONFIG_DEFAULT_AS is not set > CONFIG_DEFAULT_DEADLINE=y > # CONFIG_DEFAULT_CFQ is not set > # CONFIG_DEFAULT_NOOP is not set > CONFIG_DEFAULT_IOSCHED="deadline" > > # > # Processor type and features > # > CONFIG_X86_PC=y > # CONFIG_X86_VSMP is not set > # CONFIG_MK8 is not set > # CONFIG_MPSC is not set > CONFIG_GENERIC_CPU=y > CONFIG_X86_L1_CACHE_BYTES=128 > CONFIG_X86_L1_CACHE_SHIFT=7 > CONFIG_X86_INTERNODE_CACHE_BYTES=128 > CONFIG_X86_TSC=y > CONFIG_X86_GOOD_APIC=y > CONFIG_MICROCODE=m > CONFIG_X86_MSR=y > CONFIG_X86_CPUID=y > CONFIG_X86_HT=y > CONFIG_X86_IO_APIC=y > CONFIG_X86_LOCAL_APIC=y > CONFIG_MTRR=y > CONFIG_SMP=y > CONFIG_SCHED_SMT=y > CONFIG_SCHED_MC=y > CONFIG_PREEMPT_NONE=y > # CONFIG_PREEMPT_VOLUNTARY is not set > # CONFIG_PREEMPT is not set > CONFIG_PREEMPT_BKL=y > CONFIG_NUMA=y > CONFIG_K8_NUMA=y > CONFIG_NODES_SHIFT=6 > CONFIG_X86_64_ACPI_NUMA=y > # CONFIG_NUMA_EMU is not set > CONFIG_ARCH_DISCONTIGMEM_ENABLE=y > CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y > CONFIG_ARCH_SPARSEMEM_ENABLE=y > CONFIG_SELECT_MEMORY_MODEL=y > # CONFIG_FLATMEM_MANUAL is not set > CONFIG_DISCONTIGMEM_MANUAL=y > # CONFIG_SPARSEMEM_MANUAL is not set > CONFIG_DISCONTIGMEM=y > CONFIG_FLAT_NODE_MEM_MAP=y > CONFIG_NEED_MULTIPLE_NODES=y > # CONFIG_SPARSEMEM_STATIC is not set > CONFIG_SPLIT_PTLOCK_CPUS=4 > CONFIG_MIGRATION=y > CONFIG_RESOURCES_64BIT=y > CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y > CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y > CONFIG_NR_CPUS=8 > # CONFIG_HOTPLUG_CPU is not set > CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y > CONFIG_HPET_TIMER=y > CONFIG_HPET_EMULATE_RTC=y > CONFIG_IOMMU=y > CONFIG_CALGARY_IOMMU=y > CONFIG_SWIOTLB=y > CONFIG_X86_MCE=y > CONFIG_X86_MCE_INTEL=y > CONFIG_X86_MCE_AMD=y > # CONFIG_KEXEC is not set > # CONFIG_CRASH_DUMP is not set > CONFIG_PHYSICAL_START=0x200000 > CONFIG_SECCOMP=y > # 
CONFIG_HZ_100 is not set > CONFIG_HZ_250=y > # CONFIG_HZ_1000 is not set > CONFIG_HZ=250 > # CONFIG_REORDER is not set > CONFIG_K8_NB=y > CONFIG_GENERIC_HARDIRQS=y > CONFIG_GENERIC_IRQ_PROBE=y > CONFIG_ISA_DMA_API=y > CONFIG_GENERIC_PENDING_IRQ=y > > # > # Power management options > # > CONFIG_PM=y > CONFIG_PM_LEGACY=y > # CONFIG_PM_DEBUG is not set > > # > # ACPI (Advanced Configuration and Power Interface) Support > # > CONFIG_ACPI=y > CONFIG_ACPI_AC=m > CONFIG_ACPI_BATTERY=m > CONFIG_ACPI_BUTTON=m > CONFIG_ACPI_VIDEO=y > # CONFIG_ACPI_HOTKEY is not set > CONFIG_ACPI_FAN=y > # CONFIG_ACPI_DOCK is not set > CONFIG_ACPI_PROCESSOR=y > CONFIG_ACPI_THERMAL=y > CONFIG_ACPI_NUMA=y > CONFIG_ACPI_ASUS=m > # CONFIG_ACPI_IBM is not set > CONFIG_ACPI_TOSHIBA=m > CONFIG_ACPI_BLACKLIST_YEAR=0 > # CONFIG_ACPI_DEBUG is not set > CONFIG_ACPI_EC=y > CONFIG_ACPI_POWER=y > CONFIG_ACPI_SYSTEM=y > CONFIG_X86_PM_TIMER=y > # CONFIG_ACPI_CONTAINER is not set > # CONFIG_ACPI_SBS is not set > > # > # CPU Frequency scaling > # > CONFIG_CPU_FREQ=y > CONFIG_CPU_FREQ_TABLE=y > # CONFIG_CPU_FREQ_DEBUG is not set > CONFIG_CPU_FREQ_STAT=y > # CONFIG_CPU_FREQ_STAT_DETAILS is not set > # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set > CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y > CONFIG_CPU_FREQ_GOV_PERFORMANCE=y > CONFIG_CPU_FREQ_GOV_POWERSAVE=m > CONFIG_CPU_FREQ_GOV_USERSPACE=y > CONFIG_CPU_FREQ_GOV_ONDEMAND=m > # CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set > > # > # CPUFreq processor drivers > # > CONFIG_X86_POWERNOW_K8=y > CONFIG_X86_POWERNOW_K8_ACPI=y > CONFIG_X86_SPEEDSTEP_CENTRINO=y > CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y > CONFIG_X86_ACPI_CPUFREQ=y > > # > # shared options > # > # CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set > # CONFIG_X86_SPEEDSTEP_LIB is not set > > # > # Bus options (PCI etc.) 
> # > CONFIG_PCI=y > CONFIG_PCI_DIRECT=y > CONFIG_PCI_MMCONFIG=y > # CONFIG_PCIEPORTBUS is not set > CONFIG_PCI_MSI=y > # CONFIG_PCI_DEBUG is not set > > # > # PCCARD (PCMCIA/CardBus) support > # > # CONFIG_PCCARD is not set > > # > # PCI Hotplug Support > # > CONFIG_HOTPLUG_PCI=y > # CONFIG_HOTPLUG_PCI_FAKE is not set > CONFIG_HOTPLUG_PCI_ACPI=m > CONFIG_HOTPLUG_PCI_ACPI_IBM=m > # CONFIG_HOTPLUG_PCI_CPCI is not set > CONFIG_HOTPLUG_PCI_SHPC=m > # CONFIG_HOTPLUG_PCI_SHPC_POLL_EVENT_MODE is not set > > # > # Executable file formats / Emulations > # > CONFIG_BINFMT_ELF=y > CONFIG_BINFMT_MISC=y > CONFIG_IA32_EMULATION=y > # CONFIG_IA32_AOUT is not set > CONFIG_COMPAT=y > CONFIG_SYSVIPC_COMPAT=y > > # > # Networking > # > CONFIG_NET=y > > # > # Networking options > # > # CONFIG_NETDEBUG is not set > CONFIG_PACKET=y > CONFIG_PACKET_MMAP=y > CONFIG_UNIX=y > CONFIG_XFRM=y > CONFIG_XFRM_USER=y > CONFIG_NET_KEY=m > CONFIG_INET=y > CONFIG_IP_MULTICAST=y > CONFIG_IP_ADVANCED_ROUTER=y > CONFIG_ASK_IP_FIB_HASH=y > # CONFIG_IP_FIB_TRIE is not set > CONFIG_IP_FIB_HASH=y > CONFIG_IP_MULTIPLE_TABLES=y > CONFIG_IP_ROUTE_FWMARK=y > CONFIG_IP_ROUTE_MULTIPATH=y > # CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set > CONFIG_IP_ROUTE_VERBOSE=y > # CONFIG_IP_PNP is not set > CONFIG_NET_IPIP=m > CONFIG_NET_IPGRE=m > CONFIG_NET_IPGRE_BROADCAST=y > CONFIG_IP_MROUTE=y > CONFIG_IP_PIMSM_V1=y > CONFIG_IP_PIMSM_V2=y > # CONFIG_ARPD is not set > CONFIG_SYN_COOKIES=y > CONFIG_INET_AH=m > CONFIG_INET_ESP=m > CONFIG_INET_IPCOMP=m > CONFIG_INET_XFRM_TUNNEL=m > CONFIG_INET_TUNNEL=m > CONFIG_INET_XFRM_MODE_TRANSPORT=y > CONFIG_INET_XFRM_MODE_TUNNEL=y > CONFIG_INET_DIAG=y > CONFIG_INET_TCP_DIAG=y > # CONFIG_TCP_CONG_ADVANCED is not set > CONFIG_TCP_CONG_BIC=y > > # > # IP: Virtual Server Configuration > # > CONFIG_IP_VS=m > # CONFIG_IP_VS_DEBUG is not set > CONFIG_IP_VS_TAB_BITS=12 > > # > # IPVS transport protocol load balancing support > # > CONFIG_IP_VS_PROTO_TCP=y > CONFIG_IP_VS_PROTO_UDP=y > 
CONFIG_IP_VS_PROTO_ESP=y > CONFIG_IP_VS_PROTO_AH=y > > # > # IPVS scheduler > # > CONFIG_IP_VS_RR=m > CONFIG_IP_VS_WRR=m > CONFIG_IP_VS_LC=m > CONFIG_IP_VS_WLC=m > CONFIG_IP_VS_LBLC=m > CONFIG_IP_VS_LBLCR=m > CONFIG_IP_VS_DH=m > CONFIG_IP_VS_SH=m > CONFIG_IP_VS_SED=m > CONFIG_IP_VS_NQ=m > > # > # IPVS application helper > # > CONFIG_IP_VS_FTP=m > CONFIG_IPV6=m > CONFIG_IPV6_PRIVACY=y > # CONFIG_IPV6_ROUTER_PREF is not set > CONFIG_INET6_AH=m > CONFIG_INET6_ESP=m > CONFIG_INET6_IPCOMP=m > CONFIG_INET6_XFRM_TUNNEL=m > CONFIG_INET6_TUNNEL=m > CONFIG_INET6_XFRM_MODE_TRANSPORT=m > CONFIG_INET6_XFRM_MODE_TUNNEL=m > CONFIG_IPV6_TUNNEL=m > CONFIG_NETWORK_SECMARK=y > CONFIG_NETFILTER=y > # CONFIG_NETFILTER_DEBUG is not set > CONFIG_BRIDGE_NETFILTER=y > > # > # Core Netfilter Configuration > # > # CONFIG_NETFILTER_NETLINK is not set > # CONFIG_NETFILTER_XTABLES is not set > > # > # IP: Netfilter Configuration > # > CONFIG_IP_NF_CONNTRACK=m > CONFIG_IP_NF_CT_ACCT=y > # CONFIG_IP_NF_CONNTRACK_MARK is not set > # CONFIG_IP_NF_CONNTRACK_SECMARK is not set > # CONFIG_IP_NF_CONNTRACK_EVENTS is not set > CONFIG_IP_NF_CT_PROTO_SCTP=m > CONFIG_IP_NF_FTP=m > CONFIG_IP_NF_IRC=m > # CONFIG_IP_NF_NETBIOS_NS is not set > CONFIG_IP_NF_TFTP=m > CONFIG_IP_NF_AMANDA=m > # CONFIG_IP_NF_PPTP is not set > # CONFIG_IP_NF_H323 is not set > # CONFIG_IP_NF_SIP is not set > CONFIG_IP_NF_QUEUE=m > > # > # IPv6: Netfilter Configuration (EXPERIMENTAL) > # > # CONFIG_IP6_NF_QUEUE is not set > > # > # Bridge: Netfilter Configuration > # > CONFIG_BRIDGE_NF_EBTABLES=m > CONFIG_BRIDGE_EBT_BROUTE=m > CONFIG_BRIDGE_EBT_T_FILTER=m > CONFIG_BRIDGE_EBT_T_NAT=m > CONFIG_BRIDGE_EBT_802_3=m > CONFIG_BRIDGE_EBT_AMONG=m > CONFIG_BRIDGE_EBT_ARP=m > CONFIG_BRIDGE_EBT_IP=m > CONFIG_BRIDGE_EBT_LIMIT=m > CONFIG_BRIDGE_EBT_MARK=m > CONFIG_BRIDGE_EBT_PKTTYPE=m > CONFIG_BRIDGE_EBT_STP=m > CONFIG_BRIDGE_EBT_VLAN=m > CONFIG_BRIDGE_EBT_ARPREPLY=m > CONFIG_BRIDGE_EBT_DNAT=m > CONFIG_BRIDGE_EBT_MARK_T=m > 
CONFIG_BRIDGE_EBT_REDIRECT=m > CONFIG_BRIDGE_EBT_SNAT=m > CONFIG_BRIDGE_EBT_LOG=m > # CONFIG_BRIDGE_EBT_ULOG is not set > > # > # DCCP Configuration (EXPERIMENTAL) > # > # CONFIG_IP_DCCP is not set > > # > # SCTP Configuration (EXPERIMENTAL) > # > CONFIG_IP_SCTP=m > # CONFIG_SCTP_DBG_MSG is not set > # CONFIG_SCTP_DBG_OBJCNT is not set > # CONFIG_SCTP_HMAC_NONE is not set > # CONFIG_SCTP_HMAC_SHA1 is not set > CONFIG_SCTP_HMAC_MD5=y > > # > # TIPC Configuration (EXPERIMENTAL) > # > # CONFIG_TIPC is not set > CONFIG_ATM=m > CONFIG_ATM_CLIP=m > # CONFIG_ATM_CLIP_NO_ICMP is not set > CONFIG_ATM_LANE=m > # CONFIG_ATM_MPOA is not set > CONFIG_ATM_BR2684=m > # CONFIG_ATM_BR2684_IPFILTER is not set > CONFIG_BRIDGE=m > CONFIG_VLAN_8021Q=m > # CONFIG_DECNET is not set > CONFIG_LLC=y > # CONFIG_LLC2 is not set > # CONFIG_IPX is not set > # CONFIG_ATALK is not set > # CONFIG_X25 is not set > # CONFIG_LAPB is not set > # CONFIG_ECONET is not set > # CONFIG_WAN_ROUTER is not set > > # > # QoS and/or fair queueing > # > CONFIG_NET_SCHED=y > CONFIG_NET_SCH_CLK_JIFFIES=y > # CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set > # CONFIG_NET_SCH_CLK_CPU is not set > > # > # Queueing/Scheduling > # > CONFIG_NET_SCH_CBQ=m > CONFIG_NET_SCH_HTB=m > CONFIG_NET_SCH_HFSC=m > CONFIG_NET_SCH_ATM=m > CONFIG_NET_SCH_PRIO=m > CONFIG_NET_SCH_RED=m > CONFIG_NET_SCH_SFQ=m > CONFIG_NET_SCH_TEQL=m > CONFIG_NET_SCH_TBF=m > CONFIG_NET_SCH_GRED=m > CONFIG_NET_SCH_DSMARK=m > CONFIG_NET_SCH_NETEM=m > CONFIG_NET_SCH_INGRESS=m > > # > # Classification > # > CONFIG_NET_CLS=y > # CONFIG_NET_CLS_BASIC is not set > CONFIG_NET_CLS_TCINDEX=m > CONFIG_NET_CLS_ROUTE4=m > CONFIG_NET_CLS_ROUTE=y > CONFIG_NET_CLS_FW=m > CONFIG_NET_CLS_U32=m > CONFIG_CLS_U32_PERF=y > # CONFIG_CLS_U32_MARK is not set > CONFIG_NET_CLS_RSVP=m > CONFIG_NET_CLS_RSVP6=m > # CONFIG_NET_EMATCH is not set > # CONFIG_NET_CLS_ACT is not set > CONFIG_NET_CLS_POLICE=y > CONFIG_NET_CLS_IND=y > CONFIG_NET_ESTIMATOR=y > > # > # Network testing > # > # 
CONFIG_NET_PKTGEN is not set > # CONFIG_NET_TCPPROBE is not set > # CONFIG_HAMRADIO is not set > # CONFIG_IRDA is not set > CONFIG_BT=m > CONFIG_BT_L2CAP=m > CONFIG_BT_SCO=m > CONFIG_BT_RFCOMM=m > CONFIG_BT_RFCOMM_TTY=y > CONFIG_BT_BNEP=m > CONFIG_BT_BNEP_MC_FILTER=y > CONFIG_BT_BNEP_PROTO_FILTER=y > CONFIG_BT_CMTP=m > CONFIG_BT_HIDP=m > > # > # Bluetooth device drivers > # > CONFIG_BT_HCIUSB=m > CONFIG_BT_HCIUSB_SCO=y > CONFIG_BT_HCIUART=m > CONFIG_BT_HCIUART_H4=y > CONFIG_BT_HCIUART_BCSP=y > CONFIG_BT_HCIBCM203X=m > # CONFIG_BT_HCIBPA10X is not set > CONFIG_BT_HCIBFUSB=m > CONFIG_BT_HCIVHCI=m > CONFIG_IEEE80211=m > # CONFIG_IEEE80211_DEBUG is not set > # CONFIG_IEEE80211_CRYPT_WEP is not set > # CONFIG_IEEE80211_CRYPT_CCMP is not set > CONFIG_IEEE80211_CRYPT_TKIP=m > # CONFIG_IEEE80211_SOFTMAC is not set > CONFIG_WIRELESS_EXT=y > > # > # Device Drivers > # > > # > # Generic Driver Options > # > CONFIG_STANDALONE=y > CONFIG_PREVENT_FIRMWARE_BUILD=y > CONFIG_FW_LOADER=y > # CONFIG_DEBUG_DRIVER is not set > # CONFIG_SYS_HYPERVISOR is not set > > # > # Connector - unified userspace <-> kernelspace linker > # > # CONFIG_CONNECTOR is not set > > # > # Memory Technology Devices (MTD) > # > CONFIG_MTD=m > # CONFIG_MTD_DEBUG is not set > CONFIG_MTD_CONCAT=m > CONFIG_MTD_PARTITIONS=y > CONFIG_MTD_REDBOOT_PARTS=m > CONFIG_MTD_REDBOOT_DIRECTORY_BLOCK=-1 > # CONFIG_MTD_REDBOOT_PARTS_UNALLOCATED is not set > # CONFIG_MTD_REDBOOT_PARTS_READONLY is not set > CONFIG_MTD_CMDLINE_PARTS=y > > # > # User Modules And Translation Layers > # > CONFIG_MTD_CHAR=m > CONFIG_MTD_BLOCK=m > CONFIG_MTD_BLOCK_RO=m > CONFIG_FTL=m > CONFIG_NFTL=m > CONFIG_NFTL_RW=y > # CONFIG_INFTL is not set > # CONFIG_RFD_FTL is not set > > # > # RAM/ROM/Flash chip drivers > # > CONFIG_MTD_CFI=m > CONFIG_MTD_JEDECPROBE=m > CONFIG_MTD_GEN_PROBE=m > # CONFIG_MTD_CFI_ADV_OPTIONS is not set > CONFIG_MTD_MAP_BANK_WIDTH_1=y > CONFIG_MTD_MAP_BANK_WIDTH_2=y > CONFIG_MTD_MAP_BANK_WIDTH_4=y > # CONFIG_MTD_MAP_BANK_WIDTH_8 
is not set > # CONFIG_MTD_MAP_BANK_WIDTH_16 is not set > # CONFIG_MTD_MAP_BANK_WIDTH_32 is not set > CONFIG_MTD_CFI_I1=y > CONFIG_MTD_CFI_I2=y > # CONFIG_MTD_CFI_I4 is not set > # CONFIG_MTD_CFI_I8 is not set > CONFIG_MTD_CFI_INTELEXT=m > CONFIG_MTD_CFI_AMDSTD=m > CONFIG_MTD_CFI_STAA=m > CONFIG_MTD_CFI_UTIL=m > CONFIG_MTD_RAM=m > CONFIG_MTD_ROM=m > CONFIG_MTD_ABSENT=m > # CONFIG_MTD_OBSOLETE_CHIPS is not set > > # > # Mapping drivers for chip access > # > CONFIG_MTD_COMPLEX_MAPPINGS=y > # CONFIG_MTD_PHYSMAP is not set > # CONFIG_MTD_PNC2000 is not set > CONFIG_MTD_SC520CDP=m > CONFIG_MTD_NETSC520=m > # CONFIG_MTD_TS5500 is not set > # CONFIG_MTD_SBC_GXX is not set > # CONFIG_MTD_AMD76XROM is not set > CONFIG_MTD_ICHXROM=m > # CONFIG_MTD_SCB2_FLASH is not set > # CONFIG_MTD_NETtel is not set > # CONFIG_MTD_DILNETPC is not set > # CONFIG_MTD_L440GX is not set > # CONFIG_MTD_PCI is not set > # CONFIG_MTD_PLATRAM is not set > > # > # Self-contained MTD device drivers > # > # CONFIG_MTD_PMC551 is not set > # CONFIG_MTD_SLRAM is not set > # CONFIG_MTD_PHRAM is not set > CONFIG_MTD_MTDRAM=m > CONFIG_MTDRAM_TOTAL_SIZE=4096 > CONFIG_MTDRAM_ERASE_SIZE=128 > # CONFIG_MTD_BLOCK2MTD is not set > > # > # Disk-On-Chip Device Drivers > # > # CONFIG_MTD_DOC2000 is not set > # CONFIG_MTD_DOC2001 is not set > # CONFIG_MTD_DOC2001PLUS is not set > > # > # NAND Flash Device Drivers > # > CONFIG_MTD_NAND=m > # CONFIG_MTD_NAND_VERIFY_WRITE is not set > # CONFIG_MTD_NAND_ECC_SMC is not set > CONFIG_MTD_NAND_IDS=m > # CONFIG_MTD_NAND_DISKONCHIP is not set > # CONFIG_MTD_NAND_NANDSIM is not set > > # > # OneNAND Flash Device Drivers > # > # CONFIG_MTD_ONENAND is not set > > # > # Parallel port support > # > CONFIG_PARPORT=m > CONFIG_PARPORT_PC=m > CONFIG_PARPORT_SERIAL=m > # CONFIG_PARPORT_PC_FIFO is not set > # CONFIG_PARPORT_PC_SUPERIO is not set > CONFIG_PARPORT_NOT_PC=y > # CONFIG_PARPORT_GSC is not set > # CONFIG_PARPORT_AX88796 is not set > CONFIG_PARPORT_1284=y > > # > # Plug and 
Play support
> #
> # CONFIG_PNP is not set
>
> #
> # Block devices
> #
> CONFIG_BLK_DEV_FD=m
> # CONFIG_PARIDE is not set
> CONFIG_BLK_CPQ_DA=m
> CONFIG_BLK_CPQ_CISS_DA=m
> CONFIG_CISS_SCSI_TAPE=y
> CONFIG_BLK_DEV_DAC960=m
> # CONFIG_BLK_DEV_UMEM is not set
> # CONFIG_BLK_DEV_COW_COMMON is not set
> CONFIG_BLK_DEV_LOOP=m
> CONFIG_BLK_DEV_CRYPTOLOOP=m
> CONFIG_BLK_DEV_NBD=m
> CONFIG_BLK_DEV_SX8=m
> # CONFIG_BLK_DEV_UB is not set
> CONFIG_BLK_DEV_RAM=y
> CONFIG_BLK_DEV_RAM_COUNT=16
> CONFIG_BLK_DEV_RAM_SIZE=16384
> CONFIG_BLK_DEV_RAM_BLOCKSIZE=1024
> CONFIG_BLK_DEV_INITRD=y
> # CONFIG_CDROM_PKTCDVD is not set
> # CONFIG_ATA_OVER_ETH is not set
>
> #
> # ATA/ATAPI/MFM/RLL support
> #
> CONFIG_IDE=y
> CONFIG_BLK_DEV_IDE=y
>
> #
> # Please see Documentation/ide.txt for help/info on IDE drives
> #
> # CONFIG_BLK_DEV_IDE_SATA is not set
> # CONFIG_BLK_DEV_HD_IDE is not set
> CONFIG_BLK_DEV_IDEDISK=y
> CONFIG_IDEDISK_MULTI_MODE=y
> CONFIG_BLK_DEV_IDECD=y
> # CONFIG_BLK_DEV_IDETAPE is not set
> CONFIG_BLK_DEV_IDEFLOPPY=y
> CONFIG_BLK_DEV_IDESCSI=m
> # CONFIG_IDE_TASK_IOCTL is not set
>
> #
> # IDE chipset support/bugfixes
> #
> CONFIG_IDE_GENERIC=y
> # CONFIG_BLK_DEV_CMD640 is not set
> CONFIG_BLK_DEV_IDEPCI=y
> CONFIG_IDEPCI_SHARE_IRQ=y
> # CONFIG_BLK_DEV_OFFBOARD is not set
> CONFIG_BLK_DEV_GENERIC=y
> # CONFIG_BLK_DEV_OPTI621 is not set
> CONFIG_BLK_DEV_RZ1000=y
> CONFIG_BLK_DEV_IDEDMA_PCI=y
> # CONFIG_BLK_DEV_IDEDMA_FORCED is not set
> CONFIG_IDEDMA_PCI_AUTO=y
> # CONFIG_IDEDMA_ONLYDISK is not set
> CONFIG_BLK_DEV_AEC62XX=y
> CONFIG_BLK_DEV_ALI15X3=y
> # CONFIG_WDC_ALI15X3 is not set
> CONFIG_BLK_DEV_AMD74XX=y
> CONFIG_BLK_DEV_ATIIXP=y
> CONFIG_BLK_DEV_CMD64X=y
> CONFIG_BLK_DEV_TRIFLEX=y
> CONFIG_BLK_DEV_CY82C693=y
> CONFIG_BLK_DEV_CS5520=y
> CONFIG_BLK_DEV_CS5530=y
> CONFIG_BLK_DEV_HPT34X=y
> # CONFIG_HPT34X_AUTODMA is not set
> CONFIG_BLK_DEV_HPT366=y
> # CONFIG_BLK_DEV_SC1200 is not set
> CONFIG_BLK_DEV_PIIX=y
> # CONFIG_BLK_DEV_IT821X is not set
> # CONFIG_BLK_DEV_NS87415 is not set
> CONFIG_BLK_DEV_PDC202XX_OLD=y
> # CONFIG_PDC202XX_BURST is not set
> CONFIG_BLK_DEV_PDC202XX_NEW=y
> CONFIG_BLK_DEV_SVWKS=y
> CONFIG_BLK_DEV_SIIMAGE=y
> CONFIG_BLK_DEV_SIS5513=y
> CONFIG_BLK_DEV_SLC90E66=y
> # CONFIG_BLK_DEV_TRM290 is not set
> CONFIG_BLK_DEV_VIA82CXXX=y
> # CONFIG_IDE_ARM is not set
> CONFIG_BLK_DEV_IDEDMA=y
> # CONFIG_IDEDMA_IVB is not set
> CONFIG_IDEDMA_AUTO=y
> # CONFIG_BLK_DEV_HD is not set
>
> #
> # SCSI device support
> #
> # CONFIG_RAID_ATTRS is not set
> CONFIG_SCSI=m
> CONFIG_SCSI_PROC_FS=y
>
> #
> # SCSI support type (disk, tape, CD-ROM)
> #
> CONFIG_BLK_DEV_SD=m
> CONFIG_CHR_DEV_ST=m
> CONFIG_CHR_DEV_OSST=m
> CONFIG_BLK_DEV_SR=m
> CONFIG_BLK_DEV_SR_VENDOR=y
> CONFIG_CHR_DEV_SG=m
> # CONFIG_CHR_DEV_SCH is not set
>
> #
> # Some SCSI devices (e.g. CD jukebox) support multiple LUNs
> #
> # CONFIG_SCSI_MULTI_LUN is not set
> CONFIG_SCSI_CONSTANTS=y
> CONFIG_SCSI_LOGGING=y
>
> #
> # SCSI Transport Attributes
> #
> CONFIG_SCSI_SPI_ATTRS=m
> CONFIG_SCSI_FC_ATTRS=m
> CONFIG_SCSI_ISCSI_ATTRS=m
> CONFIG_SCSI_SAS_ATTRS=m
>
> #
> # SCSI low-level drivers
> #
> # CONFIG_ISCSI_TCP is not set
> CONFIG_BLK_DEV_3W_XXXX_RAID=m
> CONFIG_SCSI_3W_9XXX=m
> CONFIG_SCSI_ACARD=m
> CONFIG_SCSI_AACRAID=m
> CONFIG_SCSI_AIC7XXX=m
> CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
> CONFIG_AIC7XXX_RESET_DELAY_MS=15000
> # CONFIG_AIC7XXX_DEBUG_ENABLE is not set
> CONFIG_AIC7XXX_DEBUG_MASK=0
> # CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
> CONFIG_SCSI_AIC7XXX_OLD=m
> CONFIG_SCSI_AIC79XX=m
> CONFIG_AIC79XX_CMDS_PER_DEVICE=4
> CONFIG_AIC79XX_RESET_DELAY_MS=15000
> # CONFIG_AIC79XX_ENABLE_RD_STRM is not set
> # CONFIG_AIC79XX_DEBUG_ENABLE is not set
> CONFIG_AIC79XX_DEBUG_MASK=0
> # CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
> CONFIG_MEGARAID_NEWGEN=y
> CONFIG_MEGARAID_MM=m
> CONFIG_MEGARAID_MAILBOX=m
> # CONFIG_MEGARAID_LEGACY is not set
> CONFIG_MEGARAID_SAS=m
> CONFIG_SCSI_SATA=m
> CONFIG_SCSI_SATA_AHCI=m
> CONFIG_SCSI_SATA_SVW=m
> CONFIG_SCSI_ATA_PIIX=m
> # CONFIG_SCSI_SATA_MV is not set
> CONFIG_SCSI_SATA_NV=m
> # CONFIG_SCSI_PDC_ADMA is not set
> # CONFIG_SCSI_HPTIOP is not set
> # CONFIG_SCSI_SATA_QSTOR is not set
> CONFIG_SCSI_SATA_PROMISE=m
> CONFIG_SCSI_SATA_SX4=m
> CONFIG_SCSI_SATA_SIL=m
> # CONFIG_SCSI_SATA_SIL24 is not set
> CONFIG_SCSI_SATA_SIS=m
> # CONFIG_SCSI_SATA_ULI is not set
> CONFIG_SCSI_SATA_VIA=m
> CONFIG_SCSI_SATA_VITESSE=m
> CONFIG_SCSI_SATA_INTEL_COMBINED=y
> # CONFIG_SCSI_BUSLOGIC is not set
> # CONFIG_SCSI_DMX3191D is not set
> # CONFIG_SCSI_EATA is not set
> # CONFIG_SCSI_FUTURE_DOMAIN is not set
> CONFIG_SCSI_GDTH=m
> CONFIG_SCSI_IPS=m
> CONFIG_SCSI_INITIO=m
> # CONFIG_SCSI_INIA100 is not set
> CONFIG_SCSI_PPA=m
> CONFIG_SCSI_IMM=m
> # CONFIG_SCSI_IZIP_EPP16 is not set
> # CONFIG_SCSI_IZIP_SLOW_CTR is not set
> CONFIG_SCSI_SYM53C8XX_2=m
> CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
> CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
> CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
> CONFIG_SCSI_SYM53C8XX_MMIO=y
> # CONFIG_SCSI_IPR is not set
> CONFIG_SCSI_QLOGIC_1280=m
> # CONFIG_SCSI_QLA_FC is not set
> CONFIG_SCSI_LPFC=m
> # CONFIG_SCSI_DC395x is not set
> # CONFIG_SCSI_DC390T is not set
> # CONFIG_SCSI_DEBUG is not set
>
> #
> # Multi-device support (RAID and LVM)
> #
> CONFIG_MD=y
> CONFIG_BLK_DEV_MD=y
> CONFIG_MD_LINEAR=m
> CONFIG_MD_RAID0=m
> CONFIG_MD_RAID1=m
> CONFIG_MD_RAID10=m
> # CONFIG_MD_RAID456 is not set
> CONFIG_MD_MULTIPATH=m
> # CONFIG_MD_FAULTY is not set
> CONFIG_BLK_DEV_DM=m
> CONFIG_DM_CRYPT=m
> CONFIG_DM_SNAPSHOT=m
> CONFIG_DM_MIRROR=m
> CONFIG_DM_ZERO=m
> CONFIG_DM_MULTIPATH=m
> CONFIG_DM_MULTIPATH_EMC=m
>
> #
> # Fusion MPT device support
> #
> CONFIG_FUSION=y
> CONFIG_FUSION_SPI=m
> CONFIG_FUSION_FC=m
> CONFIG_FUSION_SAS=m
> CONFIG_FUSION_MAX_SGE=40
> CONFIG_FUSION_CTL=m
> CONFIG_FUSION_LAN=m
>
> #
> # IEEE 1394 (FireWire) support
> #
> # CONFIG_IEEE1394 is not set
>
> #
> # I2O device support
> #
> CONFIG_I2O=m
> CONFIG_I2O_LCT_NOTIFY_ON_CHANGES=y
> CONFIG_I2O_EXT_ADAPTEC=y
> CONFIG_I2O_EXT_ADAPTEC_DMA64=y
> CONFIG_I2O_CONFIG=m
> CONFIG_I2O_CONFIG_OLD_IOCTL=y
> # CONFIG_I2O_BUS is not set
> CONFIG_I2O_BLOCK=m
> CONFIG_I2O_SCSI=m
> CONFIG_I2O_PROC=m
>
> #
> # Network device support
> #
> CONFIG_NETDEVICES=y
> CONFIG_DUMMY=m
> CONFIG_BONDING=m
> # CONFIG_EQUALIZER is not set
> CONFIG_TUN=m
>
> #
> # ARCnet devices
> #
> # CONFIG_ARCNET is not set
>
> #
> # PHY device support
> #
> # CONFIG_PHYLIB is not set
>
> #
> # Ethernet (10 or 100Mbit)
> #
> CONFIG_NET_ETHERNET=y
> CONFIG_MII=m
> CONFIG_HAPPYMEAL=m
> CONFIG_SUNGEM=m
> # CONFIG_CASSINI is not set
> CONFIG_NET_VENDOR_3COM=y
> CONFIG_VORTEX=m
> CONFIG_TYPHOON=m
>
> #
> # Tulip family network device support
> #
> CONFIG_NET_TULIP=y
> CONFIG_DE2104X=m
> CONFIG_TULIP=m
> # CONFIG_TULIP_MWI is not set
> CONFIG_TULIP_MMIO=y
> # CONFIG_TULIP_NAPI is not set
> CONFIG_DE4X5=m
> CONFIG_WINBOND_840=m
> CONFIG_DM9102=m
> # CONFIG_ULI526X is not set
> # CONFIG_HP100 is not set
> CONFIG_NET_PCI=y
> CONFIG_PCNET32=m
> CONFIG_AMD8111_ETH=m
> CONFIG_AMD8111E_NAPI=y
> CONFIG_ADAPTEC_STARFIRE=m
> CONFIG_ADAPTEC_STARFIRE_NAPI=y
> CONFIG_B44=m
> CONFIG_FORCEDETH=m
> # CONFIG_DGRS is not set
> CONFIG_EEPRO100=m
> CONFIG_E100=m
> CONFIG_FEALNX=m
> CONFIG_NATSEMI=m
> CONFIG_NE2K_PCI=m
> CONFIG_8139CP=m
> CONFIG_8139TOO=m
> CONFIG_8139TOO_PIO=y
> # CONFIG_8139TOO_TUNE_TWISTER is not set
> CONFIG_8139TOO_8129=y
> # CONFIG_8139_OLD_RX_RESET is not set
> CONFIG_SIS900=m
> CONFIG_EPIC100=m
> # CONFIG_SUNDANCE is not set
> CONFIG_VIA_RHINE=m
> CONFIG_VIA_RHINE_MMIO=y
> # CONFIG_VIA_RHINE_NAPI is not set
> # CONFIG_NET_POCKET is not set
>
> #
> # Ethernet (1000 Mbit)
> #
> CONFIG_ACENIC=m
> # CONFIG_ACENIC_OMIT_TIGON_I is not set
> CONFIG_DL2K=m
> CONFIG_E1000=m
> CONFIG_E1000_NAPI=y
> # CONFIG_E1000_DISABLE_PACKET_SPLIT is not set
> CONFIG_NS83820=m
> # CONFIG_HAMACHI is not set
> # CONFIG_YELLOWFIN is not set
> CONFIG_R8169=m
> CONFIG_R8169_NAPI=y
> # CONFIG_R8169_VLAN is not set
> # CONFIG_SIS190 is not set
> # CONFIG_SKGE is not set
> CONFIG_SKY2=m
> CONFIG_SK98LIN=m
> CONFIG_VIA_VELOCITY=m
> CONFIG_TIGON3=m
> CONFIG_BNX2=m
>
> #
> # Ethernet (10000 Mbit)
> #
> # CONFIG_CHELSIO_T1 is not set
> CONFIG_IXGB=m
> CONFIG_IXGB_NAPI=y
> CONFIG_S2IO=m
> CONFIG_S2IO_NAPI=y
> # CONFIG_MYRI10GE is not set
>
> #
> # Token Ring devices
> #
> CONFIG_TR=y
> CONFIG_IBMOL=m
> CONFIG_3C359=m
> CONFIG_TMS380TR=m
> CONFIG_TMSPCI=m
> CONFIG_ABYSS=m
>
> #
> # Wireless LAN (non-hamradio)
> #
> CONFIG_NET_RADIO=y
> # CONFIG_NET_WIRELESS_RTNETLINK is not set
>
> #
> # Obsolete Wireless cards support (pre-802.11)
> #
> # CONFIG_STRIP is not set
>
> #
> # Wireless 802.11b ISA/PCI cards support
> #
> CONFIG_IPW2100=m
> # CONFIG_IPW2100_MONITOR is not set
> # CONFIG_IPW2100_DEBUG is not set
> CONFIG_IPW2200=m
> # CONFIG_IPW2200_MONITOR is not set
> # CONFIG_IPW2200_QOS is not set
> # CONFIG_IPW2200_DEBUG is not set
> # CONFIG_AIRO is not set
> CONFIG_HERMES=m
> CONFIG_PLX_HERMES=m
> CONFIG_TMD_HERMES=m
> # CONFIG_NORTEL_HERMES is not set
> CONFIG_PCI_HERMES=m
> CONFIG_ATMEL=m
> CONFIG_PCI_ATMEL=m
>
> #
> # Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
> #
> CONFIG_PRISM54=m
> # CONFIG_USB_ZD1201 is not set
> # CONFIG_HOSTAP is not set
> CONFIG_NET_WIRELESS=y
>
> #
> # Wan interfaces
> #
> # CONFIG_WAN is not set
>
> #
> # ATM drivers
> #
> # CONFIG_ATM_DUMMY is not set
> CONFIG_ATM_TCP=m
> CONFIG_ATM_LANAI=m
> CONFIG_ATM_ENI=m
> # CONFIG_ATM_ENI_DEBUG is not set
> # CONFIG_ATM_ENI_TUNE_BURST is not set
> CONFIG_ATM_FIRESTREAM=m
> # CONFIG_ATM_ZATM is not set
> CONFIG_ATM_IDT77252=m
> # CONFIG_ATM_IDT77252_DEBUG is not set
> # CONFIG_ATM_IDT77252_RCV_ALL is not set
> CONFIG_ATM_IDT77252_USE_SUNI=y
> CONFIG_ATM_AMBASSADOR=m
> # CONFIG_ATM_AMBASSADOR_DEBUG is not set
> CONFIG_ATM_HORIZON=m
> # CONFIG_ATM_HORIZON_DEBUG is not set
> CONFIG_ATM_FORE200E_MAYBE=m
> # CONFIG_ATM_FORE200E_PCA is not set
> CONFIG_ATM_HE=m
> # CONFIG_ATM_HE_USE_SUNI is not set
> CONFIG_FDDI=y
> # CONFIG_DEFXX is not set
> # CONFIG_SKFP is not set
> # CONFIG_HIPPI is not set
> # CONFIG_PLIP is not set
> CONFIG_PPP=m
> CONFIG_PPP_MULTILINK=y
> CONFIG_PPP_FILTER=y
> CONFIG_PPP_ASYNC=m
> CONFIG_PPP_SYNC_TTY=m
> CONFIG_PPP_DEFLATE=m
> # CONFIG_PPP_BSDCOMP is not set
> # CONFIG_PPP_MPPE is not set
> CONFIG_PPPOE=m
> CONFIG_PPPOATM=m
> # CONFIG_SLIP is not set
> CONFIG_NET_FC=y
> # CONFIG_SHAPER is not set
> CONFIG_NETCONSOLE=m
> CONFIG_NETPOLL=y
> # CONFIG_NETPOLL_RX is not set
> CONFIG_NETPOLL_TRAP=y
> CONFIG_NET_POLL_CONTROLLER=y
>
> #
> # ISDN subsystem
> #
> CONFIG_ISDN=m
>
> #
> # Old ISDN4Linux
> #
> CONFIG_ISDN_I4L=m
> CONFIG_ISDN_PPP=y
> CONFIG_ISDN_PPP_VJ=y
> CONFIG_ISDN_MPP=y
> CONFIG_IPPP_FILTER=y
> # CONFIG_ISDN_PPP_BSDCOMP is not set
> CONFIG_ISDN_AUDIO=y
> CONFIG_ISDN_TTY_FAX=y
>
> #
> # ISDN feature submodules
> #
> # CONFIG_ISDN_DIVERSION is not set
>
> #
> # ISDN4Linux hardware drivers
> #
>
> #
> # Passive cards
> #
> CONFIG_ISDN_DRV_HISAX=m
>
> #
> # D-channel protocol features
> #
> CONFIG_HISAX_EURO=y
> CONFIG_DE_AOC=y
> CONFIG_HISAX_NO_SENDCOMPLETE=y
> CONFIG_HISAX_NO_LLC=y
> CONFIG_HISAX_NO_KEYPAD=y
> CONFIG_HISAX_1TR6=y
> CONFIG_HISAX_NI1=y
> CONFIG_HISAX_MAX_CARDS=8
>
> #
> # HiSax supported cards
> #
> CONFIG_HISAX_16_3=y
> CONFIG_HISAX_TELESPCI=y
> CONFIG_HISAX_S0BOX=y
> CONFIG_HISAX_FRITZPCI=y
> CONFIG_HISAX_AVM_A1_PCMCIA=y
> CONFIG_HISAX_ELSA=y
> CONFIG_HISAX_DIEHLDIVA=y
> CONFIG_HISAX_SEDLBAUER=y
> CONFIG_HISAX_NETJET=y
> CONFIG_HISAX_NETJET_U=y
> CONFIG_HISAX_NICCY=y
> CONFIG_HISAX_BKM_A4T=y
> CONFIG_HISAX_SCT_QUADRO=y
> CONFIG_HISAX_GAZEL=y
> CONFIG_HISAX_HFC_PCI=y
> CONFIG_HISAX_W6692=y
> CONFIG_HISAX_HFC_SX=y
> CONFIG_HISAX_ENTERNOW_PCI=y
> # CONFIG_HISAX_DEBUG is not set
>
> #
> # HiSax PCMCIA card service modules
> #
>
> #
> # HiSax sub driver modules
> #
> CONFIG_HISAX_ST5481=m
> CONFIG_HISAX_HFCUSB=m
> # CONFIG_HISAX_HFC4S8S is not set
> CONFIG_HISAX_FRITZ_PCIPNP=m
> CONFIG_HISAX_HDLC=y
>
> #
> # Active cards
> #
>
> #
> # Siemens Gigaset
> #
> # CONFIG_ISDN_DRV_GIGASET is not set
>
> #
> # CAPI subsystem
> #
> CONFIG_ISDN_CAPI=m
> CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON=y
> CONFIG_ISDN_CAPI_MIDDLEWARE=y
> CONFIG_ISDN_CAPI_CAPI20=m
> CONFIG_ISDN_CAPI_CAPIFS_BOOL=y
> CONFIG_ISDN_CAPI_CAPIFS=m
> CONFIG_ISDN_CAPI_CAPIDRV=m
>
> #
> # CAPI hardware drivers
> #
>
> #
> # Active AVM cards
> #
> CONFIG_CAPI_AVM=y
> CONFIG_ISDN_DRV_AVMB1_B1PCI=m
> CONFIG_ISDN_DRV_AVMB1_B1PCIV4=y
> CONFIG_ISDN_DRV_AVMB1_B1PCMCIA=m
> CONFIG_ISDN_DRV_AVMB1_T1PCI=m
> CONFIG_ISDN_DRV_AVMB1_C4=m
>
> #
> # Active Eicon DIVA Server cards
> #
> # CONFIG_CAPI_EICON is not set
>
> #
> # Telephony Support
> #
> # CONFIG_PHONE is not set
>
> #
> # Input device support
> #
> CONFIG_INPUT=y
>
> #
> # Userland interfaces
> #
> CONFIG_INPUT_MOUSEDEV=y
> # CONFIG_INPUT_MOUSEDEV_PSAUX is not set
> CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
> CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
> CONFIG_INPUT_JOYDEV=m
> # CONFIG_INPUT_TSDEV is not set
> CONFIG_INPUT_EVDEV=y
> # CONFIG_INPUT_EVBUG is not set
>
> #
> # Input Device Drivers
> #
> CONFIG_INPUT_KEYBOARD=y
> CONFIG_KEYBOARD_ATKBD=y
> # CONFIG_KEYBOARD_SUNKBD is not set
> # CONFIG_KEYBOARD_LKKBD is not set
> # CONFIG_KEYBOARD_XTKBD is not set
> # CONFIG_KEYBOARD_NEWTON is not set
> CONFIG_INPUT_MOUSE=y
> CONFIG_MOUSE_PS2=y
> CONFIG_MOUSE_SERIAL=m
> CONFIG_MOUSE_VSXXXAA=m
> CONFIG_INPUT_JOYSTICK=y
> # CONFIG_JOYSTICK_ANALOG is not set
> # CONFIG_JOYSTICK_A3D is not set
> # CONFIG_JOYSTICK_ADI is not set
> # CONFIG_JOYSTICK_COBRA is not set
> # CONFIG_JOYSTICK_GF2K is not set
> # CONFIG_JOYSTICK_GRIP is not set
> # CONFIG_JOYSTICK_GRIP_MP is not set
> # CONFIG_JOYSTICK_GUILLEMOT is not set
> # CONFIG_JOYSTICK_INTERACT is not set
> # CONFIG_JOYSTICK_SIDEWINDER is not set
> # CONFIG_JOYSTICK_TMDC is not set
> # CONFIG_JOYSTICK_IFORCE is not set
> # CONFIG_JOYSTICK_WARRIOR is not set
> # CONFIG_JOYSTICK_MAGELLAN is not set
> # CONFIG_JOYSTICK_SPACEORB is not set
> # CONFIG_JOYSTICK_SPACEBALL is not set
> # CONFIG_JOYSTICK_STINGER is not set
> # CONFIG_JOYSTICK_TWIDJOY is not set
> # CONFIG_JOYSTICK_DB9 is not set
> # CONFIG_JOYSTICK_GAMECON is not set
> # CONFIG_JOYSTICK_TURBOGRAFX is not set
> # CONFIG_JOYSTICK_JOYDUMP is not set
> CONFIG_INPUT_TOUCHSCREEN=y
> CONFIG_TOUCHSCREEN_GUNZE=m
> # CONFIG_TOUCHSCREEN_ELO is not set
> # CONFIG_TOUCHSCREEN_MTOUCH is not set
> # CONFIG_TOUCHSCREEN_MK712 is not set
> CONFIG_INPUT_MISC=y
> CONFIG_INPUT_PCSPKR=m
> CONFIG_INPUT_UINPUT=m
>
> #
> # Hardware I/O ports
> #
> CONFIG_SERIO=y
> CONFIG_SERIO_I8042=y
> CONFIG_SERIO_SERPORT=y
> # CONFIG_SERIO_CT82C710 is not set
> # CONFIG_SERIO_PARKBD is not set
> # CONFIG_SERIO_PCIPS2 is not set
> CONFIG_SERIO_LIBPS2=y
> # CONFIG_SERIO_RAW is not set
> # CONFIG_GAMEPORT is not set
>
> #
> # Character devices
> #
> CONFIG_VT=y
> CONFIG_VT_CONSOLE=y
> CONFIG_HW_CONSOLE=y
> # CONFIG_VT_HW_CONSOLE_BINDING is not set
> CONFIG_SERIAL_NONSTANDARD=y
> # CONFIG_COMPUTONE is not set
> # CONFIG_ROCKETPORT is not set
> # CONFIG_CYCLADES is not set
> # CONFIG_DIGIEPCA is not set
> # CONFIG_MOXA_INTELLIO is not set
> # CONFIG_MOXA_SMARTIO is not set
> # CONFIG_ISI is not set
> # CONFIG_SYNCLINK is not set
> # CONFIG_SYNCLINKMP is not set
> # CONFIG_SYNCLINK_GT is not set
> CONFIG_N_HDLC=m
> # CONFIG_SPECIALIX is not set
> # CONFIG_SX is not set
> # CONFIG_RIO is not set
> CONFIG_STALDRV=y
>
> #
> # Serial drivers
> #
> CONFIG_SERIAL_8250=y
> CONFIG_SERIAL_8250_CONSOLE=y
> CONFIG_SERIAL_8250_PCI=y
> CONFIG_SERIAL_8250_NR_UARTS=4
> CONFIG_SERIAL_8250_RUNTIME_UARTS=4
> CONFIG_SERIAL_8250_EXTENDED=y
> # CONFIG_SERIAL_8250_MANY_PORTS is not set
> CONFIG_SERIAL_8250_SHARE_IRQ=y
> CONFIG_SERIAL_8250_DETECT_IRQ=y
> CONFIG_SERIAL_8250_RSA=y
>
> #
> # Non-8250 serial port support
> #
> CONFIG_SERIAL_CORE=y
> CONFIG_SERIAL_CORE_CONSOLE=y
> # CONFIG_SERIAL_JSM is not set
> CONFIG_UNIX98_PTYS=y
> # CONFIG_LEGACY_PTYS is not set
> CONFIG_PRINTER=m
> CONFIG_LP_CONSOLE=y
> CONFIG_PPDEV=m
> # CONFIG_TIPAR is not set
>
> #
> # IPMI
> #
> CONFIG_IPMI_HANDLER=m
> # CONFIG_IPMI_PANIC_EVENT is not set
> CONFIG_IPMI_DEVICE_INTERFACE=m
> CONFIG_IPMI_SI=m
> CONFIG_IPMI_WATCHDOG=m
> CONFIG_IPMI_POWEROFF=m
>
> #
> # Watchdog Cards
> #
> CONFIG_WATCHDOG=y
> # CONFIG_WATCHDOG_NOWAYOUT is not set
>
> #
> # Watchdog Device Drivers
> #
> CONFIG_SOFT_WATCHDOG=m
> CONFIG_ACQUIRE_WDT=m
> CONFIG_ADVANTECH_WDT=m
> CONFIG_ALIM1535_WDT=m
> CONFIG_ALIM7101_WDT=m
> CONFIG_SC520_WDT=m
> CONFIG_EUROTECH_WDT=m
> CONFIG_IB700_WDT=m
> # CONFIG_IBMASR is not set
> CONFIG_WAFER_WDT=m
> # CONFIG_I6300ESB_WDT is not set
> CONFIG_I8XX_TCO=m
> CONFIG_SC1200_WDT=m
> # CONFIG_60XX_WDT is not set
> # CONFIG_SBC8360_WDT is not set
> CONFIG_CPU5_WDT=m
> CONFIG_W83627HF_WDT=m
> CONFIG_W83877F_WDT=m
> # CONFIG_W83977F_WDT is not set
> CONFIG_MACHZ_WDT=m
> # CONFIG_SBC_EPX_C3_WATCHDOG is not set
>
> #
> # PCI-based Watchdog Cards
> #
> CONFIG_PCIPCWATCHDOG=m
> CONFIG_WDTPCI=m
> CONFIG_WDT_501_PCI=y
>
> #
> # USB-based Watchdog Cards
> #
> CONFIG_USBPCWATCHDOG=m
> CONFIG_HW_RANDOM=y
> CONFIG_HW_RANDOM_INTEL=y
> CONFIG_HW_RANDOM_AMD=y
> CONFIG_HW_RANDOM_GEODE=y
> # CONFIG_NVRAM is not set
> CONFIG_RTC=y
> CONFIG_DTLK=m
> # CONFIG_R3964 is not set
> # CONFIG_APPLICOM is not set
>
> #
> # Ftape, the floppy tape device driver
> #
> CONFIG_AGP=y
> CONFIG_AGP_AMD64=y
> # CONFIG_AGP_INTEL is not set
> # CONFIG_AGP_SIS is not set
> # CONFIG_AGP_VIA is not set
> CONFIG_DRM=y
> # CONFIG_DRM_TDFX is not set
> CONFIG_DRM_R128=m
> CONFIG_DRM_RADEON=m
> CONFIG_DRM_MGA=m
> # CONFIG_DRM_SIS is not set
> # CONFIG_DRM_VIA is not set
> # CONFIG_DRM_SAVAGE is not set
> # CONFIG_MWAVE is not set
> # CONFIG_PC8736x_GPIO is not set
> CONFIG_RAW_DRIVER=y
> CONFIG_MAX_RAW_DEVS=8192
> # CONFIG_HPET is not set
> CONFIG_HANGCHECK_TIMER=m
>
> #
> # TPM devices
> #
> # CONFIG_TCG_TPM is not set
> # CONFIG_TELCLOCK is not set
>
> #
> # I2C support
> #
> CONFIG_I2C=m
> CONFIG_I2C_CHARDEV=m
>
> #
> # I2C Algorithms
> #
> CONFIG_I2C_ALGOBIT=m
> CONFIG_I2C_ALGOPCF=m
> CONFIG_I2C_ALGOPCA=m
>
> #
> # I2C Hardware Bus support
> #
> CONFIG_I2C_ALI1535=m
> CONFIG_I2C_ALI1563=m
> CONFIG_I2C_ALI15X3=m
> CONFIG_I2C_AMD756=m
> # CONFIG_I2C_AMD756_S4882 is not set
> CONFIG_I2C_AMD8111=m
> CONFIG_I2C_I801=m
> CONFIG_I2C_I810=m
> # CONFIG_I2C_PIIX4 is not set
> CONFIG_I2C_ISA=m
> CONFIG_I2C_NFORCE2=m
> # CONFIG_I2C_OCORES is not set
> # CONFIG_I2C_PARPORT is not set
> # CONFIG_I2C_PARPORT_LIGHT is not set
> CONFIG_I2C_PROSAVAGE=m
> CONFIG_I2C_SAVAGE4=m
> CONFIG_I2C_SIS5595=m
> CONFIG_I2C_SIS630=m
> CONFIG_I2C_SIS96X=m
> # CONFIG_I2C_STUB is not set
> CONFIG_I2C_VIA=m
> CONFIG_I2C_VIAPRO=m
> CONFIG_I2C_VOODOO3=m
> # CONFIG_I2C_PCA_ISA is not set
>
> #
> # Miscellaneous I2C Chip support
> #
> # CONFIG_SENSORS_DS1337 is not set
> # CONFIG_SENSORS_DS1374 is not set
> CONFIG_SENSORS_EEPROM=m
> CONFIG_SENSORS_PCF8574=m
> # CONFIG_SENSORS_PCA9539 is not set
> CONFIG_SENSORS_PCF8591=m
> # CONFIG_SENSORS_MAX6875 is not set
> # CONFIG_I2C_DEBUG_CORE is not set
> # CONFIG_I2C_DEBUG_ALGO is not set
> # CONFIG_I2C_DEBUG_BUS is not set
> # CONFIG_I2C_DEBUG_CHIP is not set
>
> #
> # SPI support
> #
> # CONFIG_SPI is not set
> # CONFIG_SPI_MASTER is not set
>
> #
> # Dallas's 1-wire bus
> #
>
> #
> # Hardware Monitoring support
> #
> CONFIG_HWMON=y
> CONFIG_HWMON_VID=m
> # CONFIG_SENSORS_ABITUGURU is not set
> CONFIG_SENSORS_ADM1021=m
> CONFIG_SENSORS_ADM1025=m
> # CONFIG_SENSORS_ADM1026 is not set
> CONFIG_SENSORS_ADM1031=m
> # CONFIG_SENSORS_ADM9240 is not set
> CONFIG_SENSORS_ASB100=m
> # CONFIG_SENSORS_ATXP1 is not set
> CONFIG_SENSORS_DS1621=m
> # CONFIG_SENSORS_F71805F is not set
> CONFIG_SENSORS_FSCHER=m
> # CONFIG_SENSORS_FSCPOS is not set
> CONFIG_SENSORS_GL518SM=m
> # CONFIG_SENSORS_GL520SM is not set
> CONFIG_SENSORS_IT87=m
> # CONFIG_SENSORS_LM63 is not set
> CONFIG_SENSORS_LM75=m
> CONFIG_SENSORS_LM77=m
> CONFIG_SENSORS_LM78=m
> CONFIG_SENSORS_LM80=m
> CONFIG_SENSORS_LM83=m
> CONFIG_SENSORS_LM85=m
> # CONFIG_SENSORS_LM87 is not set
> CONFIG_SENSORS_LM90=m
> # CONFIG_SENSORS_LM92 is not set
> CONFIG_SENSORS_MAX1619=m
> # CONFIG_SENSORS_PC87360 is not set
> # CONFIG_SENSORS_SIS5595 is not set
> CONFIG_SENSORS_SMSC47M1=m
> # CONFIG_SENSORS_SMSC47M192 is not set
> # CONFIG_SENSORS_SMSC47B397 is not set
> CONFIG_SENSORS_VIA686A=m
> # CONFIG_SENSORS_VT8231 is not set
> CONFIG_SENSORS_W83781D=m
> # CONFIG_SENSORS_W83791D is not set
> # CONFIG_SENSORS_W83792D is not set
> CONFIG_SENSORS_W83L785TS=m
> CONFIG_SENSORS_W83627HF=m
> # CONFIG_SENSORS_W83627EHF is not set
> # CONFIG_SENSORS_HDAPS is not set
> # CONFIG_HWMON_DEBUG_CHIP is not set
>
> #
> # Misc devices
> #
> # CONFIG_IBM_ASM is not set
>
> #
> # Multimedia devices
> #
> CONFIG_VIDEO_DEV=m
> CONFIG_VIDEO_V4L1=y
> CONFIG_VIDEO_V4L1_COMPAT=y
> CONFIG_VIDEO_V4L2=y
>
> #
> # Video Capture Adapters
> #
>
> #
> # Video Capture Adapters
> #
> # CONFIG_VIDEO_ADV_DEBUG is not set
> # CONFIG_VIDEO_VIVI is not set
> # CONFIG_VIDEO_BT848 is not set
> # CONFIG_VIDEO_BWQCAM is not set
> # CONFIG_VIDEO_CQCAM is not set
> # CONFIG_VIDEO_W9966 is not set
> # CONFIG_VIDEO_CPIA is not set
> # CONFIG_VIDEO_CPIA2 is not set
> # CONFIG_VIDEO_SAA5246A is not set
> # CONFIG_VIDEO_SAA5249 is not set
> # CONFIG_TUNER_3036 is not set
> # CONFIG_VIDEO_STRADIS is not set
> # CONFIG_VIDEO_ZORAN is not set
> # CONFIG_VIDEO_SAA7134 is not set
> # CONFIG_VIDEO_MXB is not set
> # CONFIG_VIDEO_DPC is not set
> # CONFIG_VIDEO_HEXIUM_ORION is not set
> # CONFIG_VIDEO_HEXIUM_GEMINI is not set
> # CONFIG_VIDEO_CX88 is not set
>
> #
> # Encoders and Decoders
> #
> # CONFIG_VIDEO_MSP3400 is not set
> # CONFIG_VIDEO_CS53L32A is not set
> # CONFIG_VIDEO_TLV320AIC23B is not set
> # CONFIG_VIDEO_WM8775 is not set
> # CONFIG_VIDEO_WM8739 is not set
> # CONFIG_VIDEO_CX2341X is not set
> # CONFIG_VIDEO_CX25840 is not set
> # CONFIG_VIDEO_SAA711X is not set
> # CONFIG_VIDEO_SAA7127 is not set
> # CONFIG_VIDEO_UPD64031A is not set
> # CONFIG_VIDEO_UPD64083 is not set
>
> #
> # V4L USB devices
> #
> # CONFIG_VIDEO_PVRUSB2 is not set
> # CONFIG_VIDEO_EM28XX is not set
> CONFIG_VIDEO_USBVIDEO=m
> CONFIG_USB_VICAM=m
> CONFIG_USB_IBMCAM=m
> CONFIG_USB_KONICAWC=m
> # CONFIG_USB_QUICKCAM_MESSENGER is not set
> # CONFIG_USB_ET61X251 is not set
> CONFIG_VIDEO_OVCAMCHIP=m
> CONFIG_USB_W9968CF=m
> CONFIG_USB_OV511=m
> CONFIG_USB_SE401=m
> CONFIG_USB_SN9C102=m
> CONFIG_USB_STV680=m
> # CONFIG_USB_ZC0301 is not set
> CONFIG_USB_PWC=m
> # CONFIG_USB_PWC_DEBUG is not set
>
> #
> # Radio Adapters
> #
> # CONFIG_RADIO_GEMTEK_PCI is not set
> # CONFIG_RADIO_MAXIRADIO is not set
> # CONFIG_RADIO_MAESTRO is not set
> CONFIG_USB_DSBR=m
>
> #
> # Digital Video Broadcasting Devices
> #
> # CONFIG_DVB is not set
> CONFIG_USB_DABUSB=m
>
> #
> # Graphics support
> #
> CONFIG_FIRMWARE_EDID=y
> CONFIG_FB=y
> CONFIG_FB_CFB_FILLRECT=y
> CONFIG_FB_CFB_COPYAREA=y
> CONFIG_FB_CFB_IMAGEBLIT=y
> # CONFIG_FB_MACMODES is not set
> # CONFIG_FB_BACKLIGHT is not set
> CONFIG_FB_MODE_HELPERS=y
> # CONFIG_FB_TILEBLITTING is not set
> CONFIG_FB_CIRRUS=m
> # CONFIG_FB_PM2 is not set
> # CONFIG_FB_CYBER2000 is not set
> # CONFIG_FB_ARC is not set
> # CONFIG_FB_ASILIANT is not set
> # CONFIG_FB_IMSTT is not set
> CONFIG_FB_VGA16=m
> CONFIG_FB_VESA=y
> # CONFIG_FB_HGA is not set
> # CONFIG_FB_S1D13XXX is not set
> # CONFIG_FB_NVIDIA is not set
> CONFIG_FB_RIVA=m
> # CONFIG_FB_RIVA_I2C is not set
> # CONFIG_FB_RIVA_DEBUG is not set
> # CONFIG_FB_INTEL is not set
> # CONFIG_FB_MATROX is not set
> # CONFIG_FB_RADEON is not set
> # CONFIG_FB_ATY128 is not set
> # CONFIG_FB_ATY is not set
> # CONFIG_FB_SAVAGE is not set
> # CONFIG_FB_SIS is not set
> # CONFIG_FB_NEOMAGIC is not set
> CONFIG_FB_KYRO=m
> # CONFIG_FB_3DFX is not set
> # CONFIG_FB_VOODOO1 is not set
> # CONFIG_FB_TRIDENT is not set
> # CONFIG_FB_GEODE is not set
> # CONFIG_FB_VIRTUAL is not set
>
> #
> # Console display driver support
> #
> CONFIG_VGA_CONSOLE=y
> # CONFIG_VGACON_SOFT_SCROLLBACK is not set
> CONFIG_VIDEO_SELECT=y
> CONFIG_DUMMY_CONSOLE=y
> CONFIG_FRAMEBUFFER_CONSOLE=y
> # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
> # CONFIG_FONTS is not set
> CONFIG_FONT_8x8=y
> CONFIG_FONT_8x16=y
>
> #
> # Logo configuration
> #
> CONFIG_LOGO=y
> # CONFIG_LOGO_LINUX_MONO is not set
> # CONFIG_LOGO_LINUX_VGA16 is not set
> CONFIG_LOGO_LINUX_CLUT224=y
> # CONFIG_BACKLIGHT_LCD_SUPPORT is not set
>
> #
> # Sound
> #
> CONFIG_SOUND=m
>
> #
> # Advanced Linux Sound Architecture
> #
> CONFIG_SND=m
> CONFIG_SND_TIMER=m
> CONFIG_SND_PCM=m
> CONFIG_SND_HWDEP=m
> CONFIG_SND_RAWMIDI=m
> CONFIG_SND_SEQUENCER=m
> CONFIG_SND_SEQ_DUMMY=m
> CONFIG_SND_OSSEMUL=y
> CONFIG_SND_MIXER_OSS=m
> CONFIG_SND_PCM_OSS=m
> CONFIG_SND_PCM_OSS_PLUGINS=y
> CONFIG_SND_SEQUENCER_OSS=y
> CONFIG_SND_RTCTIMER=m
> CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
> # CONFIG_SND_DYNAMIC_MINORS is not set
> CONFIG_SND_SUPPORT_OLD_API=y
> CONFIG_SND_VERBOSE_PROCFS=y
> # CONFIG_SND_VERBOSE_PRINTK is not set
> # CONFIG_SND_DEBUG is not set
>
> #
> # Generic devices
> #
> CONFIG_SND_MPU401_UART=m
> CONFIG_SND_OPL3_LIB=m
> CONFIG_SND_VX_LIB=m
> CONFIG_SND_AC97_CODEC=m
> CONFIG_SND_AC97_BUS=m
> CONFIG_SND_DUMMY=m
> CONFIG_SND_VIRMIDI=m
> CONFIG_SND_MTPAV=m
> # CONFIG_SND_SERIAL_U16550 is not set
> CONFIG_SND_MPU401=m
>
> #
> # PCI devices
> #
> # CONFIG_SND_AD1889 is not set
> # CONFIG_SND_ALS300 is not set
> CONFIG_SND_ALS4000=m
> CONFIG_SND_ALI5451=m
> CONFIG_SND_ATIIXP=m
> CONFIG_SND_ATIIXP_MODEM=m
> CONFIG_SND_AU8810=m
> CONFIG_SND_AU8820=m
> CONFIG_SND_AU8830=m
> CONFIG_SND_AZT3328=m
> CONFIG_SND_BT87X=m
> # CONFIG_SND_BT87X_OVERCLOCK is not set
> # CONFIG_SND_CA0106 is not set
> CONFIG_SND_CMIPCI=m
> CONFIG_SND_CS4281=m
> CONFIG_SND_CS46XX=m
> CONFIG_SND_CS46XX_NEW_DSP=y
> # CONFIG_SND_DARLA20 is not set
> # CONFIG_SND_GINA20 is not set
> # CONFIG_SND_LAYLA20 is not set
> # CONFIG_SND_DARLA24 is not set
> # CONFIG_SND_GINA24 is not set
> # CONFIG_SND_LAYLA24 is not set
> # CONFIG_SND_MONA is not set
> # CONFIG_SND_MIA is not set
> # CONFIG_SND_ECHO3G is not set
> # CONFIG_SND_INDIGO is not set
> # CONFIG_SND_INDIGOIO is not set
> # CONFIG_SND_INDIGODJ is not set
> CONFIG_SND_EMU10K1=m
> # CONFIG_SND_EMU10K1X is not set
> CONFIG_SND_ENS1370=m
> CONFIG_SND_ENS1371=m
> CONFIG_SND_ES1938=m
> CONFIG_SND_ES1968=m
> CONFIG_SND_FM801=m
> # CONFIG_SND_FM801_TEA575X_BOOL is not set
> # CONFIG_SND_HDA_INTEL is not set
> CONFIG_SND_HDSP=m
> # CONFIG_SND_HDSPM is not set
> CONFIG_SND_ICE1712=m
> CONFIG_SND_ICE1724=m
> CONFIG_SND_INTEL8X0=m
> CONFIG_SND_INTEL8X0M=m
> CONFIG_SND_KORG1212=m
> CONFIG_SND_MAESTRO3=m
> CONFIG_SND_MIXART=m
> CONFIG_SND_NM256=m
> # CONFIG_SND_PCXHR is not set
> # CONFIG_SND_RIPTIDE is not set
> CONFIG_SND_RME32=m
> CONFIG_SND_RME96=m
> CONFIG_SND_RME9652=m
> CONFIG_SND_SONICVIBES=m
> CONFIG_SND_TRIDENT=m
> CONFIG_SND_VIA82XX=m
> # CONFIG_SND_VIA82XX_MODEM is not set
> CONFIG_SND_VX222=m
> CONFIG_SND_YMFPCI=m
>
> #
> # USB devices
> #
> CONFIG_SND_USB_AUDIO=m
> CONFIG_SND_USB_USX2Y=m
>
> #
> # Open Sound System
> #
> # CONFIG_SOUND_PRIME is not set
>
> #
> # USB support
> #
> CONFIG_USB_ARCH_HAS_HCD=y
> CONFIG_USB_ARCH_HAS_OHCI=y
> CONFIG_USB_ARCH_HAS_EHCI=y
> CONFIG_USB=y
> # CONFIG_USB_DEBUG is not set
>
> #
> # Miscellaneous USB options
> #
> CONFIG_USB_DEVICEFS=y
> # CONFIG_USB_BANDWIDTH is not set
> # CONFIG_USB_DYNAMIC_MINORS is not set
> CONFIG_USB_SUSPEND=y
> # CONFIG_USB_OTG is not set
>
> #
> # USB Host Controller Drivers
> #
> CONFIG_USB_EHCI_HCD=m
> CONFIG_USB_EHCI_SPLIT_ISO=y
> CONFIG_USB_EHCI_ROOT_HUB_TT=y
> # CONFIG_USB_EHCI_TT_NEWSCHED is not set
> # CONFIG_USB_ISP116X_HCD is not set
> CONFIG_USB_OHCI_HCD=m
> # CONFIG_USB_OHCI_BIG_ENDIAN is not set
> CONFIG_USB_OHCI_LITTLE_ENDIAN=y
> CONFIG_USB_UHCI_HCD=m
> # CONFIG_USB_SL811_HCD is not set
>
> #
> # USB Device Class drivers
> #
> CONFIG_USB_ACM=m
> CONFIG_USB_PRINTER=m
>
> #
> # NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
> #
>
> #
> # may also be needed; see USB_STORAGE Help for more information
> #
> CONFIG_USB_STORAGE=m
> # CONFIG_USB_STORAGE_DEBUG is not set
> CONFIG_USB_STORAGE_DATAFAB=y
> CONFIG_USB_STORAGE_FREECOM=y
> CONFIG_USB_STORAGE_ISD200=y
> CONFIG_USB_STORAGE_DPCM=y
> # CONFIG_USB_STORAGE_USBAT is not set
> CONFIG_USB_STORAGE_SDDR09=y
> CONFIG_USB_STORAGE_SDDR55=y
> CONFIG_USB_STORAGE_JUMPSHOT=y
> # CONFIG_USB_STORAGE_ALAUDA is not set
> # CONFIG_USB_LIBUSUAL is not set
>
> #
> # USB Input Devices
> #
> CONFIG_USB_HID=y
> CONFIG_USB_HIDINPUT=y
> # CONFIG_USB_HIDINPUT_POWERBOOK is not set
> CONFIG_HID_FF=y
> CONFIG_HID_PID=y
> CONFIG_LOGITECH_FF=y
> CONFIG_THRUSTMASTER_FF=y
> CONFIG_USB_HIDDEV=y
> CONFIG_USB_AIPTEK=m
> CONFIG_USB_WACOM=m
> # CONFIG_USB_ACECAD is not set
> CONFIG_USB_KBTAB=m
> CONFIG_USB_POWERMATE=m
> # CONFIG_USB_TOUCHSCREEN is not set
> # CONFIG_USB_YEALINK is not set
> CONFIG_USB_XPAD=m
> CONFIG_USB_ATI_REMOTE=m
> # CONFIG_USB_ATI_REMOTE2 is not set
> # CONFIG_USB_KEYSPAN_REMOTE is not set
> # CONFIG_USB_APPLETOUCH is not set
>
> #
> # USB Imaging devices
> #
> CONFIG_USB_MDC800=m
> CONFIG_USB_MICROTEK=m
>
> #
> # USB Network Adapters
> #
> CONFIG_USB_CATC=m
> CONFIG_USB_KAWETH=m
> CONFIG_USB_PEGASUS=m
> CONFIG_USB_RTL8150=m
> CONFIG_USB_USBNET=m
> CONFIG_USB_NET_AX8817X=m
> CONFIG_USB_NET_CDCETHER=m
> # CONFIG_USB_NET_GL620A is not set
> CONFIG_USB_NET_NET1080=m
> # CONFIG_USB_NET_PLUSB is not set
> # CONFIG_USB_NET_RNDIS_HOST is not set
> # CONFIG_USB_NET_CDC_SUBSET is not set
> CONFIG_USB_NET_ZAURUS=m
> CONFIG_USB_MON=y
>
> #
> # USB port drivers
> #
> CONFIG_USB_USS720=m
>
> #
> # USB Serial Converter support
> #
> CONFIG_USB_SERIAL=m
> CONFIG_USB_SERIAL_GENERIC=y
> # CONFIG_USB_SERIAL_AIRPRIME is not set
> # CONFIG_USB_SERIAL_ARK3116 is not set
> CONFIG_USB_SERIAL_BELKIN=m
> # CONFIG_USB_SERIAL_WHITEHEAT is not set
> CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
> # CONFIG_USB_SERIAL_CP2101 is not set
> # CONFIG_USB_SERIAL_CYPRESS_M8 is not set
> CONFIG_USB_SERIAL_EMPEG=m
> CONFIG_USB_SERIAL_FTDI_SIO=m
> # CONFIG_USB_SERIAL_FUNSOFT is not set
> CONFIG_USB_SERIAL_VISOR=m
> CONFIG_USB_SERIAL_IPAQ=m
> CONFIG_USB_SERIAL_IR=m
> CONFIG_USB_SERIAL_EDGEPORT=m
> CONFIG_USB_SERIAL_EDGEPORT_TI=m
> # CONFIG_USB_SERIAL_GARMIN is not set
> # CONFIG_USB_SERIAL_IPW is not set
> CONFIG_USB_SERIAL_KEYSPAN_PDA=m
> CONFIG_USB_SERIAL_KEYSPAN=m
> CONFIG_USB_SERIAL_KEYSPAN_MPR=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19=y
> CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
> CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
> CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
> CONFIG_USB_SERIAL_KLSI=m
> CONFIG_USB_SERIAL_KOBIL_SCT=m
> CONFIG_USB_SERIAL_MCT_U232=m
> # CONFIG_USB_SERIAL_NAVMAN is not set
> CONFIG_USB_SERIAL_PL2303=m
> # CONFIG_USB_SERIAL_HP4X is not set
> CONFIG_USB_SERIAL_SAFE=m
> CONFIG_USB_SERIAL_SAFE_PADDED=y
> # CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
> # CONFIG_USB_SERIAL_TI is not set
> CONFIG_USB_SERIAL_CYBERJACK=m
> CONFIG_USB_SERIAL_XIRCOM=m
> # CONFIG_USB_SERIAL_OPTION is not set
> CONFIG_USB_SERIAL_OMNINET=m
> CONFIG_USB_EZUSB=y
>
> #
> # USB Miscellaneous drivers
> #
> CONFIG_USB_EMI62=m
> # CONFIG_USB_EMI26 is not set
> CONFIG_USB_AUERSWALD=m
> CONFIG_USB_RIO500=m
> CONFIG_USB_LEGOTOWER=m
> CONFIG_USB_LCD=m
> CONFIG_USB_LED=m
> # CONFIG_USB_CYPRESS_CY7C63 is not set
> # CONFIG_USB_CYTHERM is not set
> # CONFIG_USB_PHIDGETKIT is not set
> CONFIG_USB_PHIDGETSERVO=m
> # CONFIG_USB_IDMOUSE is not set
> # CONFIG_USB_APPLEDISPLAY is not set
> # CONFIG_USB_SISUSBVGA is not set
> # CONFIG_USB_LD is not set
> CONFIG_USB_TEST=m
>
> #
> # USB DSL modem support
> #
> CONFIG_USB_ATM=m
> CONFIG_USB_SPEEDTOUCH=m
> # CONFIG_USB_CXACRU is not set
> # CONFIG_USB_UEAGLEATM is not set
> # CONFIG_USB_XUSBATM is not set
>
> #
> # USB Gadget Support
> #
> # CONFIG_USB_GADGET is not set
>
> #
> # MMC/SD Card support
> #
> # CONFIG_MMC is not set
>
> #
> # LED devices
> #
> # CONFIG_NEW_LEDS is not set
>
> #
> # LED drivers
> #
>
> #
> # LED Triggers
> #
>
> #
> # InfiniBand support
> #
> CONFIG_INFINIBAND=m
> CONFIG_INFINIBAND_USER_MAD=m
> CONFIG_INFINIBAND_USER_ACCESS=m
> CONFIG_INFINIBAND_ADDR_TRANS=y
> CONFIG_INFINIBAND_MTHCA=m
> CONFIG_INFINIBAND_MTHCA_DEBUG=y
> # CONFIG_IPATH_CORE is not set
> CONFIG_INFINIBAND_IPOIB=m
> CONFIG_INFINIBAND_IPOIB_DEBUG=y
> # CONFIG_INFINIBAND_IPOIB_DEBUG_DATA is not set
> CONFIG_INFINIBAND_SRP=m
> # CONFIG_INFINIBAND_ISER is not set
>
> #
> # EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
> #
> CONFIG_EDAC=m
>
> #
> # Reporting subsystems
> #
> # CONFIG_EDAC_DEBUG is not set
> CONFIG_EDAC_MM_EDAC=m
> CONFIG_EDAC_E752X=m
> CONFIG_EDAC_POLL=y
>
> #
> # Real Time Clock
> #
> # CONFIG_RTC_CLASS is not set
>
> #
> # DMA Engine support
> #
> # CONFIG_DMA_ENGINE is not set
>
> #
> # DMA Clients
> #
>
> #
> # DMA Devices
> #
>
> #
> # Firmware Drivers
> #
> CONFIG_EDD=m
> CONFIG_DELL_RBU=m
> # CONFIG_DCDBAS is not set
>
> #
> # File systems
> #
> CONFIG_EXT2_FS=y
> CONFIG_EXT2_FS_XATTR=y
> CONFIG_EXT2_FS_POSIX_ACL=y
> CONFIG_EXT2_FS_SECURITY=y
> # CONFIG_EXT2_FS_XIP is not set
> CONFIG_EXT3_FS=m
> CONFIG_EXT3_FS_XATTR=y
> CONFIG_EXT3_FS_POSIX_ACL=y
> CONFIG_EXT3_FS_SECURITY=y
> CONFIG_JBD=m
> # CONFIG_JBD_DEBUG is not set
> CONFIG_FS_MBCACHE=y
> CONFIG_REISERFS_FS=m
> # CONFIG_REISERFS_CHECK is not set
> # CONFIG_REISERFS_PROC_INFO is not set
> # CONFIG_REISERFS_FS_XATTR is not set
> CONFIG_JFS_FS=m
> CONFIG_JFS_POSIX_ACL=y
> CONFIG_JFS_SECURITY=y
> # CONFIG_JFS_DEBUG is not set
> # CONFIG_JFS_STATISTICS is not set
> CONFIG_FS_POSIX_ACL=y
> CONFIG_XFS_FS=m
> CONFIG_XFS_QUOTA=y
> CONFIG_XFS_SECURITY=y
> CONFIG_XFS_POSIX_ACL=y
> CONFIG_XFS_RT=y
> # CONFIG_OCFS2_FS is not set
> # CONFIG_MINIX_FS is not set
> # CONFIG_ROMFS_FS is not set
> CONFIG_INOTIFY=y
> CONFIG_INOTIFY_USER=y
> CONFIG_QUOTA=y
> # CONFIG_QFMT_V1 is not set
> CONFIG_QFMT_V2=y
> CONFIG_QUOTACTL=y
> CONFIG_DNOTIFY=y
> # CONFIG_AUTOFS_FS is not set
> CONFIG_AUTOFS4_FS=m
> # CONFIG_FUSE_FS is not set
>
> #
> # CD-ROM/DVD Filesystems
> #
> CONFIG_ISO9660_FS=y
> CONFIG_JOLIET=y
> CONFIG_ZISOFS=y
> CONFIG_ZISOFS_FS=y
> CONFIG_UDF_FS=m
> CONFIG_UDF_NLS=y
>
> #
> # DOS/FAT/NT Filesystems
> #
> CONFIG_FAT_FS=m
> CONFIG_MSDOS_FS=m
> CONFIG_VFAT_FS=m
> CONFIG_FAT_DEFAULT_CODEPAGE=437
> CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
> # CONFIG_NTFS_FS is not set
>
> #
> # Pseudo filesystems
> #
> CONFIG_PROC_FS=y
> CONFIG_PROC_KCORE=y
> CONFIG_SYSFS=y
> CONFIG_TMPFS=y
> CONFIG_HUGETLBFS=y
> CONFIG_HUGETLB_PAGE=y
> CONFIG_RAMFS=y
> # CONFIG_CONFIGFS_FS is not set
>
> #
> # Miscellaneous filesystems
> #
> # CONFIG_ADFS_FS is not set
> # CONFIG_AFFS_FS is not set
> CONFIG_HFS_FS=m
> CONFIG_HFSPLUS_FS=m
> # CONFIG_BEFS_FS is not set
> # CONFIG_BFS_FS is not set
> # CONFIG_EFS_FS is not set
> # CONFIG_JFFS_FS is not set
> CONFIG_JFFS2_FS=m
> CONFIG_JFFS2_FS_DEBUG=0
> CONFIG_JFFS2_FS_WRITEBUFFER=y
> # CONFIG_JFFS2_SUMMARY is not set
> # CONFIG_JFFS2_FS_XATTR is not set
> # CONFIG_JFFS2_COMPRESSION_OPTIONS is not set
> CONFIG_JFFS2_ZLIB=y
> CONFIG_JFFS2_RTIME=y
> # CONFIG_JFFS2_RUBIN is not set
> CONFIG_CRAMFS=m
> CONFIG_VXFS_FS=m
> # CONFIG_HPFS_FS is not set
> # CONFIG_QNX4FS_FS is not set
> # CONFIG_SYSV_FS is not set
> # CONFIG_UFS_FS is not set
>
> #
> # Network File Systems
> #
> CONFIG_NFS_FS=m
> CONFIG_NFS_V3=y
> CONFIG_NFS_V3_ACL=y
> # CONFIG_NFS_V4 is not set
> # CONFIG_NFS_DIRECTIO is not set
> CONFIG_SUNRPC_XPRT_RDMA=m
> CONFIG_NFSD=m
> CONFIG_NFSD_V2_ACL=y
> CONFIG_NFSD_V3=y
> CONFIG_NFSD_V3_ACL=y
> # CONFIG_NFSD_V4 is not set
> CONFIG_NFSD_TCP=y
> CONFIG_NFSD_RDMA=y
> CONFIG_LOCKD=m
> CONFIG_LOCKD_V4=y
> CONFIG_EXPORTFS=m
> CONFIG_NFS_ACL_SUPPORT=m
> CONFIG_NFS_COMMON=y
> CONFIG_SUNRPC=m
> # CONFIG_RPCBIND_VERSION3 is not set
> # CONFIG_RPCSEC_GSS_KRB5 is not set
> # CONFIG_RPCSEC_GSS_SPKM3 is not set
> CONFIG_SMB_FS=m
> # CONFIG_SMB_NLS_DEFAULT is not set
> CONFIG_CIFS=m
> # CONFIG_CIFS_STATS is not set
> # CONFIG_CIFS_WEAK_PW_HASH is not set
> CONFIG_CIFS_XATTR=y
> CONFIG_CIFS_POSIX=y
> # CONFIG_CIFS_DEBUG2 is not set
> # CONFIG_CIFS_EXPERIMENTAL is not set
> # CONFIG_NCP_FS is not set
> # CONFIG_CODA_FS is not set
> # CONFIG_AFS_FS is not set
> # CONFIG_9P_FS is not set
>
> #
> # Partition Types
> #
> CONFIG_PARTITION_ADVANCED=y
> # CONFIG_ACORN_PARTITION is not set
> CONFIG_OSF_PARTITION=y
> # CONFIG_AMIGA_PARTITION is not set
> # CONFIG_ATARI_PARTITION is not set
> CONFIG_MAC_PARTITION=y
> CONFIG_MSDOS_PARTITION=y
> CONFIG_BSD_DISKLABEL=y
> CONFIG_MINIX_SUBPARTITION=y
> CONFIG_SOLARIS_X86_PARTITION=y
> CONFIG_UNIXWARE_DISKLABEL=y
> # CONFIG_LDM_PARTITION is not set
> CONFIG_SGI_PARTITION=y
> # CONFIG_ULTRIX_PARTITION is not set
> CONFIG_SUN_PARTITION=y
> # CONFIG_KARMA_PARTITION is not set
> CONFIG_EFI_PARTITION=y
>
> #
> # Native Language Support
> #
> CONFIG_NLS=y
> CONFIG_NLS_DEFAULT="utf8"
> CONFIG_NLS_CODEPAGE_437=y
> CONFIG_NLS_CODEPAGE_737=m
> CONFIG_NLS_CODEPAGE_775=m
> CONFIG_NLS_CODEPAGE_850=m
> CONFIG_NLS_CODEPAGE_852=m
> CONFIG_NLS_CODEPAGE_855=m
> CONFIG_NLS_CODEPAGE_857=m
> CONFIG_NLS_CODEPAGE_860=m
> CONFIG_NLS_CODEPAGE_861=m
> CONFIG_NLS_CODEPAGE_862=m
> CONFIG_NLS_CODEPAGE_863=m
> CONFIG_NLS_CODEPAGE_864=m
> CONFIG_NLS_CODEPAGE_865=m
> CONFIG_NLS_CODEPAGE_866=m
> CONFIG_NLS_CODEPAGE_869=m
> CONFIG_NLS_CODEPAGE_936=m
> CONFIG_NLS_CODEPAGE_950=m
> CONFIG_NLS_CODEPAGE_932=m
> CONFIG_NLS_CODEPAGE_949=m
> CONFIG_NLS_CODEPAGE_874=m
> CONFIG_NLS_ISO8859_8=m
> CONFIG_NLS_CODEPAGE_1250=m
> CONFIG_NLS_CODEPAGE_1251=m
> CONFIG_NLS_ASCII=y
> CONFIG_NLS_ISO8859_1=m
> CONFIG_NLS_ISO8859_2=m
> CONFIG_NLS_ISO8859_3=m
> CONFIG_NLS_ISO8859_4=m
> CONFIG_NLS_ISO8859_5=m
> CONFIG_NLS_ISO8859_6=m
> CONFIG_NLS_ISO8859_7=m
> CONFIG_NLS_ISO8859_9=m
> CONFIG_NLS_ISO8859_13=m
> CONFIG_NLS_ISO8859_14=m
> CONFIG_NLS_ISO8859_15=m
> CONFIG_NLS_KOI8_R=m
> CONFIG_NLS_KOI8_U=m
> CONFIG_NLS_UTF8=m
>
> #
> # Instrumentation Support
> #
> CONFIG_PROFILING=y
> CONFIG_OPROFILE=m
> CONFIG_KPROBES=y
>
> #
> # Kernel hacking
> #
> CONFIG_TRACE_IRQFLAGS_SUPPORT=y
> # CONFIG_PRINTK_TIME is not set
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_UNUSED_SYMBOLS=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_LOG_BUF_SHIFT=17
> CONFIG_DETECT_SOFTLOCKUP=y
> # CONFIG_SCHEDSTATS is not set
> # CONFIG_DEBUG_SLAB is not set
> # CONFIG_DEBUG_RT_MUTEXES is not set
> # CONFIG_RT_MUTEX_TESTER is not set
> CONFIG_DEBUG_SPINLOCK=y
> # CONFIG_DEBUG_MUTEXES is not set
> # CONFIG_DEBUG_RWSEMS is not set
> # CONFIG_DEBUG_LOCK_ALLOC is not set
> # CONFIG_PROVE_LOCKING is not set
> CONFIG_DEBUG_SPINLOCK_SLEEP=y
> # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
> # CONFIG_DEBUG_KOBJECT is not set
> CONFIG_DEBUG_INFO=y
> # CONFIG_DEBUG_FS is not set
> # CONFIG_DEBUG_VM is not set
> # CONFIG_FRAME_POINTER is not set
> # CONFIG_UNWIND_INFO is not set
> CONFIG_FORCED_INLINING=y
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_DEBUG_RODATA is not set
> # CONFIG_IOMMU_DEBUG is not set
> # CONFIG_DEBUG_STACKOVERFLOW is not set
> # CONFIG_DEBUG_STACK_USAGE is not set
>
> #
> # Security options
> #
> CONFIG_KEYS=y
> CONFIG_KEYS_DEBUG_PROC_KEYS=y
> CONFIG_SECURITY=y
> CONFIG_SECURITY_NETWORK=y
> # CONFIG_SECURITY_NETWORK_XFRM is not set
> CONFIG_SECURITY_CAPABILITIES=y
> # CONFIG_SECURITY_ROOTPLUG is not set
> # CONFIG_SECURITY_SECLVL is not set
> CONFIG_SECURITY_SELINUX=y
> CONFIG_SECURITY_SELINUX_BOOTPARAM=y
> CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
> CONFIG_SECURITY_SELINUX_DISABLE=y
> CONFIG_SECURITY_SELINUX_DEVELOP=y
> CONFIG_SECURITY_SELINUX_AVC_STATS=y
> CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
> # CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set
>
> #
> # Cryptographic options
> #
> CONFIG_CRYPTO=y
> CONFIG_CRYPTO_HMAC=y
> CONFIG_CRYPTO_NULL=m
> CONFIG_CRYPTO_MD4=m
> CONFIG_CRYPTO_MD5=y
> CONFIG_CRYPTO_SHA1=y
> CONFIG_CRYPTO_SHA256=m
> CONFIG_CRYPTO_SHA512=m
> CONFIG_CRYPTO_WP512=m
>
# CONFIG_CRYPTO_TGR192 is not set > CONFIG_CRYPTO_DES=m > CONFIG_CRYPTO_BLOWFISH=m > CONFIG_CRYPTO_TWOFISH=m > CONFIG_CRYPTO_SERPENT=m > CONFIG_CRYPTO_AES=m > # CONFIG_CRYPTO_AES_X86_64 is not set > CONFIG_CRYPTO_CAST5=m > CONFIG_CRYPTO_CAST6=m > CONFIG_CRYPTO_TEA=m > CONFIG_CRYPTO_ARC4=m > CONFIG_CRYPTO_KHAZAD=m > # CONFIG_CRYPTO_ANUBIS is not set > CONFIG_CRYPTO_DEFLATE=m > CONFIG_CRYPTO_MICHAEL_MIC=m > CONFIG_CRYPTO_CRC32C=m > # CONFIG_CRYPTO_TEST is not set > > # > # Hardware crypto devices > # > > # > # Library routines > # > CONFIG_CRC_CCITT=m > # CONFIG_CRC16 is not set > CONFIG_CRC32=y > CONFIG_LIBCRC32C=m > CONFIG_ZLIB_INFLATE=y > CONFIG_ZLIB_DEFLATE=m > CONFIG_TEXTSEARCH=y > CONFIG_TEXTSEARCH_KMP=m > CONFIG_PLIST=y From vuhuong at mellanox.com Wed Dec 13 14:57:03 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 13 Dec 2006 14:57:03 -0800 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: <1166049650.10873.9.camel@trinity.ogc.int> References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> <457F426B.7020104@mellanox.com> <1166049650.10873.9.camel@trinity.ogc.int> Message-ID: <4580853F.9070907@mellanox.com> >>> 2. Can you please send me the iozone test parameters your using? >>> >> server has 8GB of mem, client has 2GB of mem >> >> iozone -r 64KB -s 5g -i 0 -i 1 >> and >> iozone -r 64KB -s 2g -i 0 -i 1 -t 3 >> > > Can you please send me the iozone output you get from these commands? Here it is -vu -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: iozone.output URL: From vuhuong at mellanox.com Wed Dec 13 15:02:01 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 13 Dec 2006 15:02:01 -0800 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> <457F426B.7020104@mellanox.com> Message-ID: <45808669.2040602@mellanox.com> James Lentini wrote: > > On Tue, 12 Dec 2006, Vu Pham wrote: > >>>> 2. While some clients run I/Os, one idle client try to access the mount >>>> point ie. *ls* and get I/O input error. I see these error messages on >>>> server log > > Was there anything in the log before this point? I'd expect to see a > message started with "svcrdma: failed to post SQ..." There is no such message in the log before this point From Brian.Cain at ge.com Wed Dec 13 15:15:51 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Wed, 13 Dec 2006 18:15:51 -0500 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEFC0@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Wednesday, December 13, 2006 4:21 PM > To: Cain, Brian (GE Healthcare) > Cc: openib-general at openib.org > Subject: Re: [openib-general] [PATCH] install.sh: Cause less > pain to SRP users who didn't RTFM > > > > + echo '!!WARNING!! SRP is not supported > for 32-bit OS running on 64-bit capable hardware' > > Did I miss something? Why doesn't SRP work with 32-bit userspace on a > 64-bit capable hardware? In fact why doesn't it work with 32-bit > userspace on a 64-bit kernel? AFAICT, it's not a userspace/kernel issue, it's a hardware capability/OS target issue. >From srp_release_notes.txt: ~~~~~ ======================================================================== ====== 11. 
Known Issues ======================================================================== ====== - SRP is not supported on a 32-bit operating system running on a 64-bit platform. ~~~~~ Maybe the tests in the patch aren't appropriate for detecting this case, but it looked right to me. -BRian From sashak at voltaire.com Wed Dec 13 15:26:38 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 14 Dec 2006 01:26:38 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061212122957.GC14622@mellanox.co.il> References: <20061210225613.GF21155@sashak.voltaire.com> <20061212122957.GC14622@mellanox.co.il> Message-ID: <20061213232638.GC14186@sashak.voltaire.com> On 14:29 Tue 12 Dec , Michael S. Tsirkin wrote: > > For me it is unclear yet how long we may need this - 1.1 still be in > > SVN yet, and 1.1 git branch is updated there. > > By the way, one can't actually build OFED 1.1 userspace from git > because OFED also applies some patches after checking things out > from svn. They are here: > https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes I guess those patches should be committed in 1.1 svn branch (and imported to git's 1.1). Any reason why it is not committed? Sasha From rdreier at cisco.com Wed Dec 13 15:47:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 15:47:31 -0800 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEFC0@CINMLVEM11.e2k.ad.ge.com> (Brian Cain's message of "Wed, 13 Dec 2006 18:15:51 -0500") References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEFC0@CINMLVEM11.e2k.ad.ge.com> Message-ID: > - SRP is not supported on a 32-bit operating system running on a 64-bit > platform. Hmm, who wrote the release notes, and why was that put in? I don't know of any reason why the SRP initiator wouldn't work in a mixed 32-bit userspace / 64-bit kernel environment. - R. 
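The two cases being discussed in this thread (a 32-bit kernel on 64-bit capable hardware, versus 32-bit userspace on a 64-bit kernel) can be told apart in shell roughly as follows. This is only a sketch: classify_platform is a hypothetical helper, not part of install.sh, and it assumes an x86 system where the "lm" flag in /proc/cpuinfo marks 64-bit capability and where `uname -m` reflects the running kernel.

```shell
#!/bin/sh
# classify_platform is a hypothetical helper, not part of install.sh.
# It takes its inputs as arguments so the logic can be checked without
# depending on the machine it runs on:
#   $1 = path to a cpuinfo file (normally /proc/cpuinfo)
#   $2 = kernel machine string  (normally the output of `uname -m`)
classify_platform() {
    cpuinfo=$1
    machine=$2

    # The "lm" (long mode) flag marks an x86 CPU that can run 64-bit code.
    if grep -Eq '^flags[[:space:]]*:.*[[:space:]]lm([[:space:]]|$)' "$cpuinfo"; then
        cpu64=yes
    else
        cpu64=no
    fi

    case "$machine" in
        x86_64)
            echo "64-bit kernel" ;;
        i?86)
            if [ "$cpu64" = yes ]; then
                echo "32-bit kernel on 64-bit capable CPU"
            else
                echo "32-bit kernel on 32-bit CPU"
            fi ;;
        *)
            echo "unknown" ;;
    esac
}

# Typical use on a live system:
#   classify_platform /proc/cpuinfo "$(uname -m)"
```

Because `uname -m` describes the kernel, a 32-bit userspace on a 64-bit kernel would normally still report x86_64 here (unless the shell runs under a linux32 personality), so a check of this kind only flags the 32-bit kernel case.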
From vuhuong at mellanox.com Wed Dec 13 15:50:10 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 13 Dec 2006 15:50:10 -0800 Subject: [openib-general] nfsrdma release 7 issues, In-Reply-To: <4580853F.9070907@mellanox.com> References: <457F34B3.9060402@mellanox.com> <1165966574.8722.110.camel@trinity.ogc.int> <457F426B.7020104@mellanox.com> <1166049650.10873.9.camel@trinity.ogc.int> <4580853F.9070907@mellanox.com> Message-ID: <458091B2.1030905@mellanox.com> Tom, Here is the iozone output with same hw configuration; however, now server is running nfsrdma release 6, client is still running nfsrdma release 7 -vu > >>>> 2. Can you please send me the iozone test parameters your using? >>>> >>> server has 8GB of mem, client has 2GB of mem >>> >>> iozone -r 64KB -s 5g -i 0 -i 1 >>> and >>> iozone -r 64KB -s 2g -i 0 -i 1 -t 3 >>> >> >> Can you please send me the iozone output you get from these commands? > > Here it is > > -vu > > > > > ------------------------------------------------------------------------ > > [root at ibd001 ~]# cat /proc/meminfo > MemTotal: 2056688 kB > MemFree: 1851248 kB > Buffers: 12644 kB > Cached: 91764 kB > SwapCached: 0 kB > Active: 69400 kB > Inactive: 76536 kB > HighTotal: 0 kB > HighFree: 0 kB > LowTotal: 2056688 kB > LowFree: 1851248 kB > SwapTotal: 4192924 kB > SwapFree: 4192924 kB > Dirty: 1048 kB > Writeback: 4 kB > AnonPages: 41584 kB > Mapped: 6968 kB > Slab: 26760 kB > PageTables: 2072 kB > NFS_Unstable: 0 kB > Bounce: 0 kB > CommitLimit: 5221268 kB > Committed_AS: 71812 kB > VmallocTotal: 34359738367 kB > VmallocUsed: 2500 kB > VmallocChunk: 34359735671 kB > HugePages_Total: 0 > HugePages_Free: 0 > HugePages_Rsvd: 0 > Hugepagesize: 2048 kB > [root at ibd001 ~]# . 
/etc/nfsrdma-v7 > Doing nfs/rdma mount to 193.168.13.202, mount protocol to 193.168.13.202 > [root at ibd001 ~]# > [root at ibd001 ~]# > [root at ibd001 ~]# cd /vol-202 > [root at ibd001 vol-202]# iozone -r 64KB -s 5g -i 0 -i 1 > Iozone: Performance Test of File I/O > Version $Revision: 3.263 $ > Compiled for 32 bit mode. > Build: linux > > Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins > Al Slater, Scott Rhine, Mike Wisner, Ken Goss > Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, > Randy Dunlap, Mark Montague, Dan Million, > Jean-Marc Zucconi, Jeff Blomberg, > Erik Habbinga, Kris Strecker, Walter Wong. > > Run began: Wed Dec 13 14:36:18 2006 > > Record Size 64 KB > File size set to 5242880 KB > Command line used: iozone -r 64KB -s 5g -i 0 -i 1 > Output is in Kbytes/sec > Time Resolution = 0.000001 seconds. > Processor cache size set to 1024 Kbytes. > Processor cache line size set to 32 bytes. > File stride size set to 17 * record size. > random random bkwd record stride > KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread > 5242880 64 179970 257954 441693 485204 > > iozone test complete. > [root at ibd001 vol-202]# > [root at ibd001 vol-202]# iozone -r 64KB -s 2g -i 0 -i 1 -t 3 > Iozone: Performance Test of File I/O > Version $Revision: 3.263 $ > Compiled for 32 bit mode. > Build: linux > > Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins > Al Slater, Scott Rhine, Mike Wisner, Ken Goss > Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR, > Randy Dunlap, Mark Montague, Dan Million, > Jean-Marc Zucconi, Jeff Blomberg, > Erik Habbinga, Kris Strecker, Walter Wong. > > Run began: Wed Dec 13 14:39:41 2006 > > Record Size 64 KB > File size set to 2097152 KB > Command line used: iozone -r 64KB -s 2g -i 0 -i 1 -t 3 > Output is in Kbytes/sec > Time Resolution = 0.000001 seconds. > Processor cache size set to 1024 Kbytes. > Processor cache line size set to 32 bytes. 
> File stride size set to 17 * record size. > Throughput test with 3 processes > Each process writes a 2097152 Kbyte file in 64 Kbyte records > > Children see throughput for 3 initial writers = 220949.31 KB/sec > Parent sees throughput for 3 initial writers = 204066.05 KB/sec > Min throughput per process = 68142.53 KB/sec > Max throughput per process = 82785.59 KB/sec > Avg throughput per process = 73649.77 KB/sec > Min xfer = 1971712.00 KB > > Children see throughput for 3 rewriters = 307993.49 KB/sec > Parent sees throughput for 3 rewriters = 293288.28 KB/sec > Min throughput per process = 92883.50 KB/sec > Max throughput per process = 119024.17 KB/sec > Avg throughput per process = 102664.50 KB/sec > Min xfer = 1799616.00 KB > > Children see throughput for 3 readers = 423371.39 KB/sec > Parent sees throughput for 3 readers = 423168.28 KB/sec > Min throughput per process = 139781.50 KB/sec > Max throughput per process = 142646.52 KB/sec > Avg throughput per process = 141123.80 KB/sec > Min xfer = 2055232.00 KB > > Children see throughput for 3 re-readers = 447745.98 KB/sec > Parent sees throughput for 3 re-readers = 447678.57 KB/sec > Min throughput per process = 148235.48 KB/sec > Max throughput per process = 149965.86 KB/sec > Avg throughput per process = 149248.66 KB/sec > Min xfer = 2072512.00 KB > > > > iozone test complete. > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: iozone.v6.output URL: From sweitzen at cisco.com Wed Dec 13 16:14:45 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Wed, 13 Dec 2006 16:14:45 -0800 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM Message-ID: What problem are you seeing? We have tested SRP on 32-bit SLES10 running on 64-bit Opteron hardware. 
Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Cain, > Brian (GE Healthcare) > Sent: Wednesday, December 13, 2006 2:09 PM > To: openib-general at openib.org > Subject: [openib-general] [PATCH] install.sh: Cause less pain > to SRP users who didn't RTFM > > There's gotta be a good way to let people know they're going down the > wrong path on this one. > > Signed-off-by: Brian Cain > > --- ofed/openib/scripts/install.sh 2006-12-13 14:48:51.747995000 > -0700 > +++ ofed_fix/openib/scripts/install.sh 2006-12-13 14:59:00.586574000 > -0700 > @@ -1070,6 +1070,14 @@ > echo "# Load SDP module" >> > ${IB_CONF_DIR}/openib.conf > echo "# SDP_LOAD=no" >> > ${IB_CONF_DIR}/openib.conf > fi > + > + > + if [[ "$srp" == "y" || "$srp_target" == "y" ]] && > + [[ $(egrep 'flags.*lm' /proc/cpuinfo | wc > -l) > 0 ]] > && > + [[ $(uname -p | egrep 'i[3-9]86' | wc -l) > 0 ]]; > then > + echo '!!WARNING!! SRP is not supported > for 32-bit OS > running on 64-bit capable hardware' > + fi > + > > if [ "$srp" == "y" ]; then > echo >> ${IB_CONF_DIR}/openib.conf > > -- > -Brian > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From akepner at sgi.com Wed Dec 13 16:29:52 2006 From: akepner at sgi.com (akepner at sgi.com) Date: Wed, 13 Dec 2006 16:29:52 -0800 (PST) Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race Message-ID: It appears that there are races between DMA and CQ updates which can result in incorrect behavior when CQs are allocated in user-space (via libibverbs). This problem affects Altix in particular, though it may exist on other platforms as well. 
(We haven't really seen this particular bug yet but, based on previous experience, it's something that we expect to be manifested on large NUMA systems.) Description of the race ----------------------- On a system such as Altix, that supports "posted DMA", DMA may complete out of order. (This is due to possible reordering within the NUMA-interconnect. So it's not a PCI reordering that's being described here.) For example, if an HCA does a DMA write to host memory and then updates a corresponding CQE, it's possible for the CQE update to be visible before the DMA has actually completed. There are a couple of mechanisms to ensure synchronization. Either: 1) an interrupt, or 2) a write to a "consistently" (coherently) mapped DMA address will flush in-flight DMA. When the CQ is allocated by the device driver, mechanism 2) will prevent the race since "dma_alloc_coherent()" is used there. But when the CQ allocation is done in user space (via libibverbs) there's no protection. So what to do? ------------- Obviously mechanism 1), generating an interrupt, is not the right solution for performance reasons. One proposal is to add a kernel API that enables "coherent memory" allocation (via the in-kernel DMA interface) from user-space. Then, CQs, e.g., could be allocated via this interface, and the race could be avoided. Any other ideas? -- Arthur From yhkim93 at keti.re.kr Wed Dec 13 17:16:18 2006 From: yhkim93 at keti.re.kr (김영환) Date: Thu, 14 Dec 2006 10:16:18 +0900 Subject: [openib-general] booting problem after cross compile to ppc in infiniband source of linux-2.6.19 In-Reply-To: Message-ID: <20061214011631.656303B0006@sentry-two.sandia.gov> I am building an InfiniBand storage system based on PPC, using an AMCC 440SPe Yucca board. I have cross-compiled the InfiniBand source for PPC, and I applied a patch because of a shortage of coherent DMA memory. But after compiling the patched kernel source, the following error text appeared at boot. What is the problem?
=========================================================================== Waiting for PHY auto negotiation to complete... done ENET Speed is 1000 Mbps - FULL duplex connection Using ppc_4xx_eth0 device TFTP from server 192.168.1.1; our IP address is 192.168.1.10 Filename 'yucca/uImage'. Load address: 0x200000 Loading: T ################################################################# ################################################################# ################################################################# ######################################################### done Bytes transferred = 1289218 (13ac02 hex) ## Booting image at 00200000 ... Image Name: Linux-2.6.19 Image Type: PowerPC Linux Kernel Image (gzip compressed) Data Size: 1289154 Bytes = 1.2 MB Load Address: 00000000 Entry Point: 00000000 Verifying Checksum ... OK Uncompressing Kernel Image ... OK Linux version 2.6.19 (root at yhkim-devpc) (gcc version 4.0.0) #14 Thu Dec 14 09:43:16 KST 2006 PCIE:1 successfully set as rootpoint vendor-id 0xaaa1 device-id 0xbed1 Yucca port (Roland Dreier ) Zone PFN ranges: DMA 0 -> 196608 Normal 196608 -> 196608 early_node_map[1] active PFN ranges 0: 0 -> 196608 Built 1 zonelists. 
Total pages: 195072 Kernel command line: root=/dev/nfs rw nfsroot=192.168.1.1:/tftpboot/yucca/ppc_4xx ip=192.168.1.10:192.168.1.1::255.250PID hash table entries: 4096 (order: 12, 16384 bytes) Dentry cache hash table entries: 131072 (order: 7, 524288 bytes) Inode-cache hash table entries: 65536 (order: 6, 262144 bytes) Memory: 776704k available (1976k kernel code, 612k data, 124k init, 0k highmem) Mount-cache hash table entries: 512 NET: Registered protocol family 16 PCI: Probing PCI hardware NET: Registered protocol family 2 IP route cache hash table entries: 32768 (order: 5, 131072 bytes) TCP established hash table entries: 131072 (order: 7, 524288 bytes) TCP bind hash table entries: 65536 (order: 6, 262144 bytes) TCP: Hash tables configured (established 131072 bind 65536) TCP reno registered io scheduler noop registered io scheduler anticipatory registered (default) io scheduler deadline registered io scheduler cfq registered Generic RTC Driver v1.07 Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled serial8250: ttyS0 at MMIO 0x0 (irq = 0) is a 16550A serial8250: ttyS1 at MMIO 0x0 (irq = 1) is a 16550A serial8250: ttyS2 at MMIO 0x0 (irq = 37) is a 16550A RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize PPC 4xx OCP EMAC driver, version 3.54 mal0: initialized, 1 TX channels, 1 RX channels eth0: emac0, MAC 00:04:ac:01:ca:fe eth0: found CIS8201 Gigabit Ethernet PHY (0x01) IBM IIC driver v2.1 ibm-iic0: using standard (100 kHz) mode ibm-iic1: using standard (100 kHz) mode ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0001:01:01.0 ib_mthca 0001:01:01.0: NOP command failed to generate interrupt (IRQ 100), aborting. ib_mthca 0001:01:01.0: BIOS or ACPI interrupt routing problem? 
ib_mthca: probe of 0001:01:01.0 failed with error -16 TCP cubic registered NET: Registered protocol family 1 NET: Registered protocol family 17 eth0: link is up, 1000 FDX IP-Config: Complete: device=eth0, addr=192.168.1.10, mask=255.255.255.0, gw=255.255.255.255, host=yucca, domain=, nis-domain=(none), bootserver=192.168.1.1, rootserver=192.168.1.1, rootpath= Looking up port of RPC 100003/2 on 192.168.1.1 Looking up port of RPC 100005/1 on 192.168.1.1 VFS: Mounted root (nfs filesystem). Freeing unused kernel memory: 124k init -------------- next part -------------- An HTML attachment was scrubbed... URL: From Brian.Cain at ge.com Wed Dec 13 17:45:31 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Wed, 13 Dec 2006 20:45:31 -0500 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] > Sent: Wednesday, December 13, 2006 6:15 PM > To: Cain, Brian (GE Healthcare); openib-general at openib.org > Subject: RE: [openib-general] [PATCH] install.sh: Cause less > pain to SRP users who didn't RTFM > > What problem are you seeing? We have tested SRP on 32-bit SLES10 > running on 64-bit Opteron hardware. We seem to get panics during multithreaded IO on the initiator. The panics don't seem to always point to any SRP code, but might be a symptom of memory corruption. It only seems to show up on 32-bit kernels. Our distro is a derivative of Fedora. There were a few more things we wanted to consider, but we stopped debugging when we saw an indication in the release notes that it's not a supported configuration. 
-Brian From rdreier at cisco.com Wed Dec 13 18:23:39 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 18:23:39 -0800 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com> (Brian Cain's message of "Wed, 13 Dec 2006 20:45:31 -0500") References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com> Message-ID: > We seem to get panics during multithreaded IO on the initiator. The > panics don't seem to always point to any SRP code, but might be a > symptom of memory corruption. It only seems to show up on 32-bit > kernels. Our distro is a derivative of Fedora. There were a few more > things we wanted to consider, but we stopped debugging when we saw an > indication in the release notes that it's not a supported configuration. I'm not sure who declared it "unsupported" and I would really like to know what issue(s) led to that declaration. Your report is the first I've heard of anything like this, and I have to say that it seems pretty implausible that running a 32-bit kernel on 64-bit-capable hardware would be the source of problems -- if there is an issue then I would expect it to be something to do with the 32-bit kernel. In any case I definitely consider 32-bit kernels as something I support, so if you could post a real bug report (what specific kernel version, if you are running out-of-tree drivers (like OFED), host server details, SRP target details, how to reproduce, etc) for your problems with 32-bit kernels then I will try to debug things. - R. 
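The details requested above could be gathered with something like the following. It is a minimal sketch, not an official reporting tool: collect_report is a hypothetical helper, the choice of fields simply follows the list in the message, and it assumes the SRP initiator module is named ib_srp.

```shell
#!/bin/sh
# collect_report: hypothetical helper that prints basic facts for a bug report.
collect_report() {
    echo "kernel:  $(uname -r)"
    echo "machine: $(uname -m)"

    # The path ib_srp would be loaded from hints at in-tree vs. out-of-tree
    # (e.g. OFED) drivers. modinfo may be missing, or the module may not be
    # installed, so both cases are reported rather than treated as errors.
    if command -v modinfo >/dev/null 2>&1; then
        path=$(modinfo -F filename ib_srp 2>/dev/null) || path=""
        echo "ib_srp:  ${path:-not found}"
    else
        echo "ib_srp:  modinfo not available"
    fi
}

# Typical use:
#   collect_report > srp-report.txt
```

Host server details, SRP target details and the steps to reproduce still have to be written up by hand.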
From rdreier at cisco.com Wed Dec 13 18:40:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 18:40:33 -0800 Subject: [openib-general] booting problem after cross compile to ppc in infiniband source of linux-2.6.19 In-Reply-To: <5bg6vi$345u6o@sj-inbound-f.cisco.com> ( =?iso-8859-1?Q?=B1=E8?= =?iso-8859-1?Q?=BF=B5=C8=AF?= Message-ID: What kernel are you running? I can't find the messages: > PCIE:1 successfully set as rootpoint > vendor-id 0xaaa1 > device-id 0xbed1 anywhere in my kernel sources. Also, it seems you have your HCA in PCIE slot 1. And you don't seem to have any other PCI Express cards installed. Is that correct? If that is correct, then IRQ 100 should be the right IRQ, so I can't explain why you would see > ib_mthca 0001:01:01.0: NOP command failed to generate interrupt (IRQ 100), aborting. Do you have any other PCI Express cards you can try? Do they work? What revision of Yucca/440SPe do you have? I only have used a rev A CPU, although I am getting my Yucca board reworked with a rev B CPU this week. - R. From k_mahesh85 at yahoo.co.in Wed Dec 13 19:49:56 2006 From: k_mahesh85 at yahoo.co.in (keshetti mahesh) Date: Thu, 14 Dec 2006 03:49:56 +0000 (GMT) Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver In-Reply-To: <1166010208.28709.59772.camel@hal.voltaire.com> Message-ID: <2875.47466.qm@web8317.mail.in.yahoo.com> thanks for your reply, >The driver is needed to obtain the information for the IB node to fill >in the MADs for response to the SMA query. It may also issue some traps. >Similarly for PMA as well. Do u mean to say that HCA driver is needed to pass the HCA related information (like GID,GUID, port_info etc..) to the SMA so that it can reply to query(or GET ) MADs. Isn't SMA capable of doing the same by using "query_(gid,pkey,port)" verbs. And final questions if it is really required to implement 'process_mad' in HCA driver then why it is not specified in the IB specifications. 
Whose duty is this (replying to query MADs) according to the IB specs (it is the SMA's duty, right?) I have observed that process_mad is not implemented in IBM's eHCA driver. What is the case with it? PS: I am considering only the SMA in the host software here. regards, K.Mahesh. Hal Rosenstock wrote: On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote: > Hello all, > > I want to know from you people whether it is necessary to implement > process_mad for an HCA. > > After looking into the implementations of process_mad in the ipath and > mthca drivers I have found that they are used to reply to the MADs with > port_info, gid_info, sm_info etc. > > But isn't that handled by the SMA in the host? The SMA can either be in the host or in firmware (as is typical with the Mellanox silicon). > I am a little bit confused now. > Please just tell me whether it is required to implement process_mad (suppose) > for a new HCA driver. It is. For an example of a host (software SMA), see drivers/infiniband/hw/ipath/ipath_mad.c > if it is required why? The driver is needed to obtain the information for the IB node to fill in the MADs for response to the SMA query. It may also issue some traps. Similarly for PMA as well. -- Hal > Please CC your replies to me. > > regards, > K.Mahesh. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mst at mellanox.co.il Wed Dec 13 22:04:57 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 08:04:57 +0200 Subject: [openib-general] [PATCH] mthca: move code from post send to post receive In-Reply-To: References: Message-ID: <20061214060457.GG1689@mellanox.co.il> > > While unlikely to give a large gain, this makes sense to me. > > Out of curiosity -- can you measure any difference at all with this? > I would have guessed that the addition can be scheduled so that it > costs nothing at all on any common CPU. I didn't actually try to measure it. But maybe it will all add up with time as small tuning adjustments are done. > I guess it doesn't hurt though. Want to make a similar patch for libmthca? Sure. -- MST From mst at mellanox.co.il Wed Dec 13 22:19:51 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 08:19:51 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061213232638.GC14186@sashak.voltaire.com> References: <20061213232638.GC14186@sashak.voltaire.com> Message-ID: <20061214061951.GH1689@mellanox.co.il> > > > For me it is unclear yet how long we may need this - 1.1 still be in > > > SVN yet, and 1.1 git branch is updated there. > > > > By the way, one can't actually build OFED 1.1 userspace from git > > because OFED also applies some patches after checking things out > > from svn. They are here: > > https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes > > I guess those patches should be committed in 1.1 svn branch (and imported > to git's 1.1). This could be done, but why invest the time? And once we do touch the branch, who will test that the thing you pull from there even works? I would say that if you really want to mirror the OFED branch, and make it buildable to some extent, the way to do this would be to have a single git tree with all of OFED - patches, scripts and all.
Oh, by the way, some tools in OFED tried to read an svn version in their code; this wouldn't work on git. And I don't see git trees for a lot of OFED bits - look at https://openib.org/svn/gen2/branches/1.1/ofed/ What I am trying to say is, let's just keep SVN around and do OFED 1.1 maintenance there. You can't fix the history. > Any reason why it is not committed? This was discussed before OFED 1.1 and seems to have worked well so far. We tried to keep our modifications to upstream as separate as possible - this made transition to upstream in OFED 1.2 very easy as it was trivial to check what was applied and what wasn't. -- MST From yhkim93 at keti.re.kr Wed Dec 13 22:49:28 2006 From: yhkim93 at keti.re.kr (김영환) Date: Thu, 14 Dec 2006 15:49:28 +0900 Subject: [openib-general] booting problem after cross compile to ppc in infiniband source of linux-2.6.19 In-Reply-To: Message-ID: <20061214065005.99D203B0028@sentry-two.sandia.gov> I used the linux-2.6.19 kernel supported by AMCC. I installed only the HCA, in PCIE slot 1. I don't have any other PCIE adapter, but I will try to test with another one. I don't know my exact CPU type; the chip is marked only "440SPe rev 2.0, p/n PPC440SPe-RGB533C". In that case, what do I need in order to build the official linux-2.6.19 release source? Thank you for your help. -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Thursday, December 14, 2006 11:41 AM To: 김영환 Cc: openib-general at openib.org Subject: Re: booting problem after cross compile to ppc in infiniband source of linux-2.6.19 What kernel are you running? I can't find the messages: > PCIE:1 successfully set as rootpoint > vendor-id 0xaaa1 > device-id 0xbed1 anywhere in my kernel sources. Also, it seems you have your HCA in PCIE slot 1. And you don't seem to have any other PCI Express cards installed. Is that correct?
If that is correct, then IRQ 100 should be the right IRQ, so I can't explain why you would see > ib_mthca 0001:01:01.0: NOP command failed to generate interrupt (IRQ 100), aborting. Do you have any other PCI Express cards you can try? Do they work? What revision of Yucca/440SPe do you have? I only have used a rev A CPU, although I am getting my Yucca board reworked with a rev B CPU this week. - R. From eitan at sw053.yok.mtl.com Wed Dec 13 23:11:11 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Thu, 14 Dec 2006 09:11:11 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion Message-ID: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
OSM Simulation Regression Summary
OpenSM rev = ____ ibutils rev = ____
Total=264 Pass=261 Fail=3

Pass:
 36 Stability IS1-16.topo
 36 Pkey IS1-16.topo
 36 Multicast IS1-16.topo
 36 LidMgr IS1-16.topo
 35 OsmStress IS1-16.topo
 12 Stability IS3-loop.topo
 12 Stability IS3-128.topo
 12 Pkey IS3-128.topo
 12 OsmStress IS3-128.topo
 12 Multicast IS3-loop.topo
 11 Multicast IS3-128.topo
 11 LidMgr IS3-128.topo

Failures:
 1 OsmStress IS1-16.topo
 1 Multicast IS3-128.topo
 1 LidMgr IS3-128.topo

From rdreier at cisco.com Wed Dec 13 23:32:49 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 23:32:49 -0800 Subject: [openib-general] booting problem after cross compile to ppc in infiniband source of linux-2.6.19 In-Reply-To: <5g725m$m2afa@sj-inbound-a.cisco.com> ( =?iso-8859-1?Q?=EA=B9?= =?iso-8859-1?Q?=80=EC=98=81=ED=99=98?= Message-ID: > I used the linux-2.6.19 kernel supported by AMCC. I installed only the HCA, in PCIE slot 1. > I don't have any other PCIE adapter, but I will try to test with another one. I don't know my exact CPU type; the chip is marked only "440SPe rev 2.0, p/n PPC440SPe-RGB533C". In that case, what do I need in order to build the official linux-2.6.19 release source? If AMCC is supporting your kernel you might want to ask them for help. The part number "RGB" means you have a rev B part (the "B").
You can also look in /proc/cpuinfo to see the PVR to find out what chip you have. To use the standard 2.6.19 kernel on a rev B chip you need the patches I posted to linuxppc-embedded a few weeks ago. Otherwise PCIe won't work. The patches aren't merged upstream yet because they still need some cleanup. - R. From rdreier at cisco.com Wed Dec 13 23:41:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 13 Dec 2006 23:41:41 -0800 Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race In-Reply-To: ( akepner@sgi.com's message of "Wed, 13 Dec 2006 16:29:52 -0800 (PST)") References: Message-ID: Are there other possible ordering problems involving user memory (not in a CQ or QP)? Something like a CPU on node A writing to memory on node B and then posting a work request that makes the HCA DMA from that memory on node B, and having the work request doorbell reach the HCA before the write to node B actually happens, so the HCA DMAs the old contents of node B's memory? I guess the only feasible solution to the problem you're pointing out is to have libmthca use some special mmap()-based allocator for queues so that the kernel can give it memory that has the special dma_map_consistent treatment. Ugh. - R. From ogerlitz at voltaire.com Thu Dec 14 00:19:13 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 14 Dec 2006 10:19:13 +0200 Subject: [openib-general] [was: GIT PULL] In-Reply-To: References: <457FB82B.4090902@voltaire.com> Message-ID: <45810901.3090209@voltaire.com> Roland Dreier wrote: >> you have CC-ed lkml at cisco.com on this email, is there a chance you >> wanted to CC linux-kernel at vger.kernel.org instead ... > Yep, a typo caused by my auto-expand not triggering. No big deal though... Indeed, I see now that Linus has pulled it >> May i ask what prevented the v3 of the mthca profile patch (see >> http://article.gmane.org/gmane.linux.drivers.openib/34005) to get in? > The patch as posted is both ugly and wrong. 
I still plan to fix it up > and merge it for 2.6.20, but I didn't get a chance yet. mmm, I understand all the comments raised during the review were fixed in the V3 post below, and now you say its both wrong and ugly... for example what's wrong here? Or. > Adds module parameters that enable settting some of the HCA > profile values > Signed-off-by: Leonid Arsh > Signed-off-by: Moni Shoua > --- > mthca_main.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 104 insertions(+), 11 deletions(-) > diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c > index 47ea021..deb0289 100644 > --- a/drivers/infiniband/hw/mthca/mthca_main.c > +++ b/drivers/infiniband/hw/mthca/mthca_main.c > @@ -82,21 +82,110 @@ MODULE_PARM_DESC(tune_pci, "increase PCI > > struct mutex mthca_device_mutex; > > +#define MTHCA_DEFAULT_NUM_QP (1 << 16) > +#define MTHCA_DEFAULT_RDB_PER_QP (1 << 2) > +#define MTHCA_DEFAULT_NUM_CQ (1 << 16) > +#define MTHCA_DEFAULT_NUM_MCG (1 << 13) > +#define MTHCA_DEFAULT_NUM_MPT (1 << 17) > +#define MTHCA_DEFAULT_NUM_MTT (1 << 20) > +#define MTHCA_DEFAULT_NUM_UDAV (1 << 15) > +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18) > +#define MTHCA_DEFAULT_NUM_UARC_SIZE (1 << 18) > + > +static struct mthca_profile default_profile = { > + .num_qp = MTHCA_DEFAULT_NUM_QP, > + .rdb_per_qp = MTHCA_DEFAULT_RDB_PER_QP, > + .num_cq = MTHCA_DEFAULT_NUM_CQ, > + .num_mcg = MTHCA_DEFAULT_NUM_MCG, > + .num_mpt = MTHCA_DEFAULT_NUM_MPT, > + .num_mtt = MTHCA_DEFAULT_NUM_MTT, > + .num_udav = MTHCA_DEFAULT_NUM_UDAV, /* Tavor only */ > + .fmr_reserved_mtts = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */ > + .uarc_size = MTHCA_DEFAULT_NUM_UARC_SIZE, /* Arbel only */ > +}; > + > +module_param_named(num_qp, default_profile.num_qp, int, 0444); > +MODULE_PARM_DESC(num_qp, "maximum number of available QPs per HCA"); > + > +module_param_named(rdb_per_qp, default_profile.rdb_per_qp, int, 0444); > 
+MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); > + > +module_param_named(num_cq, default_profile.num_cq, int, 0444); > +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); > + > +module_param_named(num_mcg, default_profile.num_mcg, int, 0444); > +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); > + > +module_param_named(num_mpt, default_profile.num_mpt, int, 0444); > +MODULE_PARM_DESC(num_mpt, > + "maximum number of memory protection pable entries per HCA"); > + > +module_param_named(num_mtt, default_profile.num_mtt, int, 0444); > +MODULE_PARM_DESC(num_mtt, > + "maximum number of memory translation table segments per HCA"); > +/* Tavor only */ > +module_param_named(num_udav, default_profile.num_udav, int, 0444); > +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA"); > + > +/* Tavor only */ > +module_param_named(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, int, 0444); > +MODULE_PARM_DESC(fmr_reserved_mtts, > + "number of memory translation table segments reserved for FMR"); > + > static const char mthca_version[] __devinitdata = > DRV_NAME ": Mellanox InfiniBand HCA driver v" > DRV_VERSION " (" DRV_RELDATE ")\n"; > > -static struct mthca_profile default_profile = { > - .num_qp = 1 << 16, > - .rdb_per_qp = 4, > - .num_cq = 1 << 16, > - .num_mcg = 1 << 13, > - .num_mpt = 1 << 17, > - .num_mtt = 1 << 20, > - .num_udav = 1 << 15, /* Tavor only */ > - .fmr_reserved_mtts = 1 << 18, /* Tavor only */ > - .uarc_size = 1 << 18, /* Arbel only */ > -}; > + > +static int __devinit mthca_check_profile_value(int* pval, int pval_default){ > + /* value must be positive and power of 2 */ > + int old_pval = *pval; > + > + if (old_pval <= 0) > + *pval = pval_default; > + else > + *pval = roundup_pow_of_two(old_pval); > + > + return old_pval-*pval; > +} > + > +#define mthca_check_profile_and_warn(name, var, defval) \ > + if (mthca_check_profile_value(&var, defval)) \ > + mthca_warn(mdev, "invalid %s passed. 
changed to %d.\n", #name, var); > + > +static int __devinit mthca_validate_profile(struct mthca_dev *mdev, > + struct mthca_profile *profile) > +{ > + > + mthca_check_profile_and_warn(num_qp, default_profile.num_qp, > + MTHCA_DEFAULT_NUM_QP); > + mthca_check_profile_and_warn(rdb_per_qp, default_profile.rdb_per_qp, > + MTHCA_DEFAULT_RDB_PER_QP); > + mthca_check_profile_and_warn(num_cq, default_profile.num_cq, > + MTHCA_DEFAULT_NUM_CQ); > + mthca_check_profile_and_warn(num_mcg, default_profile.num_mcg, > + MTHCA_DEFAULT_NUM_MCG); > + mthca_check_profile_and_warn(num_mpt, default_profile.num_mpt, > + MTHCA_DEFAULT_NUM_MPT); > + mthca_check_profile_and_warn(num_mtt, default_profile.num_mtt, > + MTHCA_DEFAULT_NUM_MTT); > + > + if (!mthca_is_memfree(mdev)) { > + mthca_check_profile_and_warn(num_udav, default_profile.num_udav, > + MTHCA_DEFAULT_NUM_UDAV); > + mthca_check_profile_and_warn(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, > + MTHCA_DEFAULT_NUM_RESERVED_MTTS); > + > + if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) { > + mthca_err(mdev, "Invalid fmr_reserved_mtts parameter" > + "value (%d). Must be lower then num_mtt (%d)\n", > + default_profile.fmr_reserved_mtts, > + default_profile.num_mtt ); > + return -EINVAL; > + } > + } > + return 0; > +} > > static int __devinit mthca_tune_pci(struct mthca_dev *mdev) > { > @@ -1084,6 +1173,10 @@ static int __mthca_init_one(struct pci_d > if (err) > goto err_cmd; > > + err = mthca_validate_profile(mdev, &default_profile); > + if (err) > + goto err_cmd; > + > err = mthca_init_hca(mdev); > if (err) > goto err_cmd; From yosefe at voltaire.com Thu Dec 14 02:19:16 2006 From: yosefe at voltaire.com (Yosef Etigin) Date: Thu, 14 Dec 2006 12:19:16 +0200 Subject: [openib-general] ofed backports update In-Reply-To: <20061211144813.GA15870@mellanox.co.il> References: <20061211144813.GA15870@mellanox.co.il> Message-ID: <1166091556.926.17.camel@muscida> On Mon, 2006-12-11 at 16:48 +0200, Michael S. 
Tsirkin wrote: > Here's a small update on OFED 1.2 backports. This describes a change > I did a couple of weeks ago but never got to documenting. > NOTE: This info is relevant only for people developing OFED kernel code, > everything is transparent for others. > > NOTE: This is by *no means* a comprehensive writeup of OFED build process - > just a small update for people familiar with development in OFED 1.1. > > Background: > OFED 1.1 did all backports by applying patches under > kernel_patches/backports// directory. > To back-port a package, you just stuck a patch there > and one OFED detected an appropriate kernel, it was applied before build. > In many cases - where the kernel we are back-porting to was simply > missing some macro - what patch actually did was just add a file > under the include directory, and OFED build scripts knew to pick > these up before standard linux includes. > Managing these became somewhat of a pain as it is often hard to > see the history of a patch: try git diff on a patch that sits in git tree > and see what I mean. > > Update: > So for OFED 1.2 I've created a new directory kernel_addons, and converted > all patches that created new files to plain files under the relevant > kernel directory. OFED scripts now look there for files before standard > Linux headers. > For an example, look at how backport to 2.6.18 looks: > http://staging.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=tree;f=kernel_addons/backport/2.6.18/include/linux;h=5eabed1f98596f92ce149dae65c4ab1ceb1d6a67;hb=HEAD > Unfortunately, not all patches are of this form - some really tweak source > inside the infiniband subtree - but we can strive to reduce the number of this > and in this way make maintaining backports more of a seamless process. > > Bottom line > There are now 2 mechanisms for back-porting in OFED: > - if you want to add a kernel-specific file, stick it under > kernel_addons/backport//. 
> - if you must change an existing file depending on kernel version, stick > a patch in kernel_patches/backports//. > I was running the 'configure' script under the OFED root. In OFED 1.1, it is possible to run configure without flags to patch the sources, and then run it again with --without-patches and the desired flags. In OFED 1.2 (Vlad's tree) this scenario causes compilation errors when running 'make' afterwards (on 2.6.9-34ELsmp and on 2.6.16.21-0.8, but NOT on 2.6.19). However, when I just ran configure on a fresh source tree, with all the desired flags, it worked just fine. It seems to happen because configure only adds the kernel_addons include path to the Makefiles of the selected components. Maybe it should patch all Makefiles, or copy the files to ./include? _______________________________________________________________ Yosef Etigin, ib-host-stack Voltaire – The Grid Backbone www.voltaire.com From tziporet at dev.mellanox.co.il Thu Dec 14 03:30:24 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 14 Dec 2006 13:30:24 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <45804021.9050209@hp.com> References: <0b8901c71ed3$e9b9f740$0281a8c0@ebpc> <45804021.9050209@hp.com> Message-ID: <458135D0.6090100@dev.mellanox.co.il> Philippe Bernadat wrote: > Roland, > > Attached are the two lspci outputs.
>
> The only differences I see are:
>
> [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
> 1d0
> < pcilib: Resource 5 in /sys/bus/pci/devices/0000:00:1f.1/resource has
> a 64-bit address, ignoring
> 40c39
> < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> ---
> > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> [philippe at hamish o2ib]$
>
Have you tried running with

options ib_mthca tune_pci=1

Tziporet

From halr at voltaire.com Thu Dec 14 04:12:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2006 07:12:04 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion In-Reply-To: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> Message-ID: <1166098306.28709.122104.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = ____
> ibutils rev = ____
> Total=264 Pass=261 Fail=3
>
> Pass:
> 36 Stability IS1-16.topo
> 36 Pkey IS1-16.topo
> 36 Multicast IS1-16.topo
> 36 LidMgr IS1-16.topo
> 35 OsmStress IS1-16.topo
> 12 Stability IS3-loop.topo
> 12 Stability IS3-128.topo
> 12 Pkey IS3-128.topo
> 12 OsmStress IS3-128.topo
> 12 Multicast IS3-loop.topo
> 11 Multicast IS3-128.topo
> 11 LidMgr IS3-128.topo
>
> Failures:
> 1 OsmStress IS1-16.topo
> 1 Multicast IS3-128.topo
> 1 LidMgr IS3-128.topo

There are now 2 more failures. You had previously explained OsmStress failure as needing more investigation. Now there is a Multicast and LidMgr failure yet nothing really changed since the previous run the night before. Are these new tests ? What were the failures ?

The repetitions have also been reduced from previous reports. Are these the same or different tests ?

-- Hal

From philippe_bernadat at hp.com Thu Dec 14 04:24:04 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Thu, 14 Dec 2006 13:24:04 +0100 Subject: [openib-general] Performance Degradation with OFED v.
Voltaire In-Reply-To: <458135D0.6090100@dev.mellanox.co.il> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05537DAF@idaexc03.emea.cpqcorp.net> > Have you tried running with > > options ib_mthca tune_pci=1 > My understanding is that this is not required anymore with OFED-1.1. It used to make a significant difference with OFED-1.0, but I didn't observe it with OFED-1.1. And again, the user mode performance is comparable between VIB and OFED. Philippe > -----Original Message----- > From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] > Sent: Thursday, December 14, 2006 12:30 PM > To: Bernadat, Philippe > Cc: Eric Barton; Roland Dreier; Matt Leininger; > openib-general at openib.org; Bernadat, Philippe > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire > > Philippe Bernadat wrote: > > Roland, > > > > Attached are the two lspci outputs. > > > > The only differences I see are: > > > > [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed > > 1d0 > > < pcilib: Resource 5 in > /sys/bus/pci/devices/0000:00:1f.1/resource has > > a 64-bit address, ignoring > > 40c39 > > < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 > > --- > > > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 > > [philippe at hamish o2ib]$ > > > Have you tried running with > > options ib_mthca tune_pci=1 > > Tziporet > > From ishai at dev.mellanox.co.il Thu Dec 14 04:24:15 2006 From: ishai at dev.mellanox.co.il (ishai) Date: Thu, 14 Dec 2006 14:24:15 +0200 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com> Message-ID: <4581426F.2060106@dev.mellanox.co.il> Hi Roland, SRP was tested on a 32-bit operating system running on a 32-bit platform and on a 64-bit OS, and there are no known problems.
In the interoperability tests done at UNH-IOL in September we found out that SRP on a 32-bit operating system running on a 64-bit platform causes crashes. (It was tested on RHEL4-U3). Since we did not have enough time to solve this problem before the release and since we think that this combination (32-bit OS on 64-bit platform) is less common, we treat this issue as low priority. The remark in the release notes indicates that SRP does not work on this combination. Ishai Roland Dreier wrote: > > We seem to get panics during multithreaded IO on the initiator. The > > panics don't seem to always point to any SRP code, but might be a > > symptom of memory corruption. It only seems to show up on 32-bit > > kernels. Our distro is a derivative of Fedora. There were a few more > > things we wanted to consider, but we stopped debugging when we saw an > > indication in the release notes that it's not a supported configuration. > >I'm not sure who declared it "unsupported" and I would really like to >know what issue(s) led to that declaration. Your report is the first >I've heard of anything like this, and I have to say that it seems >pretty implausible that running a 32-bit kernel on 64-bit-capable >hardware would be the source of problems -- if there is an issue then >I would expect it to be something to do with the 32-bit kernel. > >In any case I definitely consider 32-bit kernels as something I >support, so if you could post a real bug report (what specific kernel >version, if you are running out-of-tree drivers (like OFED), host >server details, SRP target details, how to reproduce, etc) for your >problems with 32-bit kernels then I will try to debug things. > > - R.
> >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From monil at voltaire.com Thu Dec 14 04:35:24 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 14 Dec 2006 14:35:24 +0200 Subject: [openib-general] [openfabrics-ewg] OFED release testing Task Force In-Reply-To: <1E3DCD1C63492545881FACB6063A57C19AA561@mtiexch01.mti.com> References: <1E3DCD1C63492545881FACB6063A57C19AA561@mtiexch01.mti.com> Message-ID: <6a122cc00612140435k55b4e177se9c58279d7444603@mail.gmail.com> Nimrod, On 11/22/06, Nimrod Gindi wrote: > > > > Hi, > > As a follow-up on the presentation prepared and presented by Amit Krig and > myself in the OFA Meeting during SC06 I'm sending out this e-mail as a call > for participation. > > The targets of the Ad-hoc task force will be (as agreed upon in the session > we had): unify the test results formats, define release quality criteria, > define/assign ULP verification owners and enhance interoperability > finger-print in the release process. > > > > We would like to have a participant from each contributing company and would > appreciate any response sent to me with a name of a person from the company > to attend and take action on behalf of this task force. I'm sorry for the late reply. Yosi (yosefe at voltaire.com) and I will be happy to join. -- Moni > > BTW: I've also attached the presentation that was given in the OFA meeting. > > Happy Holidays to everyone, > > > > Nimrod Gindi > > Mellanox Technologies Ltd.
> > mail : nimrodg at mellanox.com > > Cell : +1-408-750-4801 > > Office: +1-347-342-0011 > > Fax : +1-212-987-0275 > > > _______________________________________________ > openfabrics-ewg mailing list > openfabrics-ewg at openib.org > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > From mst at mellanox.co.il Thu Dec 14 04:46:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 14:46:29 +0200 Subject: [openib-general] [PATCH] mthca: save low memory used for reserved objects Message-ID: <20061214124629.GB24840@mellanox.co.il> We never need to allocate memory for reserved objects in low memory. Signed-off-by: Michael S. Tsirkin --- I noticed this obvious optimization when going over the icm allocation code. Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_memfree.c =================================================================== --- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_memfree.c +++ linux-2.6/drivers/infiniband/hw/mthca/mthca_memfree.c @@ -313,8 +313,7 @@ struct mthca_icm_table *mthca_alloc_icm_ chunk_size = nobj * obj_size - i * MTHCA_TABLE_CHUNK_SIZE; table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT, - (use_lowmem ? 
GFP_KERNEL : GFP_HIGHUSER) | - __GFP_NOWARN); + GFP_HIGHUSER | __GFP_NOWARN); if (!table->icm[i]) goto err; if (mthca_MAP_ICM(dev, table->icm[i], virt + i * MTHCA_TABLE_CHUNK_SIZE, -- MST From eitan at mellanox.co.il Thu Dec 14 05:32:12 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 14 Dec 2006 15:32:12 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion In-Reply-To: <1166098306.28709.122104.camel@hal.voltaire.com> References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> <1166098306.28709.122104.camel@hal.voltaire.com> Message-ID: <4581525C.9060104@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote: > >> OSM Simulation Regression Summary >> OpenSM rev = ____ >> ibutils rev = ____ >> Total=264 Pass=261 Fail=3 >> >> Pass: >> 36 Stability IS1-16.topo >> 36 Pkey IS1-16.topo >> 36 Multicast IS1-16.topo >> 36 LidMgr IS1-16.topo >> 35 OsmStress IS1-16.topo >> 12 Stability IS3-loop.topo >> 12 Stability IS3-128.topo >> 12 Pkey IS3-128.topo >> 12 OsmStress IS3-128.topo >> 12 Multicast IS3-loop.topo >> 11 Multicast IS3-128.topo >> 11 LidMgr IS3-128.topo >> >> Failures: >> 1 OsmStress IS1-16.topo >> 1 Multicast IS3-128.topo >> 1 LidMgr IS3-128.topo >> > > There are now 2 more failures. You had previously explained OsmStress > failure as needing more investigation. Now there is a Multicast and > LidMgr failure yet nothing really changed since the previous run the > night before. Are these new tests ? What were the failures ? > The tests use random seeds and thus can catch other bugs in each run. I am investigating these failures. Some might be due to bugs in the checker code too. Please pay attention the failure rate is low (LidMgr pass 36+11 runs failed 1 test). This to imply the bug is a hard to find one. > The repetitions have also been reduced from previous reports. Are these > the same or different tests ? > Number of repetitions depends on runtime. 
The regression started later and thus ran fewer iterations. I run the "same" tests ("same" means same code, not same random sequence).
> -- Hal
>
From swise at opengridcomputing.com Thu Dec 14 05:52:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:52:33 -0600 Subject: [openib-general] [PATCH v4 00/13] 2.6.20 Chelsio T3 RDMA Driver Message-ID: <20061214135233.21159.78613.stgit@dell3.ogc.int>

Roland,

I think this is ready to go once the ethernet driver is pulled in.

Version 4 changes:
- Cleaned up spacing in the Kconfig file
- Remove locking.txt file - it's not needed
- Remove -O1 from the debug config option
- BugFix: support new LLD interface for dual-port adapters

Version 3 changes:
- BugFix: Don't use mutex inside of the mmap function.
- BugFix: Move QP to TERMINATE when TERMINATE AE is processed
- Support the new work queue design
- Merged up to linus's tree as of 12/8/2006
- Misc nits

Version 2 changes:
- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-----

The following series implements the Chelsio T3 iWARP/RDMA Driver to be considered for inclusion in 2.6.20. It depends on the Chelsio T3 Ethernet driver which is also under review now for 2.6.20.

The latest Chelsio T3 Ethernet driver patch can be pulled from:
http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:
git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.
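
For readers unfamiliar with the kref pattern named in the Version 2 changes ("Use krefs to track endpoint reference counts"), the idea can be sketched in userspace C roughly as follows. This is only an illustration under assumptions: struct ref, struct endpoint and every function name below are hypothetical stand-ins built on C11 atomics. It is not the driver's code and not the kernel's kref API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct kref. */
struct ref {
	atomic_int count;
};

/* Hypothetical endpoint; the fields of the driver's real endpoint
 * structure are not shown in this thread. */
struct endpoint {
	struct ref ref;	/* must stay the first member, see endpoint_release() */
	int id;
};

/* Counts released endpoints so the behaviour can be observed. */
static int endpoints_freed;

static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);	/* the creator holds the first reference */
}

static void ref_get(struct ref *r)
{
	/* Taking a reference on an already-released object is a bug. */
	assert(atomic_load(&r->count) > 0);
	atomic_fetch_add(&r->count, 1);
}

/* Drop one reference and run release() when the last one goes away.
 * Returns 1 if the object was released, 0 otherwise. */
static int ref_put(struct ref *r, void (*release)(struct ref *))
{
	if (atomic_fetch_sub(&r->count, 1) == 1) {
		release(r);
		return 1;
	}
	return 0;
}

static void endpoint_release(struct ref *r)
{
	/* ref is the first member, so both pointers refer to the same address. */
	struct endpoint *ep = (struct endpoint *)r;

	free(ep);
	endpoints_freed++;
}

static struct endpoint *endpoint_alloc(int id)
{
	struct endpoint *ep = malloc(sizeof(*ep));

	if (!ep)
		return NULL;
	ref_init(&ep->ref);
	ep->id = id;
	return ep;
}

static void endpoint_get(struct endpoint *ep)
{
	ref_get(&ep->ref);
}

static int endpoint_put(struct endpoint *ep)
{
	return ref_put(&ep->ref, endpoint_release);
}
```

In this sketch, each code path that keeps a pointer to an endpoint calls endpoint_get() and later endpoint_put(); the memory is freed only by whichever put drops the count to zero, so no path has to know whether it is the last user.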
From swise at opengridcomputing.com Thu Dec 14 05:53:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:53:05 -0600 Subject: [openib-general] [PATCH v4 01/13] Linux RDMA Core Changes In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135303.21159.61880.stgit@dell3.ogc.int> Support provider-specific data in ib_uverbs_cmd_req_notify_cq(). The Chelsio iwarp provider library needs to pass information to the kernel verb for re-arming the CQ. Signed-off-by: Steve Wise --- drivers/infiniband/core/uverbs_cmd.c | 9 +++++++-- drivers/infiniband/hw/amso1100/c2.h | 2 +- drivers/infiniband/hw/amso1100/c2_cq.c | 3 ++- drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 ++- drivers/infiniband/hw/ehca/ehca_reqs.c | 3 ++- drivers/infiniband/hw/ipath/ipath_cq.c | 4 +++- drivers/infiniband/hw/ipath/ipath_verbs.h | 3 ++- drivers/infiniband/hw/mthca/mthca_cq.c | 6 ++++-- drivers/infiniband/hw/mthca/mthca_dev.h | 4 ++-- include/rdma/ib_verbs.h | 5 +++-- 10 files changed, 28 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 743247e..5dd1de9 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_udata udata; struct ib_cq *cq; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i if (!cq) return -EINVAL; - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + INIT_UDATA(&udata, buf + sizeof cmd, 0, + in_len - sizeof cmd, 0); + + cq->device->req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP, + &udata); put_cq_read(cq); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h index 04a9db5..9a76869 100644 --- a/drivers/infiniband/hw/amso1100/c2.h +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata); /* CM */ extern int c2_llp_connect(struct iw_cm_id *cm_id, diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 05c9154..7ce8bca 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n return npolled; } -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct c2_mq_shared __iomem *shared; struct c2_cq *cq; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 3720e30..566b30c 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata); struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index b46bda1..3ed6992 100644 --- 
a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -634,7 +634,8 @@ poll_cq_exit0: return ret; } -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 87462e0..27ba4db 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq) * ipath_req_notify_cq - change the notification type for a completion queue * @ibcq: the completion queue * @notify: the type of notification to request + * @udata: user data * * Returns 0 for success. * * This may be called from interrupt context. Also called by * ib_req_notify_cq() in the generic verbs code. */ -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); unsigned long flags; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 8039f6e..0d39960 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_ int ipath_destroy_cq(struct ib_cq *ibcq); -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata); int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 283d50b..15cbd49 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -722,7 +722,8 @@ 
repoll: return err == 0 || err == -EAGAIN ? npolled : err; } -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, + struct ib_udata *udata) { __be32 doorbell[2]; @@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, return 0; } -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct mthca_cq *cq = to_mcq(ibcq); __be32 doorbell[2]; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index fe5cecf..6b9ccf6 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 8eacc35..e3e1a2c 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -941,7 +941,8 @@ struct ib_device { struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); int (*req_notify_cq)(struct ib_cq *cq, - enum ib_cq_notify cq_notify); + enum ib_cq_notify cq_notify, + struct ib_udata *udata); int (*req_ncomp_notif)(struct ib_cq *cq, int wc_cnt); struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_ static inline int ib_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) { - return 
cq->device->req_notify_cq(cq, cq_notify); + return cq->device->req_notify_cq(cq, cq_notify, NULL); } /** From swise at opengridcomputing.com Thu Dec 14 05:53:35 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:53:35 -0600 Subject: [openib-general] [PATCH v4 02/13] Device Discovery and ULLD Linkage In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135335.21159.79371.stgit@dell3.ogc.int> Code to discover all the T3 devices and register them with the T3 RDMA Core and the Linux RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 189 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 175 +++++++++++++++++++++++++++++++++ 2 files changed, 364 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c new file mode 100644 index 0000000..acbe449 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -0,0 +1,189 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define DRV_VERSION "1.1" + +MODULE_AUTHOR("Boyd Faulkner, Steve Wise"); +MODULE_DESCRIPTION("Chelsio T3 RDMA Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; + +static void open_rnic_dev(struct t3cdev *); +static void close_rnic_dev(struct t3cdev *); + +struct cxgb3_client t3c_client = { + .name = "iw_cxgb3", + .add = open_rnic_dev, + .remove = close_rnic_dev, + .handlers = t3c_handlers, + .redirect = iwch_ep_redirect +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(dev_mutex); + +static void rnic_init(struct iwch_dev *rnicp) +{ + PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); + idr_init(&rnicp->cqidr); + idr_init(&rnicp->qpidr); + idr_init(&rnicp->mmidr); + spin_lock_init(&rnicp->lock); + + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL 
<< 24) - 1; + rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev); + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 8; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1; + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + rnicp->attr.cq_overflow_detection = 1; + return; +} + +static void open_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + static int vers_printed; + + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + if (!vers_printed++) + printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + DRV_VERSION); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR MOD "Cannot allocate ib device\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + printk(KERN_ERR MOD "Unable to open CXIO rdev\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + rnic_init(rnicp); + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR MOD "Unable to register device\n"); + close_rnic_dev(tdev); + } + printk(KERN_INFO MOD "Initialized device %s\n", + pci_name(rnicp->rdev.rnic_info.pdev)); + return; +} + +static void close_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + mutex_lock(&dev_mutex); + 
list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + idr_destroy(&dev->cqidr); + idr_destroy(&dev->qpidr); + idr_destroy(&dev->mmidr); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + cxgb3_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + cxgb3_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..752b6ad --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include +#include +#include +#include + +#include + +#include "cxio_hal.h" +#include "cxgb3_offload.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. + * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. 
+ */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct idr cqidr; + struct idr qpidr; + struct idr mmidr; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +static inline int t3b_device(const struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3B); +} + +static inline int t3a_device(const struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3A); +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid) +{ + return idr_find(&rhp->cqidr, cqid); +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid) +{ + return idr_find(&rhp->qpidr, qpid); +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) +{ + return idr_find(&rhp->mmidr, mmid); +} + +static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, + void *handle, u32 id) +{ + int ret; + u32 newid; + + do { + if (!idr_pre_get(idr, GFP_KERNEL)) { + return -ENOMEM; + } + spin_lock_irq(&rhp->lock); + ret = idr_get_new_above(idr, handle, id, &newid); + BUG_ON(newid != id); + spin_unlock_irq(&rhp->lock); + } while (ret == -EAGAIN); + + return ret; +} + +static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id) +{ + spin_lock_irq(&rhp->lock); + idr_remove(idr, id); + spin_unlock_irq(&rhp->lock); +} + +extern struct cxgb3_client t3c_client; +extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif From swise at opengridcomputing.com Thu 
Dec 14 05:54:05 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:54:05 -0600 Subject: [openib-general] [PATCH v4 03/13] Provider Methods and Data Structures In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135405.21159.5811.stgit@dell3.ogc.int> Provider methods to support the Linux RDMA verbs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 363 ++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 68 ++ 3 files changed, 1602 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..e9721b1 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1171 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + struct iwch_dev *rhp = to_iwch_dev(context->device); + struct iwch_ucontext *ucontext = to_iwch_ucontext(context); + PDBG("%s context %p\n", __FUNCTION__, context); + cxio_release_ucontext(&rhp->rdev, &ucontext->uctx); + kfree(ucontext); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext 
*context; + struct iwch_dev *rhp = to_iwch_dev(ibdev); + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + cxio_init_ucontext(&rhp->rdev, &context->uctx); + INIT_LIST_HEAD(&context->mmaps); + spin_lock_init(&context->mmap_lock); + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq); + chp = to_iwch_cq(ib_cq); + + remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid); + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + if (t3a_device(rhp)) { + + /* + * T3A: Add some fluff to handle extra CQEs inserted + * for various errors. + * Additional CQE possibilities: + * TERMINATE, + * incoming RDMA WRITE Failures + * incoming RDMA READ REQUEST FAILUREs + * NOTE: We cannot ensure the CQ won't overflow. 
+ */ + entries += 16; + } + entries = roundup_pow_of_two(entries); + chp->cq.size_log2 = ilog2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid); + + if (context) { + struct iwch_mm_entry *mm; + + mm = kmalloc(sizeof *mm, GFP_KERNEL); + if (!mm) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-ENOMEM); + } + uresp.cqid = chp->cq.cqid; + uresp.size_log2 = chp->cq.size_log2; + uresp.physaddr = virt_to_phys(chp->cq.queue); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm); + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + mm->addr = uresp.physaddr; + mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * + sizeof (struct t3_cqe)); + insert_mmap(to_iwch_ucontext(context), mm); + } + PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = ilog2(cqe); + + /* Dont allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + unsigned long flag; + struct iwch_req_notify_cq ucmd; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + if (udata && t3b_device(rhp)) { + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + spin_lock_irqsave(&chp->lock, flag); + chp->cq.rptr = ucmd.rptr; + } else + spin_lock_irqsave(&chp->lock, flag); + PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flag); + if (err) + printk(KERN_ERR MOD "Error %d 
rearming CQID 0x%x\n", err, + chp->cq.cqid); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT; + struct cxio_rdev *rdev_p; + int ret = 0; + struct iwch_mm_entry *mm; + struct iwch_ucontext *ucontext; + + PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, + pgaddr, len); + + if (vma->vm_start & (PAGE_SIZE-1)) { + return -EINVAL; + } + + rdev_p = &(to_iwch_dev(context->device)->rdev); + ucontext = to_iwch_ucontext(context); + + mm = remove_mmap(ucontext, pgaddr, len); + if (!mm) + return -EINVAL; + kfree(mm); + + if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && + (pgaddr < (rdev_p->rnic_info.udbell_physbase + + rdev_p->rnic_info.udbell_len))) { + + /* + * Map T3 DB register. + */ + if (vma->vm_flags & VM_READ) { + return -EPERM; + } + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; + vma->vm_flags &= ~VM_MAYREAD; + ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } else { + + /* + * Map WQ or CQ contig dma memory... 
+ */ + ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } + + return ret; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + + php = to_iwch_pd(pd); + rhp = php->rhp; + PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid); + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + u32 mmid; + + PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr); + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mmid = mhp->attr.stag >> 8; + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, + mhp->attr.pbl_addr); + remove_handle(rhp, &rhp->mmidr, mmid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); + kfree(mhp); + return 0; +} + +static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + __be64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; 
+ struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* NOTE: TPT perms are backwards from BIND WR perms! */ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + __be64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd); + + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + 
+ memcpy(&mh, mhp, sizeof *mhp); + + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + __be64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + struct iwch_reg_user_mr_resp uresp; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, &region->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, &region->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= 
(acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + + if (udata && t3b_device(rhp)) { + uresp.pbl_addr = (mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3; + PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, + uresp.pbl_addr); + + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_dereg_mr(&mhp->ibmr); + err = -EFAULT; + goto err; + } + } + + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + + /* + * T3 only supports 32 bits of size. + */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + u32 mmid; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + mmid = (mw->rkey) >> 8; + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + remove_handle(rhp, &rhp->mmidr, mmid); + 
kfree(mhp); + PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + struct iwch_ucontext *ucontext; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) + : NULL; + cxio_destroy_qp(&rhp->rdev, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx); + + PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, + ib_qp, qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + struct iwch_ucontext *ucontext; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize > T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * NOTE: The SQ and 
total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. + */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = ilog2(wqsize); + qhp->wq.rq_size_log2 = ilog2(rqsize); + qhp->wq.sq_size_log2 = ilog2(sqsize); + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - These don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... 
I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid); + + if (udata) { + + struct iwch_mm_entry *mm1, *mm2; + + mm1 = kmalloc(sizeof *mm1, GFP_KERNEL); + if (!mm1) { + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + mm2 = kmalloc(sizeof *mm2, GFP_KERNEL); + if (!mm2) { + kfree(mm1); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + uresp.qpid = qhp->wq.qpid; + uresp.size_log2 = qhp->wq.size_log2; + uresp.sq_size_log2 = qhp->wq.sq_size_log2; + uresp.rq_size_log2 = qhp->wq.rq_size_log2; + uresp.physaddr = virt_to_phys(qhp->wq.queue); + uresp.doorbell = qhp->wq.udb; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm1); + kfree(mm2); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + mm1->addr = uresp.physaddr; + mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr)); + insert_mmap(ucontext, mm1); + mm2->addr = uresp.doorbell & PAGE_MASK; + mm2->len = PAGE_SIZE; + insert_mmap(ucontext, mm2); + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("%s sq_num_entries %d, rq_num_entries %d " + "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n", + __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries, + qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* 
Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? + (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn); + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s ibdev %p, port %d, index %d, gid %p\n", + __FUNCTION__, ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + + dev 
= to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + 
lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << 
IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + 
dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + return 0; +bail2: + ib_unregister_device(&dev->ibdev); +bail1: + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..4d98886 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,363 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 +}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u32 scq; + u32 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. 
*/ + u8 enable_bind; + u8 enable_mmid0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If specify the current state, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u32 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +struct iwch_ucontext { + struct ib_ucontext ibucontext; + struct cxio_ucontext uctx; + spinlock_t mmap_lock; + struct list_head mmaps; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +struct iwch_mm_entry { + struct list_head entry; + u64 addr; + unsigned len; +}; + +static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext, + u64 addr, unsigned len) +{ + struct list_head *pos, *nxt; + struct iwch_mm_entry *mm; + + spin_lock_irq(&ucontext->mmap_lock); + list_for_each_safe(pos, nxt, &ucontext->mmaps) { + + mm = list_entry(pos, struct iwch_mm_entry, entry); + if (mm->addr == addr && mm->len == len) { + list_del_init(&mm->entry); + spin_unlock_irq(&ucontext->mmap_lock); + PDBG("%s addr 0x%llx len 
%d\n", __FUNCTION__, mm->addr, + mm->len); + return mm; + } + } + spin_unlock_irq(&ucontext->mmap_lock); + return NULL; +} + +static inline void insert_mmap(struct iwch_ucontext *ucontext, + struct iwch_mm_entry *mm) +{ + spin_lock_irq(&ucontext->mmap_lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len); + list_add_tail(&mm->entry, &ucontext->mmaps); + spin_unlock_irq(&ucontext->mmap_lock); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + 
IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_mmid_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg); +int iwch_register_device(struct iwch_dev *dev); +void iwch_unregister_device(struct iwch_dev *dev); +int iwch_quiesce_qps(struct iwch_cq *chp); +int iwch_resume_qps(struct iwch_cq *chp); +void stop_read_rep_timer(struct iwch_qp *qhp); +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 
*page_list); +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages); +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list); + + +#define IWCH_NODE_DESC "cxgb3 Chelsio Communications" + +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..4e4b9c9 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u64 physaddr; + __u32 cqid; + __u32 size_log2; +}; + +struct iwch_create_qp_resp { + __u64 physaddr; + __u64 doorbell; + __u32 qpid; + __u32 size_log2; + __u32 sq_size_log2; + __u32 rq_size_log2; +}; + +struct iwch_reg_user_mr_resp { + __u32 pbl_addr; +}; + +struct iwch_req_notify_cq { + __u32 rptr; +}; +#endif From swise at opengridcomputing.com Thu Dec 14 05:54:36 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:54:36 -0600 Subject: [openib-general] [PATCH v4 04/13] Connection Manager In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135435.21159.92185.stgit@dell3.ogc.int> This code implements the iWARP CM provider methods for the Chelsio driver. The Chelsio ULLD is used to set up and tear down TCP connections, and the T3 RDMA Core is used to move the connections in and out of RDMA mode.
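For readers following the connection lifecycle described above, here is a minimal userspace sketch of the guarded state transition that the CM code relies on (state_comp_exch() in iwch_cm.c). It is not the driver's implementation: it swaps the kernel spinlock for a C11 atomic compare-exchange, and the names ep_common, ep_state, ep_init, ep_state_read, ep_state_comp_exch and ep_state_name are hypothetical. The enum ordering is only assumed to mirror the states[] string table in the patch, since iwch_cm.h is not quoted here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical enum; the order is assumed to follow states[] in iwch_cm.c. */
enum ep_state {
	EP_IDLE, EP_LISTEN, EP_CONNECTING, EP_MPA_WAIT_REQ, EP_MPA_REQ_SENT,
	EP_MPA_REQ_RCVD, EP_MPA_REP_SENT, EP_FPDU_MODE, EP_ABORTING,
	EP_CLOSING, EP_MORIBUND, EP_DEAD, EP_NUM_STATES
};

/* Same strings as the states[] table in the patch. */
static const char *const ep_state_names[EP_NUM_STATES] = {
	"idle", "listen", "connecting", "mpa_wait_req", "mpa_req_sent",
	"mpa_req_rcvd", "mpa_rep_sent", "fpdu_mode", "aborting",
	"closing", "moribund", "dead",
};

/* Illustrative stand-in for the endpoint's common part. */
struct ep_common {
	_Atomic int state;
};

void ep_init(struct ep_common *epc)
{
	atomic_init(&epc->state, EP_IDLE);
}

enum ep_state ep_state_read(struct ep_common *epc)
{
	return (enum ep_state)atomic_load(&epc->state);
}

/*
 * Move to 'exch' only if the endpoint is currently in 'comp'.
 * Returns 1 if the transition happened, 0 otherwise.
 */
int ep_state_comp_exch(struct ep_common *epc, enum ep_state comp,
		       enum ep_state exch)
{
	int expected = comp;

	return atomic_compare_exchange_strong(&epc->state, &expected,
					      (int)exch);
}

const char *ep_state_name(enum ep_state s)
{
	assert((unsigned)s < EP_NUM_STATES);
	return ep_state_names[s];
}
```

A caller would typically attempt something like ep_state_comp_exch(&ep, EP_MPA_REP_SENT, EP_FPDU_MODE) and only continue with RDMA-mode setup when it returns 1, so that two racing events cannot both act on the same transition.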
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2058 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 223 ++++ drivers/infiniband/hw/cxgb3/tcb.h | 603 ++++++++++ 3 files changed, 2884 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..962618f --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2058 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "tcb.h" +#include "cxgb3_offload.h" +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. (default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static int snd_win = 512 * 1024; +module_param(snd_win, int, 0444); +MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)"); + +static unsigned int nocong = 1; +module_param(nocong, uint, 0444); +MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)"); + +static void process_work(struct work_struct *work); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work); + +static struct sk_buff_head rxq; +static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, 
ep); + if (timer_pending(&ep->timer)) { + PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + } else + get_ep(&ep->com); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + put_ep(&ep->com); +} + +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) +{ + struct cpl_tid_release *req; + + skb = get_skb(skb, sizeof *req, GFP_KERNEL); + if (!skb) + return; + req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); + skb->priority = CPL_PRIORITY_SETUP; + tdev->send(tdev, skb); + return; +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, 
ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) + ep->emss -= 12; + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +static int state_comp_exch(struct iwch_ep_common *epc, + enum iwch_ep_state comp, + enum iwch_ep_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&epc->lock, flags); + ret = (epc->state == comp); + if (ret) + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return ret; +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], + states[new]); + epc->state = new; + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + kref_init(&epc->kref); + spin_lock_init(&epc->lock); + init_waitqueue_head(&epc->waitq); + } + PDBG("%s alloc ep %p\n", __FUNCTION__, epc); + return (void *) epc; +} + +void __free_ep(struct kref *kref) +{ + struct iwch_ep_common *epc; + epc = container_of(kref, struct iwch_ep_common, kref); + PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]); + kfree(epc); +} + +static void release_ep_resources(struct iwch_ep 
*ep) +{ + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + state_set(&ep->com, DEAD); + cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, ep->hwtid, NULL); + put_ep(&ep->com); +} + +static void process_work(struct work_struct *work) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + put_ep((struct iwch_ep_common *)ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip, + __be32 peer_ip, __be16 local_port, + __be16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + if (ip_route_output_flow(&rt, &fl, NULL, 0)) + return NULL; + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + req->cmd = CPL_ABORT_NO_RST; + cxgb3_ofld_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + 
return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + cxgb3_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) +{ + PDBG("%s ep %p\n", __FILE__, ep); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, GFP_KERNEL); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 
0, sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p tid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, + ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + if (state_read(&ep->parent_ep->com) != DEAD) + 
ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + put_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) + return; + state_set(&ep->com, FPDU_MODE); + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * copy the new data into our accumulation buffer. 
+ */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * if we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 
1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) + goto out; +err: + abort_connection(ep, skb); +out: + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) + return; + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA Header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb); + return; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s Unexpected streaming data." 
+ " ep %p state %d tid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* + * The ep will timeout and inform the ULP of the failure. + * See ep_timeout(). + */ + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us its just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + + if (credits == 0) + return CPL_RET_BUF_DONE; + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (!ep->com.rpl_err) { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + close_complete_upcall(ep); + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void 
*ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status, + status2errno(rpl->status)); + connect_reply_upcall(ep, status2errno(rpl->status)); + state_set(&ep->com, DEAD); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, GET_TID(rpl), NULL); + cxgb3_free_atid(ep->com.tdev, ep->atid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, + rpl->status, status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, 
sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip, + struct 
sk_buff *skb) +{ + PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, + peer_ip); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(struct cpl_tid_release)); + skb_get(skb); + + if (tdev->type == T3B) + release_tid(tdev, hwtid, skb); + else { + struct cpl_pass_accept_rpl *rpl; + + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, + hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); + } +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + get_ep(&parent_ep->com); + child_ep->parent_ep = parent_ep; + child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void 
*ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + int ret; + int abort = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + dst_confirm(ep->dst); + switch (state_read(&ep->com)) { + case MPA_REQ_WAIT: + state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. + */ + state_set(&ep->com, CLOSING); + get_ep(&ep->com); + break; + case MPA_REP_SENT: + state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + state_set(&ep->com, CLOSING); + peer_close_upcall(ep); + attrs.next_state = IWCH_QP_STATE_CLOSING; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD "%s - qp <- closing err!\n", + __FUNCTION__); + abort = 1; + } + break; + case ABORTING: + goto out; + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + goto out; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + goto out; + case DEAD: + goto out; + default: + BUG_ON(1); + } + iwch_ep_disconnect(ep, abort, GFP_KERNEL); +out: + return CPL_RET_BUF_DONE; +} + +/* + * 
Returns whether an ABORT_REQ_RSS message is a negative advice. + */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + get_ep(&ep->com); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + switch (state_read(&ep->com)) { + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + break; + case DEAD: + default: + BUG_ON(1); + break; + } 
+ + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumer consumption. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_rdma_ec_status *rep = cplhdr(skb); + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, + rep->status); + if (rep->status) { + struct iwch_qp_attributes attrs; + + printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n", + __FUNCTION__, ep->hwtid); + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + abort_connection(ep, NULL); + } + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) { + struct sk_buff *skb; + + connect_reply_upcall(ep, -ETIMEDOUT); + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + if
(state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) { + struct sk_buff *skb; + + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) { + struct sk_buff *skb; + + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) + abort_connection(ep, skb); + } + put_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) + abort_connection(ep, NULL); + else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_qp *qp = get_qhp(h, conn_param->qpn); + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + BUG_ON(!qp); + + if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) || + (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) { + abort_connection(ep, NULL); + return -EINVAL; + } + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = qp; + + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird %d ord %d\n", __FUNCTION__, 
__LINE__, ep->ird, ep->ord); + get_ep(&ep->com); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + put_ep(&ep->com); + return err; + } + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + } else { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + put_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, + ep->com.qp, cm_id); + + /* + * Allocate an active TID to initiate a TCP connection. 
+ */ + ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, ep->dst->neighbour, + ep->dst->neighbour->dev); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) + goto out; + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + cxgb3_free_atid(ep->com.tdev, ep->atid); +fail2: + put_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + + might_sleep(); + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) + goto fail3; + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + cxgb3_free_stid(ep->com.tdev, ep->stid); +fail2: + put_ep(&ep->com); +fail1: +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + cxgb3_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + put_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret=0; + int state; + + + state = state_read(&ep->com); + PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, + states[state], abrupt); + if (state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + return 0; + } + if (abrupt) { + if (state != ABORTING) { + state_set(&ep->com, ABORTING); + ret = send_abort(ep, NULL, gfp); + } + } else { + + if (state != CLOSING) + state_set(&ep->com, CLOSING); + else { + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + } + + ret = send_halfclose(ep, gfp); + } + return ret; +} + +int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, + l2t); + dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + ep->dst = new; +
return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + get_ep(epc); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..893f9d0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,223 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include +#include + +#include +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! 
*/ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define put_ep(ep) { \ + PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_put(&((ep)->kref), __free_ep); \ +} + +#define get_ep(ep) { \ + PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_get(&((ep)->kref)); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + __be16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + __be16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_layers_types { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct 
iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + struct kref kref; + spinlock_t lock; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535<<wscale) < win) + wscale++; + return wscale; +} References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135506.21159.2723.stgit@dell3.ogc.int> Code to manipulate the QP. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++ 1 files changed, 1007 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..9f6b251 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,1007 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + u32 plen; + + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved[0] = 0; + wqe->send.reserved[1] = 0; + wqe->send.reserved[2] = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = __constant_cpu_to_be32(0); + wqe->send.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 5; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + wqe->send.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int i; + u32 plen; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->write.rdmaop = T3_RDMA_WRITE; + wqe->write.reserved[0] = 0; + wqe->write.reserved[1] = 0; + wqe->write.reserved[2] = 0; + 
wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey); + wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr); + + if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + plen = 4; + wqe->write.sgl[0].stag = wr->imm_data; + wqe->write.sgl[0].len = __constant_cpu_to_be32(0); + wqe->write.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 6; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->write.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->write.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->write.sgl[i].to = + cpu_to_be64(wr->sg_list[i].addr); + } + wqe->write.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 5 + ((wr->num_sge) << 1); + } + wqe->write.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + if (wr->num_sge > 1) + return -EINVAL; + wqe->read.rdmaop = T3_READ_REQ; + wqe->read.reserved[0] = 0; + wqe->read.reserved[1] = 0; + wqe->read.reserved[2] = 0; + wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr); + wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey); + wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length); + wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr); + *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3; + return 0; +} + +/* + * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. 
+ */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (!mhp->attr.state) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (mhp->attr.zbva) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + + if (sg_list[i].addr < mhp->attr.va_fbo) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = ((mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3) + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = 
cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapters address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + sqp = qhp->wq.sq + + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* T3 reads are always signaled */ + err = iwch_build_rdma_read(wqe, wr, 
&t3_wr_flit_cnt); + if (err) + break; + sqp->read_len = wqe->read.local_len; + if (!qhp->wq.oldest_read) + qhp->wq.oldest_read = sqp; + break; + default: + PDBG("%s post of type=%d TBD!\n", __FUNCTION__, + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp->wr_id = wr->wr_id; + sqp->opcode = wr2opcode(t3_wr_opcode); + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED); + + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", + __FUNCTION__, wr->wr_id, idx, + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), + sqp->opcode); + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); 
+ PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rw_rptr 0x%x " + "wqe %p \n", __FUNCTION__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + unsigned long flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp = 
qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + sqp->wr_id = mw_bind->wr_id; + sqp->opcode = T3_BIND_MW; + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED); + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + 
*ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + union t3_wr *wqe; + struct terminate_message *term; + int status; + int tagged = 0; + struct sk_buff *skb; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + skb = alloc_skb(40, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (union t3_wr *)skb_put(skb, 40); + memset(wqe, 0, 40); + wqe->send.rdmaop = T3_TERMINATE; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + if (rsp_msg) { + status = CQE_STATUS(rsp_msg->cqe); + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + } else { + status = TPT_ERR_INTERNAL_ERR; + } + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, + qhp->ep->hwtid, 5); + skb->priority = CPL_PRIORITY_DATA; + return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb)); +} + +/* + * Assumes qhp lock is held. + */ +static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + struct iwch_cq *rchp, *schp; + int count; + + rchp = get_chp(qhp->rhp, qhp->attr.rcq); + schp = get_chp(qhp->rhp, qhp->attr.scq); + + PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp); + /* take a ref on the qhp since we must release the lock */ + atomic_inc(&qhp->refcnt); + spin_unlock_irqrestore(&qhp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. */ + spin_lock_irqsave(&rchp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&rchp->cq); + cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); + cxio_flush_rq(&qhp->wq, &rchp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&rchp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. 
*/ + spin_lock_irqsave(&schp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&schp->cq); + cxio_count_scqes(&schp->cq, &qhp->wq, &count); + cxio_flush_sq(&qhp->wq, &schp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&schp->lock, *flag); + + /* deref */ + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); + + spin_lock_irqsave(&qhp->lock, *flag); +} + +static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + if (t3b_device(qhp->rhp)) + cxio_set_wq_in_error(&qhp->wq); + else + __flush_qp(qhp, flag); +} + + +/* + * Return non zero if at least one RECV was pre-posted. + */ +static inline int rqes_posted(struct iwch_qp *qhp) +{ + return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV); +} + +static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs) +{ + struct t3_rdma_init_attr init_attr; + int ret; + + init_attr.tid = qhp->ep->hwtid; + init_attr.qpid = qhp->wq.qpid; + init_attr.pdid = qhp->attr.pd; + init_attr.scqid = qhp->attr.scq; + init_attr.rcqid = qhp->attr.rcq; + init_attr.rq_addr = qhp->wq.rq_addr; + init_attr.rq_size = 1 << qhp->wq.rq_size_log2; + init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | + qhp->attr.mpa_attr.recv_marker_enabled | + (qhp->attr.mpa_attr.xmit_marker_enabled << 1) | + (qhp->attr.mpa_attr.crc_enabled << 2); + + /* + * XXX - The IWCM doesn't quite handle getting these + * attrs set before going into RTS. For now, just turn + * them on always... 
+ */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0; + PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d " + "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, + init_attr.rq_addr, init_attr.rq_size, + init_attr.flags, init_attr.qpcaps); + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + unsigned long flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, + qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? 
attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. 
+ */ + get_ep(&qhp->ep->com); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + put_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + __FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); 
+ + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); + flush_qp(qhp, &flag); +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + put_ep(&ep->com); + + PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) { + resume_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} From swise at opengridcomputing.com Thu Dec 14 05:55:36 2006 From: swise at opengridcomputing.com (Steve 
Wise) Date: Thu, 14 Dec 2006 07:55:36 -0600 Subject: [openib-general] [PATCH v4 06/13] Completion Queues In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135536.21159.74057.stgit@dell3.ogc.int> Functions to manipulate CQs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cq.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..9d82df4 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. + * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (t3a_device(chp->rhp) && credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + wc->vendor_err = CQE_STATUS(cqe); + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x " + "lo 0x%x cookie 0x%llx\n", __FUNCTION__, + CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case 
T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + printk(KERN_ERR MOD "Unexpected opcode %d " + "in the CQE received for QPID=0x%0x\n", + CQE_OPCODE(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + goto out; + } + } + + if (cqe_flushed) + wc->status = IB_WC_WR_FLUSH_ERR; + else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + case TPT_ERR_SWFLUSH: + wc->status = IB_WC_WR_FLUSH_ERR; + break; + default: + printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for " + "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * 
Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. + */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} From swise at opengridcomputing.com Thu Dec 14 05:56:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:56:06 -0600 Subject: [openib-general] [PATCH v4 07/13] Async Event Handler In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135606.21159.29525.stgit@dell3.ogc.int> Code to handle async events coming from the T3 RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..b0bd014 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe), + CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + if ((qhp->attr.state == IWCH_QP_STATE_ERROR) || + (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) { + PDBG("%s AE received after RTS - " + "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, + qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_TERMINATE; + iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (send_term) + iwch_post_terminate(qhp, rsp_msg); + } + + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t 
*) skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + u32 cqid = RSPQ_CQID(rsp_msg); + + rnicp = (struct iwch_dev *) rdev_p->ulp; + spin_lock(&rnicp->lock); + chp = get_chp(rnicp, cqid); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + if (!chp || !qhp) { + printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d " + "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", + cqid, CQE_QPID(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), + CQE_WRID_LOW(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + /* + * 1) completion of our sending a TERMINATE. + * 2) incoming TERMINATE message. + */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s QPID 0x%x ep %p disconnecting\n", + __FUNCTION__, qhp->wq.qpid, qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, + qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", + CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Thu Dec 14 05:56:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:56:37 -0600 Subject: [openib-general] [PATCH v4 08/13] Memory Registration 
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135636.21159.34359.stgit@dell3.ogc.int> Functions to register memory regions. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 170 ++++++++++++++++++++++++++++++++ 1 files changed, 170 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..774d11e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list) +{ + u32 stag; + u32 mmid; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages) +{ + u32 stag; + u32 mmid; + + + /* We could support this... 
*/ + if (npages > mhp->attr.pbl_size) + return -ENOMEM; + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ + for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) + return -EINVAL; + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) + return -ENOMEM; + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << 
*shift)); + + PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + + return 0; + +} From swise at opengridcomputing.com Thu Dec 14 05:57:07 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:57:07 -0600 Subject: [openib-general] [PATCH v4 09/13] Core WQE/CQE Types In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135707.21159.1944.stgit@dell3.ogc.int> T3 WQE and CQE structures, defines, etc... Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 685 ++++++++++++++++++++++++++++ 1 files changed, 685 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..45870be --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,685 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + __be32 stag; + __be32 len; + __be64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be32 plen; /* 3 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 stag; /* 2 */ + __be32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh wrh; 
/* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 stag_sink; + __be64 to_sink; /* 3 */ + __be32 plen; /* 4 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be64 rem_to; /* 3 */ + __be32 local_stag; /* 4 */ + __be32 local_len; + __be64 local_to; /* 5 */ +}; + +enum t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u16 reserved; /* 2 */ + u8 type; + u8 perms; + __be32 mr_stag; + __be32 mw_stag; /* 3 */ + __be32 mw_len; + __be64 mw_va; /* 4 */ + __be32 mr_pbl_addr; /* 5 */ + u8 reserved2[3]; + u8 mr_pagesz; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + __be32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + __be32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 flags; /* 2 */ + __be32 quiesce; /* 2 */ + __be32 max_ird; /* 3 */ + __be32 max_ord; /* 3 */ + __be64 sge_cmd; /* 4 */ + __be64 ctx1; /* 5 */ + __be64 ctx0; /* 6 */ +}; + +enum t3_modify_qp_flags { + MODQP_QUIESCE = 0x01, + MODQP_MAX_IRD = 0x02, + MODQP_MAX_ORD = 0x04, + MODQP_WRITE_EC = 0x08, + MODQP_READ_EC = 0x10, +}; + + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + uP_RI_QP_RDMA_READ_ENABLE = 0x01, 
+ uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u32 flags; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 qpid; /* 2 */ + __be32 pdid; + __be32 scqid; /* 3 */ + __be32 rcqid; + __be32 rq_addr; /* 4 */ + __be32 rq_size; + u8 mpaattrs; /* 5 */ + u8 qpcaps; + __be16 ulpdu_size; + __be32 flags; /* bits 31-1 - reserved */ + /* bit 0 - set if RECV posted */ + __be32 ord; /* 6 */ + __be32 ird; + __be64 qp_dma_addr; /* 7 */ + __be32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +struct t3_genbit { + u64 flit[15]; + __be64 genbit; +}; + +enum rdma_init_wr_flags { + RECVS_POSTED = 1, +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + struct t3_genbit genbit; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) +{ + return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags)); +} + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit...
*/ + ((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + __be32 valid_stag_pdid; + __be32 flags_pagesize_qpid; + + __be32 rsvd_pbl_addr; + __be32 len; + __be32 va_hi; + __be32 va_low_or_fbo; + + __be32 rsvd_bind_cnt_or_pstag; + __be32 rsvd_pbl_size; +}; + +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 +#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << 
S_TPT_MW_BIND_ENABLE) +#define F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + __be32 header; + __be32 len; + union { + struct { + __be32 stag; + __be32 msn; + } rcqe; + struct { + u32 wrid_hi; + u32 wrid_low; + } scqe; + } u; +}; + +#define S_CQE_OOO 31 +#define M_CQE_OOO 0x1 +#define G_CQE_OOO(x) ((((x) >> S_CQE_OOO)) & M_CQE_OOO) +#define V_CEQ_OOO(x) ((x)<> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<queue->flit[13] = 1; +} + +static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if 
(!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + return NULL; +} + +static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +#endif From swise at opengridcomputing.com Thu Dec 14 05:57:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:57:37 -0600 Subject: [openib-general] [PATCH v4 10/13] Core HAL In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135737.21159.98294.stgit@dell3.ogc.int> The RDMA Core interfaces with the T3 HW and ULLD providing a low level RDMA interface. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_hal.h | 201 ++++ 2 files changed, 1503 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c new file mode 100644 index 0000000..ffc4ec0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -0,0 +1,1302 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include + +#include +#include +#include +#include + +#include "cxio_resource.h" +#include "cxio_hal.h" +#include "cxgb3_offload.h" +#include "sge_defs.h" + +static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC]; +static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL; + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (!strcmp(rdev_tbl[i]->dev_name, dev_name)) + return rdev_tbl[i]; + return NULL; +} + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev + *tdev) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (rdev_tbl[i]->t3cdev_p == tdev) + return rdev_tbl[i]; + return NULL; +} + +static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (!rdev_tbl[i]) { + rdev_tbl[i] = rdev_p; + break; + } + return (i == T3_MAX_NUM_RNIC); +} + +static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i] == rdev_p) { + rdev_tbl[i] = NULL; + break; + } +} + +int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit) +{ + int ret; + struct t3_cqe *cqe; + u32 rptr; + + struct rdma_cq_op setup; + setup.id = cq->cqid; + setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0; + setup.op = op; + ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup); + + if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) + return ret; + + /* + * If the rearm returned an index other than our current index, + * then there might be CQE's in flight (being DMA'd). We must wait + * here for them to complete or the consumer can miss a notification. + */ + if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) { + int i=0; + + rptr = cq->rptr; + + /* + * Keep the generation correct by bumping rptr until it + * matches the index returned by the rearm - 1. 
+ */ + while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret) + rptr++; + + /* + * Now rptr is the index for the (last) cqe that was + * in-flight at the time the HW rearmed the CQ. We + * spin until that CQE is valid. + */ + cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2); + while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) { + udelay(1); + if (i++ > 1000000) { + BUG_ON(1); + printk(KERN_ERR "%s: stalled rnic\n", + rdev_p->dev_name); + return -EIO; + } + } + } + return 0; +} + +static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid) +{ + struct rdma_cq_setup setup; + setup.id = cqid; + setup.base_addr = 0; /* NULL address */ + setup.size = 0; /* disable the CQ */ + setup.credits = 0; + setup.credit_thres = 0; + setup.ovfl_mode = 0; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) +{ + u64 sge_cmd; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = qpid << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe); + + cq->cqid = cxio_hal_get_cqid(rdev_p->rscp); + if (!cq->cqid) + return -ENOMEM; + cq->sw_queue = kzalloc(size, GFP_KERNEL); + if (!cq->sw_queue) + return -ENOMEM; + cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) * + sizeof(struct t3_cqe), + &(cq->dma_addr), GFP_KERNEL); + if (!cq->queue) { + kfree(cq->sw_queue); + return
-ENOMEM; + } + pci_unmap_addr_set(cq, mapping, cq->dma_addr); + memset(cq->queue, 0, size); + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = 65535; + setup.credit_thres = 1; + if (rdev_p->t3cdev_p->type == T3B) + setup.ovfl_mode = 0; + else + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = setup.size; + setup.credit_thres = setup.size; /* TBD: overflow recovery */ + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + u32 qpid; + int i; + + mutex_lock(&uctx->lock); + if (!list_empty(&uctx->qpids)) { + entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, + entry); + list_del(&entry->entry); + qpid = entry->qpid; + kfree(entry); + } else { + qpid = cxio_hal_get_qpid(rdev_p->rscp); + if (!qpid) + goto out; + for (i = qpid+1; i & rdev_p->qpmask; i++) { + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + break; + entry->qpid = i; + list_add_tail(&entry->entry, &uctx->qpids); + } + } +out: + mutex_unlock(&uctx->lock); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, + struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + return; + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + entry->qpid = qpid; + mutex_lock(&uctx->lock); + list_add_tail(&entry->entry, &uctx->qpids); + mutex_unlock(&uctx->lock); +} + +void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct list_head *pos, *nxt; + 
struct cxio_qpid_list *entry; + + mutex_lock(&uctx->lock); + list_for_each_safe(pos, nxt, &uctx->qpids) { + entry = list_entry(pos, struct cxio_qpid_list, entry); + list_del_init(&entry->entry); + if (!(entry->qpid & rdev_p->qpmask)) + cxio_hal_put_qpid(rdev_p->rscp, entry->qpid); + kfree(entry); + } + mutex_unlock(&uctx->lock); +} + +void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + INIT_LIST_HEAD(&uctx->qpids); + mutex_init(&uctx->lock); +} + +int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain, + struct t3_wq *wq, struct cxio_ucontext *uctx) +{ + int depth = 1UL << wq->size_log2; + int rqsize = 1UL << wq->rq_size_log2; + + wq->qpid = get_qpid(rdev_p, uctx); + if (!wq->qpid) + return -ENOMEM; + + wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL); + if (!wq->rq) + goto err1; + + wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize); + if (!wq->rq_addr) + goto err2; + + wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL); + if (!wq->sq) + goto err3; + + wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + depth * sizeof(union t3_wr), + &(wq->dma_addr), GFP_KERNEL); + if (!wq->queue) + goto err4; + + memset(wq->queue, 0, depth * sizeof(union t3_wr)); + pci_unmap_addr_set(wq, mapping, wq->dma_addr); + wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + if (!kernel_domain) + wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + + (wq->qpid << rdev_p->qpshift); + PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, + wq->qpid, wq->doorbell, wq->udb); + return 0; +err4: + kfree(wq->sq); +err3: + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize); +err2: + kfree(wq->rq); +err1: + put_qpid(rdev_p, wq->qpid, uctx); + return -ENOMEM; +} + +int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + int err; + err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid); + kfree(cq->sw_queue); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) + * sizeof(struct t3_cqe), 
cq->queue, + pci_unmap_addr(cq, mapping)); + cxio_hal_put_cqid(rdev_p->rscp, cq->cqid); + return err; +} + +int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct cxio_ucontext *uctx) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (wq->size_log2)) + * sizeof(union t3_wr), wq->queue, + pci_unmap_addr(wq, mapping)); + kfree(wq->sq); + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2)); + kfree(wq->rq); + put_qpid(rdev_p, wq->qpid, uctx); + return 0; +} + +static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(T3_SEND) | + V_CQE_TYPE(0) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +{ + u32 ptr; + + PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq); + + /* flush RQ */ + PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, + wq->rq_rptr, wq->rq_wptr, count); + ptr = wq->rq_rptr + count; + while (ptr++ != wq->rq_wptr) + insert_recv_cqe(wq, cq); +} + +static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, + struct t3_swsq *sqp) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(sqp->opcode) | + V_CQE_TYPE(1) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + cqe.u.scqe.wrid_hi = sqp->sq_wptr; + + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_sq(struct t3_wq *wq, 
struct t3_cq *cq, int count) +{ + __u32 ptr; + struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); + + ptr = wq->sq_rptr + count; + sqp += count; + while (ptr != wq->sq_wptr) { + insert_sq_cqe(wq, cq, sqp); + sqp++; + ptr++; + } +} + +/* + * Move all CQEs from the HWCQ into the SWCQ. + */ +void cxio_flush_hw_cq(struct t3_cq *cq) +{ + struct t3_cqe *cqe, *swcqe; + + PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid); + cqe = cxio_next_hw_cqe(cq); + while (cqe) { + PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", + __FUNCTION__, cq->rptr, cq->sw_wptr); + swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2); + *swcqe = *cqe; + swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1)); + cq->sw_wptr++; + cq->rptr++; + cqe = cxio_next_hw_cqe(cq); + } +} + +static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq) +{ + if (CQE_OPCODE(*cqe) == T3_TERMINATE) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) + return 0; + + return 1; +} + +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && + (CQE_QPID(*cqe) == wq->qpid)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + PDBG("%s count zero %d\n", __FUNCTION__, *count); + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && + (CQE_QPID(*cqe) == wq->qpid) && 
cqe_completes_wr(cqe, wq)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p) +{ + struct rdma_cq_setup setup; + setup.id = 0; + setup.base_addr = 0; /* NULL address */ + setup.size = 1; /* enable the CQ */ + setup.credits = 0; + + /* force SGE to redirect to RspQ and interrupt */ + setup.credit_thres = 0; + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) +{ + int err; + u64 sge_cmd, ctx0, ctx1; + u64 base_addr; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + + + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + err = cxio_hal_init_ctrl_cq(rdev_p); + if (err) { + PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err); + return err; + } + rdev_p->ctrl_qp.workq = dma_alloc_coherent( + &(rdev_p->rnic_info.pdev->dev), + (1 << T3_CTRL_QP_SIZE_LOG2) * + sizeof(union t3_wr), + &(rdev_p->ctrl_qp.dma_addr), + GFP_KERNEL); + if (!rdev_p->ctrl_qp.workq) { + PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__); + return -ENOMEM; + } + pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, + rdev_p->ctrl_qp.dma_addr); + rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + memset(rdev_p->ctrl_qp.workq, 0, + (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr)); + + init_MUTEX(&rdev_p->ctrl_qp.sem); + init_waitqueue_head(&rdev_p->ctrl_qp.waitq); + + /* update HW Ctrl QP context */ + base_addr = rdev_p->ctrl_qp.dma_addr; + base_addr >>= 12; + ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) | + V_EC_BASE_LO((u32) base_addr & 0xffff)); + ctx0 <<= 32; + ctx0 |= V_EC_CREDITS(FW_WR_NUM); + base_addr >>= 16; + ctx1 = (u32) base_addr; + base_addr >>= 32; + ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) | + V_EC_TYPE(0) | V_EC_GEN(1) | + V_EC_UP_TOKEN(T3_CTL_QP_TID) | 
F_EC_VALID)) << 32; + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1, + T3_CTL_QP_TID, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + wqe->ctx1 = cpu_to_be64(ctx1); + wqe->ctx0 = cpu_to_be64(ctx0); + PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n", + (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, + 1 << T3_CTRL_QP_SIZE_LOG2); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << T3_CTRL_QP_SIZE_LOG2) + * sizeof(union t3_wr), rdev_p->ctrl_qp.workq, + pci_unmap_addr(&rdev_p->ctrl_qp, mapping)); + return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID); +} + +/* write len bytes of data into addr (32B aligned address) + * If data is NULL, clear len bytes of memory to zero. + * caller acquires the sem before the call + */ +static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, + u32 len, void *data, int completion) +{ + u32 i, nr_wqe, copy_len; + u8 *copy_data; + u8 wr_len, utx_len; /* length in 8 byte flits */ + enum t3_wr_flags flag; + __be64 *wqe; + u64 utx_cmd; + addr &= 0x7FFFFFF; + nr_wqe = len % 96 ?
len / 96 + 1 : len / 96; /* 96B max per WQE */ + PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n", + __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, + nr_wqe, data, addr); + utx_len = 3; /* in 32B unit */ + for (i = 0; i < nr_wqe; i++) { + if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2)) { + PDBG("%s ctrl_qp full wtpr 0x%0x rptr 0x%0x, " + "wait for more space i %d\n", __FUNCTION__, + rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i); + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + !Q_FULL(rdev_p->ctrl_qp.rptr, + rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2))) { + PDBG("%s ctrl_qp workq interrupted\n", + __FUNCTION__); + return -ERESTARTSYS; + } + PDBG("%s ctrl_qp wakeup, continue posting work request " + "i %d\n", __FUNCTION__, i); + } + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + flag = 0; + if (i == (nr_wqe - 1)) { + /* last WQE */ + flag = completion ? T3_COMPLETION_FLAG : 0; + if (len % 32) + utx_len = len / 32 + 1; + else + utx_len = len / 32; + } + + /* + * Force a CQE to return the credit to the workq in case + * we posted more than half the max QP size of WRs + */ + if ((i != 0) && + (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) { + flag = T3_COMPLETION_FLAG; + PDBG("%s force completion at i %d\n", __FUNCTION__, i); + } + + /* build the utx mem command */ + wqe += (sizeof(struct t3_bypass_wr) >> 3); + utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3); + utx_cmd <<= 32; + utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1); + *wqe = cpu_to_be64(utx_cmd); + wqe++; + copy_data = (u8 *) data + i * 96; + copy_len = len > 96 ? 
96 : len; + + /* clear memory content if data is NULL */ + if (data) + memcpy(wqe, copy_data, copy_len); + else + memset(wqe, 0, copy_len); + if (copy_len % 32) + memset(((u8 *) wqe) + copy_len, 0, + 32 - (copy_len % 32)); + wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + + (utx_len << 2); + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + + /* wptr in the WRID[31:0] */ + ((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr; + + /* + * This must be the last write with a memory barrier + * for the genbit + */ + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, + Q_GENBIT(rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, + wr_len); + if (flag == T3_COMPLETION_FLAG) + ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); + len -= 96; + rdev_p->ctrl_qp.wptr++; + } + return 0; +} + +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size + * OUT: stag index, actual pbl_size, pbl_addr allocated. + * TBD: shared memory region support + */ +static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, + u32 *stag, u8 stag_state, u32 pdid, + enum tpt_mem_type type, enum tpt_mem_perm perm, + u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl, + u32 *pbl_size, u32 *pbl_addr) +{ + int err; + struct tpt_entry tpt; + u32 stag_idx; + u32 wptr; + int rereg = (*stag != T3_STAG_UNSET); + + stag_state = stag_state > 0; + stag_idx = (*stag) >> 8; + + if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) { + stag_idx = cxio_hal_get_stag(rdev_p->rscp); + if (!stag_idx) + return -ENOMEM; + *stag = (stag_idx << 8) | ((*stag) & 0xFF); + } + PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", + __FUNCTION__, stag_state, type, pdid, stag_idx); + + if (reset_tpt_entry) + cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3); + else if (!rereg) { + *pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3); + if (!*pbl_addr) { + return -ENOMEM; + } + } + + 
down_interruptible(&rdev_p->ctrl_qp.sem); + + /* write PBL first if any - update pbl only if pbl list exists */ + if (pbl) { + + PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", + __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, + *pbl_size); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + (*pbl_addr >> 5), + (*pbl_size << 3), pbl, 0); + if (err) + goto ret; + } + + /* write TPT entry */ + if (reset_tpt_entry) + memset(&tpt, 0, sizeof(tpt)); + else { + tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID | + V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) | + V_TPT_STAG_STATE(stag_state) | + V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid)); + BUG_ON(page_size >= 28); + tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | + F_TPT_MW_BIND_ENABLE | + V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | + V_TPT_PAGE_SIZE(page_size)); + tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3)); + tpt.len = cpu_to_be32(len); + tpt.va_hi = cpu_to_be32((u32) (to >> 32)); + tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); + tpt.rsvd_bind_cnt_or_pstag = 0; + tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)); + } + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + stag_idx + + (rdev_p->rnic_info.tpt_base >> 5), + sizeof(tpt), &tpt, 1); + + /* release the stag index to free pool */ + if (reset_tpt_entry) + cxio_hal_put_stag(rdev_p->rscp, stag_idx); +ret: + wptr = rdev_p->ctrl_qp.wptr; + up(&rdev_p->ctrl_qp.sem); + if (!err) + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + return err; +} + +/* IN : stag key, pdid, pbl_size + * Out: stag index, actual pbl_size, and pbl_addr allocated.
+ */ +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr)); +} + +int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, + u32 pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + &pbl_size, &pbl_addr); +} + +int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + u32 pbl_size = 0; + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, + NULL, &pbl_size, NULL); +} + +int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) +{ + struct t3_rdma_init_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p); + wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe)); + wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT)); + wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) | + 
V_FW_RIWR_LEN(sizeof(*wqe) >> 3)); + wqe->wrid.id1 = 0; + wqe->qpid = cpu_to_be32(attr->qpid); + wqe->pdid = cpu_to_be32(attr->pdid); + wqe->scqid = cpu_to_be32(attr->scqid); + wqe->rcqid = cpu_to_be32(attr->rcqid); + wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base); + wqe->rq_size = cpu_to_be32(attr->rq_size); + wqe->mpaattrs = attr->mpaattrs; + wqe->qpcaps = attr->qpcaps; + wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss); + wqe->flags = cpu_to_be32(attr->flags); + wqe->ord = cpu_to_be32(attr->ord); + wqe->ird = cpu_to_be32(attr->ird); + wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); + wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); + wqe->rsvd = 0; + skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = ev_cb; +} + +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = NULL; +} + +static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb) +{ + static int cnt; + struct cxio_rdev *rdev_p = NULL; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x" + " se %0x notify %0x cqbranch %0x creditth %0x\n", + cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg), + RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg), + RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg), + RSPQ_CREDIT_THRESH(rsp_msg)); + PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d " + "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + rdev_p = (struct cxio_rdev *)t3cdev_p->ulp; + if (!rdev_p) { + PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__, + t3cdev_p); + 
return 0; + } + if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) { + rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1; + wake_up_interruptible(&rdev_p->ctrl_qp.waitq); + dev_kfree_skb_irq(skb); + } else if (CQE_QPID(rsp_msg->cqe) == 0xfff8) + dev_kfree_skb_irq(skb); + else if (cxio_ev_cb) + (*cxio_ev_cb) (rdev_p, skb); + else + dev_kfree_skb_irq(skb); + cnt++; + return 0; +} + +/* Caller takes care of locking if needed */ +int cxio_rdev_open(struct cxio_rdev *rdev_p) +{ + struct net_device *netdev_p = NULL; + int err = 0; + if (strlen(rdev_p->dev_name)) { + if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) { + return -EBUSY; + } + netdev_p = dev_get_by_name(rdev_p->dev_name); + if (!netdev_p) { + return -EINVAL; + } + dev_put(netdev_p); + } else if (rdev_p->t3cdev_p) { + if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) { + return -EBUSY; + } + netdev_p = rdev_p->t3cdev_p->lldev; + strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name, + T3_MAX_DEV_NAME_LEN); + } else { + PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__); + return -EINVAL; + } + + if (cxio_hal_add_rdev(rdev_p)) + return -ENOMEM; + + PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name); + memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp)); + if (!rdev_p->t3cdev_p) + rdev_p->t3cdev_p = T3CDEV(netdev_p); + rdev_p->t3cdev_p->ulp = (void *) rdev_p; + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS, + &(rdev_p->rnic_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS, + &(rdev_p->port_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + + /* + * qpshift is the number of bits to shift the qpid left in order + * to get the correct address of the doorbell for that qp. 
+ */ + cxio_init_ucontext(rdev_p, &rdev_p->uctx); + rdev_p->qpshift = PAGE_SHIFT - + ilog2(65536 >> + ilog2(rdev_p->rnic_info.udbell_len >> + PAGE_SHIFT)); + rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT; + rdev_p->qpmask = (65536 >> ilog2(rdev_p->qpnr)) - 1; + PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d " + "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", + __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, + rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), + rdev_p->rnic_info.pbl_base, + rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base, + rdev_p->rnic_info.rqt_top); + PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu " + "qpnr %d qpmask 0x%x\n", + rdev_p->rnic_info.udbell_len, + rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr, + rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask); + + err = cxio_hal_init_ctrl_qp(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", + __FUNCTION__, err); + goto err1; + } + err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0, + 0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ, + T3_MAX_NUM_PD); + if (err) { + printk(KERN_ERR "%s error %d initializing hal resources.\n", + __FUNCTION__, err); + goto err2; + } + err = cxio_hal_pblpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing pbl mem pool.\n", + __FUNCTION__, err); + goto err3; + } + err = cxio_hal_rqtpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing rqt mem pool.\n", + __FUNCTION__, err); + goto err4; + } + return 0; +err4: + cxio_hal_pblpool_destroy(rdev_p); +err3: + cxio_hal_destroy_resource(rdev_p->rscp); +err2: + cxio_hal_destroy_ctrl_qp(rdev_p); +err1: + cxio_hal_delete_rdev(rdev_p); + return err; +} + +void cxio_rdev_close(struct cxio_rdev *rdev_p) +{ + if (rdev_p) { + cxio_hal_pblpool_destroy(rdev_p); + cxio_hal_rqtpool_destroy(rdev_p); + cxio_hal_delete_rdev(rdev_p); + rdev_p->t3cdev_p->ulp = NULL; + 
cxio_hal_destroy_ctrl_qp(rdev_p); + cxio_hal_destroy_resource(rdev_p->rscp); + } +} + +int __init cxio_hal_init(void) +{ + if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI)) + return -ENOMEM; + memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *)); + t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler); + return 0; +} + +void __exit cxio_hal_exit(void) +{ + int i; + t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL); + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + cxio_rdev_close(rdev_tbl[i]); + cxio_hal_destroy_rhdl_resource(); +} + +static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_swsq *sqp; + __u32 ptr = wq->sq_rptr; + int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr); + + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + while (count--) + if (!sqp->signaled) { + ptr++; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + } else if (sqp->complete) { + + /* + * Insert this completed cqe into the swcq. + */ + PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n", + __FUNCTION__, Q_PTR2IDX(ptr, wq->sq_size_log2), + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)); + sqp->cqe.header |= htonl(V_CQE_SWCQE(1)); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) + = sqp->cqe; + cq->sw_wptr++; + sqp->signaled = 0; + break; + } else + break; +} + +static inline void create_read_req_cqe(struct t3_wq *wq, + struct t3_cqe *hw_cqe, + struct t3_cqe *read_cqe) +{ + read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr; + read_cqe->len = wq->oldest_read->read_len; + read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) | + V_CQE_SWCQE(SW_CQE(*hw_cqe)) | + V_CQE_OPCODE(T3_READ_REQ) | + V_CQE_TYPE(1)); +} + +/* + * Return a ptr to the next read wr in the SWSQ or NULL. 
+ */ +static inline void advance_oldest_read(struct t3_wq *wq) +{ + + u32 rptr = wq->oldest_read - wq->sq + 1; + u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2); + + while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) { + wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2); + + if (wq->oldest_read->opcode == T3_READ_REQ) + return; + rptr++; + } + wq->oldest_read = NULL; +} + +/* + * cxio_poll_cq + * + * Caller must: + * check the validity of the first CQE, + * supply the wq associated with the qpid. + * + * credit: cq credit to return to sge. + * cqe_flushed: 1 iff the CQE is flushed. + * cqe: copy of the polled CQE. + * + * return value: + * 0 CQE returned, + * -1 CQE skipped, try again. + */ +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit) +{ + int ret = 0; + struct t3_cqe *hw_cqe, read_cqe; + + *cqe_flushed = 0; + *credit = 0; + hw_cqe = cxio_next_cqe(cq); + + PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x" + " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), + CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), + CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), + CQE_WRID_LOW(*hw_cqe)); + + /* + * skip cqe's not affiliated with a QP. + */ + if (wq == NULL) { + ret = -1; + goto skip_cqe; + } + + /* + * Gotta tweak READ completions: + * 1) the cqe doesn't contain the sq_wptr from the wr. + * 2) opcode not reflected from the wr. + * 3) read_len not reflected from the wr. + * 4) cq_type is RQ_TYPE not SQ_TYPE. + */ + if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) { + + /* + * Don't write to the HWCQ, so create a new read req CQE + * in local memory. + */ + create_read_req_cqe(wq, hw_cqe, &read_cqe); + hw_cqe = &read_cqe; + advance_oldest_read(wq); + } + + /* + * T3A: Discard TERMINATE CQEs. 
+ */ + if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) { + ret = -1; + wq->error = 1; + goto skip_cqe; + } + + if (CQE_STATUS(*hw_cqe) || wq->error) { + *cqe_flushed = wq->error; + wq->error = 1; + + /* + * T3A inserts errors into the CQE. We cannot return + * these as work completions. + */ + /* incoming write failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) + && RQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + /* incoming read request failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + + /* incoming SEND with no receive posted failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + ret = -1; + goto skip_cqe; + } + goto proc_cqe; + } + + /* + * RECV completion. + */ + if (RQ_TYPE(*hw_cqe)) { + + /* + * HW only validates 4 bits of MSN. So we must validate that + * the MSN in the SEND is the next expected MSN. If it's not, + * then we complete this with TPT_ERR_MSN and mark the wq in + * error. + */ + if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) { + wq->error = 1; + hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN)); + goto proc_cqe; + } + goto proc_cqe; + } + + /* + * If we get here it's a send completion. + * + * Handle out of order completion. These get stuffed + * in the SW SQ. Then the SW SQ is walked to move any + * now in-order completions into the SW CQ. This handles + * 2 cases: + * 1) reaping unsignaled WRs when the first subsequent + * signaled WR is completed. + * 2) out of order read completions. 
+ */ + if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) { + struct t3_swsq *sqp; + + PDBG("%s out of order completion going in swsq at idx %ld\n", + __FUNCTION__, + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2)); + sqp = wq->sq + + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2); + sqp->cqe = *hw_cqe; + sqp->complete = 1; + ret = -1; + goto flush_wq; + } + +proc_cqe: + *cqe = *hw_cqe; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*hw_cqe)) { + wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe); + PDBG("%s completing sq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2)); + *cookie = (wq->sq + + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id; + wq->sq_rptr++; + } else { + PDBG("%s completing rq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + *cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + wq->rq_rptr++; + } + +flush_wq: + /* + * Flush any completed cqes that are now in-order. + */ + flush_completed_wrs(wq, cq); + +skip_cqe: + if (SW_CQE(*hw_cqe)) { + PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->sw_rptr); + ++cq->sw_rptr; + } else { + PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->rptr); + ++cq->rptr; + + /* + * T3A: compute credits. + */ + if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + return ret; +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h new file mode 100644 index 0000000..bde5cfb --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_HAL_H__ +#define __CXIO_HAL_H__ + +#include +#include + +#include "t3_cpl.h" +#include "t3cdev.h" +#include "cxgb3_ctl_defs.h" +#include "cxio_wr.h" + +#define T3_CTRL_QP_ID FW_RI_SGEEC_START +#define T3_CTL_QP_TID FW_RI_TID_START +#define T3_CTRL_QP_SIZE_LOG2 8 +#define T3_CTRL_CQ_ID 0 + +/* TBD */ +#define T3_MAX_NUM_RNIC 8 +#define T3_MAX_NUM_RI (1<<15) +#define T3_MAX_NUM_QP (1<<15) +#define T3_MAX_NUM_CQ (1<<15) +#define T3_MAX_NUM_PD (1<<15) +#define T3_MAX_PBL_SIZE 256 +#define T3_MAX_RQ_SIZE 1024 +#define T3_MAX_NUM_STAG (1<<15) + +#define T3_STAG_UNSET 0xffffffff + +#define T3_MAX_DEV_NAME_LEN 32 + +struct cxio_hal_ctrl_qp { + u32 wptr; + u32 rptr; + struct semaphore sem; /* for the wptr, can sleep */ + wait_queue_head_t waitq; /* wait for RspQ/CQE msg */ + union t3_wr *workq; /* the work request queue */ + dma_addr_t dma_addr; /* pci bus address of the workq */ + DECLARE_PCI_UNMAP_ADDR(mapping) + void __iomem *doorbell; +}; + +struct cxio_hal_resource { + struct kfifo *tpt_fifo; + spinlock_t tpt_fifo_lock; + struct kfifo *qpid_fifo; + spinlock_t qpid_fifo_lock; + struct kfifo *cqid_fifo; + spinlock_t cqid_fifo_lock; + struct kfifo *pdid_fifo; + spinlock_t pdid_fifo_lock; +}; + +struct cxio_qpid_list { + struct list_head entry; + u32 qpid; +}; + +struct cxio_ucontext { + struct list_head qpids; + struct mutex lock; +}; + +struct cxio_rdev { + char dev_name[T3_MAX_DEV_NAME_LEN]; + struct t3cdev *t3cdev_p; + struct rdma_info rnic_info; + struct adap_ports port_info; + struct cxio_hal_resource *rscp; + struct cxio_hal_ctrl_qp ctrl_qp; + void *ulp; + unsigned long qpshift; + u32 qpnr; + u32 qpmask; + struct cxio_ucontext uctx; + struct gen_pool *pbl_pool; + struct gen_pool *rqt_pool; +}; + +static inline int cxio_num_stags(struct cxio_rdev *rdev_p) +{ + return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5)); +} + +typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p, + 
struct sk_buff * skb); + +#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff) +#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff) +#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1) +#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1) +#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1) +#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1) +#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1) +#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1) +#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1) + +struct respQ_msg_t { + __be32 flags; /* flit 0 */ + __be32 cq_ptrid; + __be64 rsvd; /* flit 1 */ + struct t3_cqe cqe; /* flits 2-3 */ +}; + +enum t3_cq_opcode { + CQ_ARM_AN = 0x2, + CQ_ARM_SE = 0x6, + CQ_FORCE_AN = 0x3, + CQ_CREDIT_UPDATE = 0x7 +}; + +int cxio_rdev_open(struct cxio_rdev *rdev); +void cxio_rdev_close(struct cxio_rdev *rdev); +int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit); +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid); +int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr); +int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 
page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, + u32 pbl_addr); +int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); +int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +u32 cxio_hal_get_rhdl(void); +void cxio_hal_put_rhdl(u32 rhdl); +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); +int __init cxio_hal_init(void); +void __exit cxio_hal_exit(void); +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_flush_hw_cq(struct t3_cq *cq); +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit); + +#define MOD "iw_cxgb3: " +#define PDBG(fmt, args...) 
pr_debug(MOD fmt, ## args) + +#ifdef DEBUG +void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag); +void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift); +void cxio_dump_wqe(union t3_wr *wqe); +void cxio_dump_wce(struct t3_cqe *wce); +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents); +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid); +#endif + +#endif From swise at opengridcomputing.com Thu Dec 14 05:58:07 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:58:07 -0600 Subject: [openib-general] [PATCH v4 11/13] Core Resource Allocation In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135807.21159.36678.stgit@dell3.ogc.int> Core functions to carve up adapter memory, stag, qp, and cq IDs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 331 ++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_resource.h | 70 +++++ 2 files changed, 401 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..444df15 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,331 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_resource.h" +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + random_bytes = random32(); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = random32(); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p) +{ + u32 i; + + spin_lock_init(&rdev_p->rscp->qpid_fifo_lock); + + 
rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), + GFP_KERNEL, + &rdev_p->rscp->qpid_fifo_lock); + if (IS_ERR(rdev_p->rscp->qpid_fifo)) + return -ENOMEM; + + for (i = 16; i < T3_MAX_NUM_QP; i++) + if (!(i & rdev_p->qpmask)) + __kfifo_put(rdev_p->rscp->qpid_fifo, + (unsigned char *) &i, sizeof(u32)); + return 0; +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be power of 2 */ +int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) + return -ENOMEM; + rdev_p->rscp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_qpid_fifo(rdev_p); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo empty */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void cxio_hal_put_rhdl(u32 rhdl) +{ + 
cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} + +/* + * PBL Memory Manager. Uses Linux generic allocator. 
+ */ + +#define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ +#define PBL_CHUNK 2*1024*1024 + +u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size); + return (u32)addr; +} + +void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size); + gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size); +} + +int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); + if (rdev_p->pbl_pool) + for (i = rdev_p->rnic_info.pbl_base; + i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; + i += PBL_CHUNK) + gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); + return rdev_p->pbl_pool ? 0 : -ENOMEM; +} + +void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->pbl_pool); +} + +/* + * RQT Memory Manager. Uses Linux generic allocator. + */ + +#define MIN_RQT_SHIFT 10 /* 1KB == min RQT size (16 entries) */ +#define RQT_CHUNK 2*1024*1024 + +u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6); + return (u32)addr; +} + +void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6); + gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6); +} + +int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1); + if (rdev_p->rqt_pool) + for (i = rdev_p->rnic_info.rqt_base; + i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; + i += RQT_CHUNK) + gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1); + return rdev_p->rqt_pool ? 
0 : -ENOMEM; +} + +void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->rqt_pool); +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h new file mode 100644 index 0000000..a6bbe83 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_RESOURCE_H__ +#define __CXIO_RESOURCE_H__ + +#include +#include +#include +#include +#include +#include +#include +#include "cxio_hal.h" + +extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl); +extern void cxio_hal_destroy_rhdl_resource(void); +extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, + u32 nr_pdid); +extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag); +extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid); +extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid); +extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp); + +#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base ) +extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); + +#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base ) +extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); +#endif From swise at opengridcomputing.com Thu Dec 14 05:58:38 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:58:38 -0600 Subject: [openib-general] [PATCH v4 12/13] Core Debug functions In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135837.21159.29330.stgit@dell3.ogc.int> 
Debug code to dump various data structs, some of which are in adapter memory. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 205 +++++++++++++++++++++++++++ 1 files changed, 205 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..22f4f75 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,205 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = pbl_addr; + m->len = size; + PDBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + __be64 *data = (__be64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + if (size == 0) + size = 8; + while (size > 0) { + PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) 
+{ + __be64 *data = (__be64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +#endif From swise at opengridcomputing.com Thu Dec 14 05:59:08 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 07:59:08 -0600 Subject: [openib-general] [PATCH v4 13/13] Kconfig/Makefile In-Reply-To: 
<20061214135233.21159.78613.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> Message-ID: <20061214135908.21159.80049.stgit@dell3.ogc.int> Signed-off-by: Steve Wise --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/cxgb3/Kconfig | 27 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/Makefile | 12 ++++++++++++ 4 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 59b3932..06453ab 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" +source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index 570b30a..69bdd55 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mt obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ +obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig new file mode 100644 index 0000000..d3db264 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Kconfig @@ -0,0 +1,27 @@ +config INFINIBAND_CXGB3 + tristate "Chelsio RDMA Driver" + depends on CHELSIO_T3 && INFINIBAND + select GENERIC_ALLOCATOR + ---help--- + This is an iWARP/RDMA driver for the Chelsio T3 1GbE and + 10GbE adapters. + + For general information about Chelsio and our products, visit + our website at . 
+ + For customer support, please visit our customer support page at + . + + Please send feedback to . + + To compile this driver as a module, choose M here: the module + will be called iw_cxgb3. + +config INFINIBAND_CXGB3_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_CXGB3 + default n + ---help--- + This option causes the Chelsio RDMA driver to produce copious + amounts of debug messages. Select this if you are developing + the driver or trying to diagnose a problem. diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile new file mode 100644 index 0000000..7a89f6d --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Makefile @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core + +obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o + +iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \ + iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o + +ifdef CONFIG_INFINIBAND_CXGB3_DEBUG +EXTRA_CFLAGS += -DDEBUG -g +iw_cxgb3-y += core/cxio_dbg.o +endif From halr at voltaire.com Thu Dec 14 05:57:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2006 08:57:11 -0500 Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver In-Reply-To: <2875.47466.qm@web8317.mail.in.yahoo.com> References: <2875.47466.qm@web8317.mail.in.yahoo.com> Message-ID: <1166104604.28709.126501.camel@hal.voltaire.com> On Wed, 2006-12-13 at 22:49, keshetti mahesh wrote: > thanks for your reply, > > >The driver is needed to obtain the information for the IB node to > fill > >in the MADs for response to the SMA query. It may also issue some > traps. > >Similarly for PMA as well. > > Do u mean to say that HCA driver is needed to pass the HCA related > information (like GID, GUID, port_info etc..) to the SMA so that it > can reply to query(or GET ) MADs. Yes. > Isn't SMA capable of doing the same by using "query_(gid, pkey, > port)" verbs. 
One reason I can think of is that not all the needed information is available via verbs. I think there are some others as well. > And a final question: if it is really required to implement > 'process_mad' in the HCA driver, then why is it not specified in the IB > specifications? The IB spec is architecture, not implementation. > Whose duty is this (replying to query MADs) according to the IB > specs (it's the duty of the SMA, right?) Depends on the MAD but if you are referring to the SMA queries, then yes it is the SMA's responsibility. > I have observed that process_mad is not implemented in IBM's eHCA > driver. What is the case with it? With eHCA, QP0 is not exposed to the host (at least currently) and the SMA is totally implemented in firmware. > PS: I am considering only SMA in the host s/w here. This is a design choice. -- Hal > regards, > K.Mahesh. > > > > > Hal Rosenstock wrote: > On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote: > > Hello all, > > > > I want to know from you whether it is necessary to > implement the > > process_mad for a HCA. > > > > After looking into the implementations of process_mad in the > ipath and > > mthca drivers I have found that they are used to reply to > MADs with > > port_info, gid_info, sm_info etc. > > > > But isn't it handled by the SMA in the host...... > > The SMA can either be in the host or in firmware (as is > typical with the > Mellanox silicon). > > > I am a little bit confused now. > > Please just tell me whether it is required to implement process_mad > (suppose) > > for a new HCA driver.... > > It is. For an example of a host (software) SMA, see > drivers/infiniband/hw/ipath/ipath_mad.c > > > If it is required, why? > > The driver is needed to obtain the information for the IB node > to fill > in the MADs for response to the SMA query. It may also issue > some traps. > Similarly for PMA as well. > > -- Hal > > > Please CC your replies to me. > > > > regards, > > K.Mahesh.
> > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Thu Dec 14 06:31:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 06:31:09 -0800 Subject: [openib-general] (no subject) References: <457FB82B.4090902@voltaire.com> <45810901.3090209@voltaire.com> Message-ID: > mmm, I understand all the comments raised during the review were fixed > in the V3 post below, and now you say it's both wrong and ugly... for > example what's wrong here? I take back the statement that it is wrong; I misread the patch just now. But if you don't think the patch is ugly then I don't think we're looking at the same thing. For example > +static int __devinit mthca_check_profile_value(int* pval, int pval_default){ and so on... From philippe_bernadat at hp.com Thu Dec 14 06:39:32 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Thu, 14 Dec 2006 15:39:32 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05537F70@idaexc03.emea.cpqcorp.net> So I did try tune_pci=1. It didn't make any difference. I used the same nodes to compare the lspci output.
I could see: [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed 30c30 < Capabilities: [40] MSI-X: Enable+ Mask- TabSize=32 --- > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 38,39c38,39 < 40: 11 50 1f 80 00 20 08 00 00 22 08 00 00 00 00 00 < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 --- > 40: 11 50 1f 00 00 20 08 00 00 22 08 00 00 00 00 00 > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 So I added the ib_mthca msi_x=1 option. It didn't help. So the only remaining difference now is: [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed 39c39 < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 --- > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 No idea what this is. Philippe > -----Original Message----- > From: Bernadat, Philippe > Sent: Thursday, December 14, 2006 1:24 PM > To: Tziporet Koren > Cc: Eric Barton; Roland Dreier; Matt Leininger; > openib-general at openib.org > Subject: RE: [openib-general] Performance Degradation with > OFED v. Voltaire > > > > Have you tried running with > > > > options ib_mthca tune_pci=1 > > > > My understanding is that this is not required anymore with > OFED-1.1 - It used to make a significant difference with > OFED-1.0, but I didn't observe it with OFED-1.1 > > And again, the user mode performance is comparable between > VIB and OFED. > > Philippe > > > -----Original Message----- > > From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] > > Sent: Thursday, December 14, 2006 12:30 PM > > To: Bernadat, Philippe > > Cc: Eric Barton; Roland Dreier; Matt Leininger; > > openib-general at openib.org; Bernadat, Philippe > > Subject: Re: [openib-general] Performance Degradation with > > OFED v. Voltaire > > > > Philippe Bernadat wrote: > > > Roland, > > > > > > Attached are the two lspci outputs.
> > > > > > The only differences I see are: > > > > > > [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed > > > 1d0 > > > < pcilib: Resource 5 in > > /sys/bus/pci/devices/0000:00:1f.1/resource has > > > a 64-bit address, ignoring > > > 40c39 > > > < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 > > > --- > > > > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 > > > [philippe at hamish o2ib]$ > > > > > Have you tried running with > > > > options ib_mthca tune_pci =1 > > > > Tziporet > > > > From mst at mellanox.co.il Thu Dec 14 06:40:15 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 16:40:15 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: References: Message-ID: <20061214144015.GC27620@mellanox.co.il> > I was going to apply this, but then I realized that mthca is screwed > up on non-cache-coherent CPUs with memfree HCAs, and this patch makes > things much worse. The problem is that we allocate the MTT table with > alloc_pages() and then do pci_map_sg(). But there's no > pci_dma_sync_sg calls when the CPU tries to write directly to the MTT > table, and in fact not even that would work: since a > non-cache-coherent CPU can only work on cacheline-sized chunks there's > no safe way to touch the MTT table. > > What all that means is that FMRs are currently broken for memfree on > non-coherent CPUs. And this patch would break all memory > registration. I think the fix has to be to use dma_alloc_coherent() > to allocate the pages for the MTT table (and any other table allocated > in lowmem -- but I don't think there are any others). > > Unfortunately my PowerPC 440 system is being reworked right now so I > can't test this for a few days. > > I think this still can go into 2.6.20 after -rc1 if we can get this > fixed up. Just to clarify - do you plan to fix this, or are waiting for me to do it? 
-- MST From ogerlitz at voltaire.com Thu Dec 14 06:44:37 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 14 Dec 2006 16:44:37 +0200 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> Message-ID: <45816355.4010801@voltaire.com> Sean Hefty wrote: > Export the rdma cm interfaces to userspace. > +static ssize_t (*ucma_cmd_table[])(struct ucma_file *file, > + const char __user *inbuf, > + int in_len, int out_len) = { > + [RDMA_USER_CM_CMD_CREATE_ID] = ucma_create_id, > + [RDMA_USER_CM_CMD_DESTROY_ID] = ucma_destroy_id, > + [RDMA_USER_CM_CMD_BIND_ADDR] = ucma_bind_addr, > + [RDMA_USER_CM_CMD_RESOLVE_ADDR] = ucma_resolve_addr, > + [RDMA_USER_CM_CMD_RESOLVE_ROUTE]= ucma_resolve_route, > + [RDMA_USER_CM_CMD_QUERY_ROUTE] = ucma_query_route, > + [RDMA_USER_CM_CMD_CONNECT] = ucma_connect, > + [RDMA_USER_CM_CMD_LISTEN] = ucma_listen, > + [RDMA_USER_CM_CMD_ACCEPT] = ucma_accept, > + [RDMA_USER_CM_CMD_REJECT] = ucma_reject, > + [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect, > + [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr, > + [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event, > + [RDMA_USER_CM_CMD_GET_OPTION] = NULL, > + [RDMA_USER_CM_CMD_SET_OPTION] = NULL, > + [RDMA_USER_CM_CMD_NOTIFY] = ucma_notify, > +}; What about the rdma_cm_get_option() and rdma_cm_set_option() exposed by librdmacm? is it something which is on its way out? Or. From philippe_bernadat at hp.com Thu Dec 14 07:09:10 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Thu, 14 Dec 2006 16:09:10 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <458168A1.3090009@dev.mellanox.co.il> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538003@idaexc03.emea.cpqcorp.net> > Its not related to OFED 1.1 or OFED 1.0, but to difference > between OFED > and VAPI. 
> In VAPI this setting was always done. In OFED we do not do it > by default > and you need this parameter. > > Can you please try it. > > Tziporet > Did. I guess you are still processing your email :-), see the next emails. Philippe From tziporet at dev.mellanox.co.il Thu Dec 14 07:07:13 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 14 Dec 2006 17:07:13 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05537DAF@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E05537DAF@idaexc03.emea.cpqcorp.net> Message-ID: <458168A1.3090009@dev.mellanox.co.il> Bernadat, Philippe wrote: >> Have you tried running with >> >> options ib_mthca tune_pci=1 >> >> > > My understanding is that this is not required anymore with OFED-1.1 - It > used to make a significant difference with OFED-1.0, but I didn't > observe it with OFED-1.1 > > And again, the user mode performance is comparable between VIB and OFED. > > Philippe > > It's not related to OFED 1.1 or OFED 1.0, but to the difference between OFED and VAPI. In VAPI this setting was always done. In OFED we do not do it by default and you need this parameter. See this note in the mthca release notes: 4. Performance degradation due to wrong BIOS configuration: The PCI Express spec. requires BIOS to set the MaxReadReq register for each card for maximum performance and stability. If you are seeing bandwidth performance degradation, you can try forcing the card to behave out of PCI Express spec. by setting the tune_pci=1 module parameter. This tune_pci=1 option was the default setting in OFED 1.0, which might have masked performance degradation on some systems. If tune_pci=1 improves bandwidth, please report the issue to your BIOS vendor.
Please note that Mellanox Technologies does not recommend using tune_pci=1 in production systems: working with tune_pci=1 option set is untested and is known to trigger stability issues on some platforms. Can you please try it. Tziporet From halr at voltaire.com Thu Dec 14 07:51:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2006 10:51:03 -0500 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05537F70@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E05537F70@idaexc03.emea.cpqcorp.net> Message-ID: <1166111433.28709.131124.camel@hal.voltaire.com> On Thu, 2006-12-14 at 09:39, Bernadat, Philippe wrote: > So I did tried tune_pci=1. Didn't make any difference. > > I used the same nodes to compare the lscpi output. > I could see: > > [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed > 30c30 > < Capabilities: [40] MSI-X: Enable+ Mask- TabSize=32 > --- > > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 Might the MSI-X difference explain it ? -- Hal > 38,39c38,39 > < 40: 11 50 1f 80 00 20 08 00 00 22 08 00 00 00 00 00 > < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 > --- > > 40: 11 50 1f 00 00 20 08 00 00 22 08 00 00 00 00 00 > > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 > > So I added the ib_mthca msi_x=1 option. > It didn't help. > > So the only remaining difference now is: > > [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed > 39c39 > < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 > --- > > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 > > No idea what this is. > > Philippe > > > -----Original Message----- > > From: Bernadat, Philippe > > Sent: Thursday, December 14, 2006 1:24 PM > > To: Tziporet Koren > > Cc: Eric Barton; Roland Dreier; Matt Leininger; > > openib-general at openib.org > > Subject: RE: [openib-general] Performance Degradation with > > OFED v. 
Voltaire > > > > > > > Have you tried running with > > > > > > options ib_mthca tune_pci =1 > > > > > > > My understanding is that this is not required anymore with > > OFED-1.1 - It used to make a siginifciant differences with > > OFED-1.0, but I didn't observe it with OFED-1.1 > > > > And again, the user mode performance if comparable between > > VIB and OFED. > > > > Philippe > > > > > -----Original Message----- > > > From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] > > > Sent: Thursday, December 14, 2006 12:30 PM > > > To: Bernadat, Philippe > > > Cc: Eric Barton; Roland Dreier; Matt Leininger; > > > openib-general at openib.org; Bernadat, Philippe > > > Subject: Re: [openib-general] Performance Degradation with > > > OFED v. Voltaire > > > > > > Philippe Bernadat wrote: > > > > Roland, > > > > > > > > Attached are the two lspci outputs. > > > > > > > > The only differences I see are: > > > > > > > > [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed > > > > 1d0 > > > > < pcilib: Resource 5 in > > > /sys/bus/pci/devices/0000:00:1f.1/resource has > > > > a 64-bit address, ignoring > > > > 40c39 > > > > < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 > > > > --- > > > > > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 > > > > [philippe at hamish o2ib]$ > > > > > > > Have you tried running with > > > > > > options ib_mthca tune_pci =1 > > > > > > Tziporet > > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From philippe_bernadat at hp.com Thu Dec 14 07:56:10 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Thu, 14 Dec 2006 16:56:10 +0100 Subject: [openib-general] Performance Degradation with OFED v. 
Voltaire In-Reply-To: <1166111433.28709.131124.camel@hal.voltaire.com> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055380B6@idaexc03.emea.cpqcorp.net> > > [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed > > 30c30 > > < Capabilities: [40] MSI-X: Enable+ Mask- TabSize=32 > > --- > > > Capabilities: [40] MSI-X: Enable- Mask- TabSize=32 > > Might the MSI-X difference explain it ? Yes, that difference went away when I added the option (see the lines below). > > > > So I added the ib_mthca msi_x=1 option. > > It didn't help. > > > > So the only remaining difference now is: > > > > [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed > > 39c39 > > < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 > > --- > > > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 > > > > No idea what this is. > > From chas at cmf.nrl.navy.mil Thu Dec 14 07:49:08 2006 From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR) Date: Thu, 14 Dec 2006 10:49:08 -0500 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: Message-ID: <200612141549.kBEFn8mZ032667@cmf.nrl.navy.mil> In message ,"Roland Dreier" writes: >I'm not sure who declared it "unsupported" and I would really like to >know what issue(s) led to that declaration. Your report is the first >I've heard of anything like this, and I have to say that it seems >pretty implausible that running a 32-bit kernel on 64-bit-capable >hardware would be the source of problems -- if there is an issue then >I would expect it to be something to do with the 32-bit kernel. we saw this "problem" last week actually. a new dual core duo machine was installed with a 32-bit version of suse 10. srp ran, but sometimes the scsi data buffers had minor single byte errors (they didn't appear to be at page boundaries but i am not certain about that). perhaps a kmap issue? 64-bit machines running 32-bit/PAE with more than 4GB of memory? this is anecdotal evidence of course.
we were (are) seeing symbol errors on the cable but i should think these errors get as far as the srp layer. From rdreier at cisco.com Thu Dec 14 07:58:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 07:58:06 -0800 Subject: [openib-general] [PATCH] mthca: save low memory used for reserved objects In-Reply-To: <20061214124629.GB24840@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 14 Dec 2006 14:46:29 +0200") References: <20061214124629.GB24840@mellanox.co.il> Message-ID: > We never need to allocate memory for reserved objects in low memory. True, but... > table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT, > - (use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) | > - __GFP_NOWARN); > + GFP_HIGHUSER | __GFP_NOWARN); ...it's quite not so simple, is it? the chunk being allocated here might not contain exclusively reserved objects -- it might have some real objects too. From rdreier at cisco.com Thu Dec 14 07:59:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 07:59:12 -0800 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: <20061214144015.GC27620@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 14 Dec 2006 16:40:15 +0200") References: <20061214144015.GC27620@mellanox.co.il> Message-ID: > Just to clarify - do you plan to fix this, or are waiting for me to do it? I am planning to work on it but I am going on vacation from Dec 17th until Jan 3rd so it might not be for a while... From rdreier at cisco.com Thu Dec 14 08:02:18 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 08:02:18 -0800 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: <200612141549.kBEFn8mZ032667@cmf.nrl.navy.mil> (chas williams's message of "Thu, 14 Dec 2006 10:49:08 -0500") References: <200612141549.kBEFn8mZ032667@cmf.nrl.navy.mil> Message-ID: > we saw this "problem" last week actually. 
a new instal dual core > duo machine was installed with a 32-bit version of suse 10. srp ran, > but sometimes the scsi data buffers had minor single byte errors (they > didnt appear to be at page boundaries but i am not certain about > that). perhaps a kmap issue? 64-bit machines running 32-bit/PAE > with more than 4GB of memory? Core duo (not Core2) isn't 64-bit capable is it? Did you mean core2 and if so did your problems go away by running a 64-bit kernel? - R. From mst at mellanox.co.il Thu Dec 14 08:03:05 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 18:03:05 +0200 Subject: [openib-general] [PATCH] mthca: save low memory used for reserved objects In-Reply-To: References: <20061214124629.GB24840@mellanox.co.il> Message-ID: <20061214160305.GD27620@mellanox.co.il> > > We never need to allocate memory for reserved objects in low memory. > > True, but... > > > table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT, > > - (use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) | > > - __GFP_NOWARN); > > + GFP_HIGHUSER | __GFP_NOWARN); > > ...it's quite not so simple, is it? the chunk being allocated here > might not contain exclusively reserved objects -- it might have some > real objects too. Correct. Missed this. -- MST From rdreier at cisco.com Thu Dec 14 08:06:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 08:06:04 -0800 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055380B6@idaexc03.emea.cpqcorp.net> (Philippe Bernadat's message of "Thu, 14 Dec 2006 16:56:10 +0100") References: <3F3894AC7A13B04E83CEBC95CFD3047E055380B6@idaexc03.emea.cpqcorp.net> Message-ID: OK, it looks like the PCI config is OK. I guess the difference must be in the Lustre NAL, since you say other userspace code gets comparable performance. Is there any difference in the architecture of the NAL for the Voltaire stack and the standard Linux stack? 
You may have to rely on Voltaire and/or the Lustre people to fix this, since they're the only ones with the complete picture about the Voltaire stack. - R. From philippe_bernadat at hp.com Thu Dec 14 08:11:37 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Thu, 14 Dec 2006 17:11:37 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net> > I guess the difference must be in the Lustre NAL, since you say other > userspace code gets comparable performance. Is there any difference > in the architecture of the NAL for the Voltaire stack and the standard > Linux stack? I think Eric described the major differences earlier on, here it is, see second half: On Tue, 2006-12-05 at 12:22 +0000, Eric Barton wrote: > Hi, > > We'd dearly like some help to understand why we seem to be having > performance issues with OFED. When we run a lustre network bandwidth > benchmark, we find significant performance degradation on OFED versus > Voltaire... > > Premap (256 RDMA frags) Map on demand (1 RDMA frag) > Voltaire OFED Ratio Voltaire OFED Ratio > Writes MB/s 682 567 83 % 577 436 75 % > Reads MB/s 658 554 84 % 555 432 77 % > > These tests measure the bandwidth of 1MByte transfers pipelined 8 deep. > All hardware/software was the same, apart from the IB stack and the lustre > network driver. > > The architecture of the lustre network drivers for OFED and Voltaire are > almost identical. Both use RC QPs with the same control message protocol > to set up bulk data transfers using RDMA WRITE. Control messages use a > credit flow protocol to ensure that they are only sent when buffers are > posted to receive them. Concurrent transfers over the same QP are > supported so that lustre can pipeline bulk I/O. > > The only difference between the lustre network drivers is that the Voltaire > driver has a single global CQ and the OFED driver has 1 CQ per QP. 
However > the measurement above are for a single pair of nodes - in this case both > implementations use a single CQ. > > By default, the drivers pre-map all of physical memory so each RDMA > consists of page fragments. However, we can also compile both drivers to > map on demand using FMR so that RDMA is not fragmented. The results above > compare both methods and although both drivers perform worse when mapping, > the OFED driver takes the bigger hit. > > We'd be delighted if anyone can shed any light or can suggest any steps we > should take to discover the reason. We're also very willing to provide > assistance if any of the OpenFabrics developers wants to duplicate the > setup. > > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Thursday, December 14, 2006 5:06 PM > To: Bernadat, Philippe > Cc: Hal Rosenstock; Tziporet Koren; openib-general at openib.org > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire > > OK, it looks like the PCI config is OK. > > I guess the difference must be in the Lustre NAL, since you say other > userspace code gets comparable performance. Is there any difference > in the architecture of the NAL for the Voltaire stack and the standard > Linux stack? > > You may have to rely on Voltaire and/or the Lustre people to fix this, > since they're the only ones with the complete picture about the > Voltaire stack. > > - R. > From philippe_bernadat at hp.com Thu Dec 14 08:16:12 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Thu, 14 Dec 2006 17:16:12 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538102@idaexc03.emea.cpqcorp.net> So Roland, what is this subtle difference that remains ? 
>>> [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed >>> 39c39 >>> < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00 >>> --- >>> >>> > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00 >>>> Philippe > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Thursday, December 14, 2006 5:06 PM > To: Bernadat, Philippe > Cc: Hal Rosenstock; Tziporet Koren; openib-general at openib.org > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire > > OK, it looks like the PCI config is OK. > > I guess the difference must be in the Lustre NAL, since you say other > userspace code gets comparable performance. Is there any difference > in the architecture of the NAL for the Voltaire stack and the standard > Linux stack? > > You may have to rely on Voltaire and/or the Lustre people to fix this, > since they're the only ones with the complete picture about the > Voltaire stack. > > - R. > From chas at cmf.nrl.navy.mil Thu Dec 14 08:22:06 2006 From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR) Date: Thu, 14 Dec 2006 11:22:06 -0500 Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users who didn't RTFM In-Reply-To: Message-ID: <200612141622.kBEGM6Lj000670@cmf.nrl.navy.mil> In message ,Roland Dreier writes: >Core duo (not Core2) isn't 64-bit capable is it? Did you mean core2 >and if so did your problems go away by running a 64-bit kernel? sorry, yes i meant core2 duo. specifically, model name : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz we havent had a chance to reinstall to a 64-bit version of suse for this machine. From rdreier at cisco.com Thu Dec 14 08:29:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 08:29:57 -0800 Subject: [openib-general] Performance Degradation with OFED v. 
Voltaire In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05538102@idaexc03.emea.cpqcorp.net> (Philippe Bernadat's message of "Thu, 14 Dec 2006 17:16:12 +0100") References: <3F3894AC7A13B04E83CEBC95CFD3047E05538102@idaexc03.emea.cpqcorp.net> Message-ID: > So Roland, what is this subtle difference that remains ? I'm not sure ... something in the VPD capability. Doesn't seem significant. - R. From rdreier at cisco.com Thu Dec 14 08:31:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 08:31:24 -0800 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net> (Philippe Bernadat's message of "Thu, 14 Dec 2006 17:11:37 +0100") References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net> Message-ID: > I think Eric described the major differences earlier on, here it is, see > second half: OK, I forgot about that. I guess one last thing to check would be the MTU being used for the RC connections. Since this is PCI-X HW then the MTU should be 1024 for best throughput (instead of the max MTU of 2048). - R. From mst at mellanox.co.il Thu Dec 14 09:04:55 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 19:04:55 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: References: Message-ID: <20061214170455.GA12781@mellanox.co.il> > and in fact not even that would work: since a non-cache-coherent CPU > can only work on cacheline-sized chunks there's no safe way to touch the MTT > table. Roland, could you please clarify what did you mean by this statement? With current code firmware might be doing WRITE_MTT while CPU is writing to the same cache line, and I expect this might confuse things, but it seems that with my fmr/mr merge patch, we never have both CPU and firmware write to the same MTTs entries. 
So, assuming my patch is applied, why isn't sticking pci_dma_sync_sg in FMR code
sufficient?

--
MST

From mst at mellanox.co.il Thu Dec 14 09:31:45 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 19:31:45 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To:
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>
Message-ID: <20061214173145.GC12781@mellanox.co.il>

> > I think Eric described the major differences earlier on, here it is, see
> > second half:
>
> OK, I forgot about that.
>
> I guess one last thing to check would be the MTU being used for the RC
> connections. Since this is PCI-X HW then the MTU should be 1024 for
> best throughput (instead of the max MTU of 2048).

The MTU issue is described in the OFED release notes.
You must turn the Tavor work-around for it on in opensm.
This was introduced late in the release cycle, so it was deemed safer
to make it off by default.

By the way, Eitan, Hal, can we turn this on by default now?
This way we'll get more feedback from people, and we'll still have
time to turn it off before release if this unexpectedly creates issues.

--
MST

From mshefty at ichips.intel.com Thu Dec 14 09:57:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 09:57:39 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace
In-Reply-To: <45816355.4010801@voltaire.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com>
Message-ID: <45819093.3090405@ichips.intel.com>

> What about the rdma_cm_get_option() and rdma_cm_set_option() exposed by
> librdmacm? is it something which is on its way out?

I did not expose those to userspace at this time. I believe what was there
needed to be reworked. For example, the timeout could be generic, rather than
IB specific, and the option to get a list of path records should be eliminated.
- Sean From sashak at voltaire.com Thu Dec 14 10:12:59 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 14 Dec 2006 20:12:59 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061214061951.GH1689@mellanox.co.il> References: <20061213232638.GC14186@sashak.voltaire.com> <20061214061951.GH1689@mellanox.co.il> Message-ID: <20061214181259.GE28849@sashak.voltaire.com> On 08:19 Thu 14 Dec , Michael S. Tsirkin wrote: > > > > For me it is unclear yet how long we may need this - 1.1 still be in > > > > SVN yet, and 1.1 git branch is updated there. > > > > > > By the way, one can't actually build OFED 1.1 userspace from git > > > because OFED also applies some patches after checking things out > > > from svn. They are here: > > > https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes > > > > I guess those patches should be committed in 1.1 svn branch (and imported > > to git's 1.1). > > This could be done, but why invest the time? To do commits? SVN commit was done anyway, just in the different place and in form of the diffs. > And once we do touch the branch, who will test that the thing you > pull from there even works? How this is different? Who will test branch + ofed_fixes diffs? Use tag to mark tested version (or date). > I would say that if you really want to mirror the OFED branch, > and make it buildable to some extent, the way to do this > would be to have a single git tree with all of OFED - patches, > scripts and all. I'm able to build OpenSM for OFED 1.1 from git tree just fine. And synced 1.1 branch in git let me some useful stuff - I can log, diff, rebase and cherry-pick fixes, etc.. - everything is in-tree (I said that I like branches :)). > Oh, by the way, some tools in OFED tried to read an svn version > in their code, this wouldn't work on git. > And I don't see git trees for a lot of OFED bits - look at > https://openib.org/svn/gen2/branches/1.1/ofed/ IMHO this is not too much hard to switch OFED 1.1.x to git. 
But it is not really my point - I just think that synced 1.1 branch in git
tree can be useful for developers and for 1.1 project's support works.

> What I am trying to say is, let's just keep SVN around and
> do OFED 1.1 maintenance there. You can't fix the history.
>
> > Any reason why it is not committed?
>
> This was discussed before OFED 1.1 and seems to have worked well so far.
>
> We tried to keep our modifications to upstream as separate as possible -
> this made transition to upstream in OFED 1.2 very easy as it was trivial
> to check what was applied and what wasn't.

I cannot understand how not committing changes helps.

Sasha

From eitan at mellanox.co.il Thu Dec 14 10:13:28 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Dec 2006 20:13:28 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061214173145.GC12781@mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net> <20061214173145.GC12781@mellanox.co.il>
Message-ID: <45819448.8060005@mellanox.co.il>

Michael S. Tsirkin wrote:
>> > I think Eric described the major differences earlier on, here it is, see
>> > second half:
>>
>> OK, I forgot about that.
>>
>> I guess one last thing to check would be the MTU being used for the RC
>> connections. Since this is PCI-X HW then the MTU should be 1024 for
>> best throughput (instead of the max MTU of 2048).
>>
>
> The MTU issue is described in the OFED release notes.
> You must turn the Tavor work-around for it on in opensm.
> This was introduced late in the release cycle, so it was deemed safer
> to make it off by default.
>
> By the way, Eitan, Hal, can we turn this on by default now?
> This way we'll get more feedback from people, and we'll still have
> time to turn it off before release if this unexpectedly creates issues.
>
>
I agree that we should enable this feature by default now.
EZ

From mst at mellanox.co.il Thu Dec 14 10:40:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 20:40:15 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214181259.GE28849@sashak.voltaire.com>
References: <20061214181259.GE28849@sashak.voltaire.com>
Message-ID: <20061214184015.GE12781@mellanox.co.il>

> > > Any reason why it is not committed?
> >
> > This was discussed before OFED 1.1 and seems to have worked well so far.
> >
> > We tried to keep our modifications to upstream as separate as possible -
> > this made transition to upstream in OFED 1.2 very easy as it was trivial
> > to check what was applied and what wasn't.
>
> I cannot understand how not committing changes helps.

OFED is tracking trunk and using quilt to manage changes against trunk.
That's why they are in form of patches.

Now that everything is in git, we can look at using stgit for this,
I'm not sure how well would publishing stgit-managed tree work.

--
MST

From mst at mellanox.co.il Thu Dec 14 10:46:31 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 20:46:31 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214181259.GE28849@sashak.voltaire.com>
References: <20061213232638.GC14186@sashak.voltaire.com> <20061214061951.GH1689@mellanox.co.il> <20061214181259.GE28849@sashak.voltaire.com>
Message-ID: <20061214184631.GF12781@mellanox.co.il>

> > > I guess those patches should be committed in 1.1 svn branch (and imported
> > > to git's 1.1).
> >
> > This could be done, but why invest the time?
>
> To do commits? SVN commit was done anyway, just in the different place
> and in form of the diffs.

But it was already done, is my point.

> > And once we do touch the branch, who will test that the thing you
> > pull from there even works?
>
> How this is different? Who will test branch + ofed_fixes diffs?

No one until we do a bugfix release :).
> > I would say that if you really want to mirror the OFED branch, > > and make it buildable to some extent, the way to do this > > would be to have a single git tree with all of OFED - patches, > > scripts and all. > > I'm able to build OpenSM for OFED 1.1 from git tree just fine. And > synced 1.1 branch in git let me some useful stuff - I can log, diff, > rebase and cherry-pick fixes, etc.. - everything is in-tree (I said > that I like branches :)). I sure don't have a problem with that. But it would be better to avoid touching 1.1 svn branch any more than absolutely necessary. Do you only want this for opensm? opensm happens not to have any patches, so it's easy. -- MST From mst at mellanox.co.il Thu Dec 14 10:52:10 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 20:52:10 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: <20061214170455.GA12781@mellanox.co.il> References: <20061214170455.GA12781@mellanox.co.il> Message-ID: <20061214185210.GH12781@mellanox.co.il> > > and in fact not even that would work: since a non-cache-coherent CPU > > can only work on cacheline-sized chunks there's no safe way to touch the MTT > > table. > > Roland, could you please clarify what did you mean by this statement? > > With current code firmware might be doing WRITE_MTT while CPU is writing to the > same cache line, and I expect this might confuse things, but it seems that with > my fmr/mr merge patch, we never have both CPU and firmware write to the same > MTTs entries. > > So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code > sufficient? Documentation/DMA-mapping.txt actually says: > Without that, you'd see cacheline > sharing problems (data corruption) on CPUs with DMA-incoherent caches. > (The CPU could write to one word, DMA would write to a different one > in the same cache line, and one of them could be overwritten.) 
So with my patch, since we never have both HW and CPU DMA into the buffer,
we should be OK.

--
MST

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From sean.hefty at intel.com Thu Dec 14 11:22:19 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 11:22:19 -0800
Subject: [openib-general] [PATCH] 2.6.20 rdma_ucm: fix struct ucma_event leak
Message-ID: <000001c71fb5$2cf517b0$8698070a@amr.corp.intel.com>

We discard new connection requests while the listen backlog is full, but
leak a struct ucma_event in the process. Free the structure.

Signed-off-by: Sean Hefty
---
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index f51b755..ace2cad 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -209,6 +209,7 @@ static int ucma_event_handler(struct rdm
 	if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) {
 		if (!ctx->backlog) {
 			ret = -EDQUOT;
+			kfree(uevent);
 			goto out;
 		}
 		ctx->backlog--;

From sashak at voltaire.com Thu Dec 14 11:50:34 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 21:50:34 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214184015.GE12781@mellanox.co.il>
References: <20061214181259.GE28849@sashak.voltaire.com> <20061214184015.GE12781@mellanox.co.il>
Message-ID: <20061214195034.GA7838@sashak.voltaire.com>

On 20:40 Thu 14 Dec , Michael S. Tsirkin wrote:
>
> Now that everything is in git, we can look at using stgit for this,
> I'm not sure how well would publishing stgit-managed tree work.

I've used stgit couple of months ago, but switched to core git, today it
does everything what stgit did. Don't think however that this is better
for publishing - git-rebase and git-reset produce non-linear history.
Sasha From eitan at mellanox.co.il Thu Dec 14 11:53:52 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 14 Dec 2006 21:53:52 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion In-Reply-To: <4581525C.9060104@mellanox.co.il> References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> <1166098306.28709.122104.camel@hal.voltaire.com> <4581525C.9060104@mellanox.co.il> Message-ID: <4581ABD0.7050509@mellanox.co.il> Update on analysis of failures: Eitan Zahavi wrote: > Hal Rosenstock wrote: > >> Hi Eitan, >> >> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote: >> >> >>> OSM Simulation Regression Summary >>> OpenSM rev = ____ >>> ibutils rev = ____ >>> Total=264 Pass=261 Fail=3 >>> >>> Pass: >>> 36 Stability IS1-16.topo >>> 36 Pkey IS1-16.topo >>> 36 Multicast IS1-16.topo >>> 36 LidMgr IS1-16.topo >>> 35 OsmStress IS1-16.topo >>> 12 Stability IS3-loop.topo >>> 12 Stability IS3-128.topo >>> 12 Pkey IS3-128.topo >>> 12 OsmStress IS3-128.topo >>> 12 Multicast IS3-loop.topo >>> 11 Multicast IS3-128.topo >>> 11 LidMgr IS3-128.topo >>> >>> Failures: >>> 1 OsmStress IS1-16.topo >>> Job was killed in the middle. Just an accident. >>> 1 Multicast IS3-128.topo >>> A single packet was dropped on the way to the SM. Still not clear where. However, I have seen a perfectly good link reported by the drop manager as missing. I will rerun some tests with valgrind as I think this might be a memory corruption issue. >>> 1 LidMgr IS3-128.topo >>> Seems like the last sweep started before the last change in LID was made. So it missed one of the nodes. Additional sweep was enforced at the end of the test - just to make sure all changes are handled. >>> >>> >> There are now 2 more failures. You had previously explained OsmStress >> failure as needing more investigation. Now there is a Multicast and >> LidMgr failure yet nothing really changed since the previous run the >> night before. Are these new tests ? What were the failures ? 
>> >> > The tests use random seeds and thus can catch other bugs in each run. > I am investigating these failures. Some might be due to bugs in the > checker code too. > > Please pay attention the failure rate is low (LidMgr pass 36+11 runs > failed 1 test). > This to imply the bug is a hard to find one. > >> The repetitions have also been reduced from previous reports. Are these >> the same or different tests ? >> >> > Number of repetitions depends on runtime. The regression started later > thus run less iterations. > I run the "same" tests ("same" means same code not same random sequence). > >> -- Hal >> >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Thu Dec 14 12:04:02 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 14 Dec 2006 22:04:02 +0200 Subject: [openib-general] userspace git trees In-Reply-To: <20061214184631.GF12781@mellanox.co.il> References: <20061213232638.GC14186@sashak.voltaire.com> <20061214061951.GH1689@mellanox.co.il> <20061214181259.GE28849@sashak.voltaire.com> <20061214184631.GF12781@mellanox.co.il> Message-ID: <20061214200402.GB7838@sashak.voltaire.com> On 20:46 Thu 14 Dec , Michael S. Tsirkin wrote: > > > > I would say that if you really want to mirror the OFED branch, > > > and make it buildable to some extent, the way to do this > > > would be to have a single git tree with all of OFED - patches, > > > scripts and all. > > > > I'm able to build OpenSM for OFED 1.1 from git tree just fine. 
And
> > synced 1.1 branch in git let me some useful stuff - I can log, diff,
> > rebase and cherry-pick fixes, etc.. - everything is in-tree (I said
> > that I like branches :)).
>
> I sure don't have a problem with that. But it would be better to avoid
> touching 1.1 svn branch any more than absolutely necessary.

Yes, only critical fixes should be committed - actually the patches from
ofed_fixes/ .

> Do you only want this for opensm? opensm happens not to have any patches,
> so it's easy.

OpenSM has 1.1 fixes too - it is all committed.

Sasha

From kliteyn at dev.mellanox.co.il Thu Dec 14 11:58:29 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 14 Dec 2006 21:58:29 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang'
Message-ID: <4581ACE5.9000109@dev.mellanox.co.il>

Hi Hal

This patch fixes a bug that caused ucast manager to return
OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
Added a boolean flag that marks whether there was some change or not
(in which case OSM_SIGNAL_DONE should be returned).

--
Yevgeny

Signed-off-by: Yevgeny Kliteynik
---
 osm/include/opensm/osm_ucast_mgr.h |    6 ++++++
 osm/opensm/osm_ucast_mgr.c         |   13 ++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/osm/include/opensm/osm_ucast_mgr.h b/osm/include/opensm/osm_ucast_mgr.h
index 8237963..39bf45a 100644
--- a/osm/include/opensm/osm_ucast_mgr.h
+++ b/osm/include/opensm/osm_ucast_mgr.h
@@ -104,6 +104,7 @@ typedef struct _osm_ucast_mgr
 	osm_req_t *p_req;
 	osm_log_t *p_log;
 	cl_plock_t *p_lock;
+	boolean_t any_change;
 	uint8_t *lft_buf;
 } osm_ucast_mgr_t;
 /*
@@ -120,6 +121,11 @@ typedef struct _osm_ucast_mgr
 *	p_lock
 *		Pointer to the serializing lock.
 *
+*	any_change
+*		Initialized to FALSE at the beginning of the algorithm,
+*		set to TRUE by osm_ucast_mgr_set_fwd_table() if any mad
+*		was sent.
+*
 *	lft_buf
 *		LFT buffer - used during LFT calculation/setup.
 *
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index 3341eea..e977253 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -984,6 +984,7 @@ osm_ucast_mgr_set_fwd_table(
   }
   else
   {
+    p_mgr->any_change = TRUE;
     /*
       HACK: for now we will assume we succeeded to send
       and set the local DB based on it. This should allow
@@ -1220,6 +1221,7 @@ osm_ucast_mgr_process(
   if (cl_qmap_count( p_sw_guid_tbl ) == 0)
     goto Exit;

+  p_mgr->any_change = FALSE;
   cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);

   if (!p_routing_eng->build_lid_matrices ||
@@ -1246,13 +1248,10 @@ osm_ucast_mgr_process(
   if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
     __osm_ucast_mgr_dump_tables( p_mgr );

-  /*
-    For now don't bother checking if the switch forwarding tables
-    actually needed updating. The current code will always update
-    them, and thus leave transactions pending on the wire.
-    Therefore, return OSM_SIGNAL_DONE_PENDING.
-  */
-  signal = OSM_SIGNAL_DONE_PENDING;
+  if (p_mgr->any_change)
+    signal = OSM_SIGNAL_DONE_PENDING;
+  else
+    signal = OSM_SIGNAL_DONE;

   osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
           "osm_ucast_mgr_process: "
--
1.4.4.1.GIT

From raleigh at systemfabricworks.com Thu Dec 14 12:07:02 2006
From: raleigh at systemfabricworks.com (Raleigh F Rinehart)
Date: Thu, 14 Dec 2006 14:07:02 -0600
Subject: [openib-general] SA MADs and Cisco SM
Message-ID: <4581AEE6.8060905@systemfabricworks.com>

Hi All,

I am developing an API that uses SA MADs to create and get
ServiceRecords. I am using the OFED1.1 mad and umad libraries.
Everything works flawlessly with OpenSM. However if I run in a fabric
that is using the SM embedded in a Cisco/Topspin switch, queries
(SubnAdmGetTable) for ServiceRecords fail with a status 110 (ETIMEDOUT).

Since I don't have direct access to the switch and logs I can't tell
what is going on from the switch, I am working on getting those logs asap.
However I was wondering if there were any known issues with interoperability, or functionality with OpenIB and the Cisco SM? Any ideas or pointers in the right direction would be greatly appreciated. thanks in advance, -raleigh From sashak at voltaire.com Thu Dec 14 12:16:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 14 Dec 2006 22:16:32 +0200 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <4581ACE5.9000109@dev.mellanox.co.il> References: <4581ACE5.9000109@dev.mellanox.co.il> Message-ID: <20061214201632.GD7838@sashak.voltaire.com> On 21:58 Thu 14 Dec , Yevgeny Kliteynik wrote: > Hi Hal > > This patch fixes a bug that caused ucast manager to return > OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. > Added a boolean flag that marks whether there was some change or not > (in which case OSM_SIGNAL_DONE should be returned). > > -- > Yevgeny > > Signed-off-by: Yevgeny Kliteynik Good finding. Sasha From halr at voltaire.com Thu Dec 14 12:12:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2006 15:12:29 -0500 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <4581ACE5.9000109@dev.mellanox.co.il> References: <4581ACE5.9000109@dev.mellanox.co.il> Message-ID: <1166127103.28709.140656.camel@hal.voltaire.com> Hi Yevgeny, On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: > Hi Hal > > This patch fixes a bug that caused ucast manager to return > OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. > Added a boolean flag that marks whether there was some change or not > (in which case OSM_SIGNAL_DONE should be returned). Just wondering what is the test case for this ? 
-- Hal From halr at voltaire.com Thu Dec 14 12:18:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2006 15:18:07 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion In-Reply-To: <4581ABD0.7050509@mellanox.co.il> References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> <1166098306.28709.122104.camel@hal.voltaire.com> <4581525C.9060104@mellanox.co.il> <4581ABD0.7050509@mellanox.co.il> Message-ID: <1166127430.28709.140858.camel@hal.voltaire.com> On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote: > Update on analysis of failures: > > Eitan Zahavi wrote: > > Hal Rosenstock wrote: > > > >> Hi Eitan, > >> > >> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote: > >> > >> > >>> OSM Simulation Regression Summary > >>> OpenSM rev = ____ > >>> ibutils rev = ____ > >>> Total=264 Pass=261 Fail=3 > >>> > >>> Pass: > >>> 36 Stability IS1-16.topo > >>> 36 Pkey IS1-16.topo > >>> 36 Multicast IS1-16.topo > >>> 36 LidMgr IS1-16.topo > >>> 35 OsmStress IS1-16.topo > >>> 12 Stability IS3-loop.topo > >>> 12 Stability IS3-128.topo > >>> 12 Pkey IS3-128.topo > >>> 12 OsmStress IS3-128.topo > >>> 12 Multicast IS3-loop.topo > >>> 11 Multicast IS3-128.topo > >>> 11 LidMgr IS3-128.topo > >>> > >>> Failures: > >>> 1 OsmStress IS1-16.topo > >>> > Job was killed in the middle. Just an accident. Is that always the case ? This one has been consistently failing. I think you had written something about this failure back in July. I can dig it out if you want. > >>> 1 Multicast IS3-128.topo > >>> > A single packet was dropped on the way to the SM. Still not clear where. > However, I have seen a perfectly good link reported by the drop manager > as missing. I think I may have seen this as well on some rare occasions. I could never figure out why this happened. > I will rerun some tests with valgrind as I think this might be a memory > corruption issue. OK. 
> >>> 1 LidMgr IS3-128.topo > >>> > Seems like the last sweep started before the last change in LID was > made. So it missed one of the nodes. > Additional sweep was enforced at the end of the test - just to make sure > all changes are handled. So is this being reported as a failure improperly then ? -- Hal > >>> > >>> > >> There are now 2 more failures. You had previously explained OsmStress > >> failure as needing more investigation. Now there is a Multicast and > >> LidMgr failure yet nothing really changed since the previous run the > >> night before. Are these new tests ? What were the failures ? > >> > >> > > The tests use random seeds and thus can catch other bugs in each run. > > I am investigating these failures. Some might be due to bugs in the > > checker code too. > > > > Please pay attention the failure rate is low (LidMgr pass 36+11 runs > > failed 1 test). > > This to imply the bug is a hard to find one. > > > >> The repetitions have also been reduced from previous reports. Are these > >> the same or different tests ? > >> > >> > > Number of repetitions depends on runtime. The regression started later > > thus run less iterations. > > I run the "same" tests ("same" means same code not same random sequence). 
> > > >> -- Hal > >> > >> > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > >> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From eitan at mellanox.co.il Thu Dec 14 12:24:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 14 Dec 2006 22:24:26 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion In-Reply-To: <1166127430.28709.140858.camel@hal.voltaire.com> References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> <1166098306.28709.122104.camel@hal.voltaire.com> <4581525C.9060104@mellanox.co.il> <4581ABD0.7050509@mellanox.co.il> <1166127430.28709.140858.camel@hal.voltaire.com> Message-ID: <4581B2FA.7090602@mellanox.co.il> Hal Rosenstock wrote: > On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote: > >> Update on analysis of failures: >> >> Eitan Zahavi wrote: >> >>> Hal Rosenstock wrote: >>> >>> >>>> Hi Eitan, >>>> >>>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote: >>>> >>>> >>>> >>>>> OSM Simulation Regression Summary >>>>> OpenSM rev = ____ >>>>> ibutils rev = ____ >>>>> Total=264 Pass=261 Fail=3 >>>>> >>>>> Pass: >>>>> 36 Stability IS1-16.topo >>>>> 36 Pkey IS1-16.topo >>>>> 36 Multicast IS1-16.topo >>>>> 36 LidMgr IS1-16.topo >>>>> 35 OsmStress IS1-16.topo >>>>> 12 Stability IS3-loop.topo >>>>> 12 Stability IS3-128.topo >>>>> 12 Pkey IS3-128.topo >>>>> 12 OsmStress IS3-128.topo >>>>> 12 Multicast IS3-loop.topo >>>>> 11 Multicast IS3-128.topo >>>>> 11 LidMgr IS3-128.topo >>>>> >>>>> Failures: >>>>> 1 OsmStress IS1-16.topo >>>>> >>>>> >> Job was killed in the middle. 
Just an accident. >> > > Is that always the case ? This one has been consistently failing. > I think you had written something about this failure back in July. I can > dig it out if you want. > > >>>>> 1 Multicast IS3-128.topo >>>>> >>>>> >> A single packet was dropped on the way to the SM. Still not clear where. >> However, I have seen a perfectly good link reported by the drop manager >> as missing. >> > > I think I may have seen this as well on some rare occasions. I could > never figure out why this happened. > > >> I will rerun some tests with valgrind as I think this might be a memory >> corruption issue. >> > > OK. > > >>>>> 1 LidMgr IS3-128.topo >>>>> >>>>> >> Seems like the last sweep started before the last change in LID was >> made. So it missed one of the nodes. >> Additional sweep was enforced at the end of the test - just to make sure >> all changes are handled. >> > > So is this being reported as a failure improperly then ? > Well the test failed. The fix was committed. We will see in the next few days if it is really a false alarm. > -- Hal > > >>>>> >>>>> >>>>> >>>> There are now 2 more failures. You had previously explained OsmStress >>>> failure as needing more investigation. Now there is a Multicast and >>>> LidMgr failure yet nothing really changed since the previous run the >>>> night before. Are these new tests ? What were the failures ? >>>> >>>> >>>> >>> The tests use random seeds and thus can catch other bugs in each run. >>> I am investigating these failures. Some might be due to bugs in the >>> checker code too. >>> >>> Please pay attention the failure rate is low (LidMgr pass 36+11 runs >>> failed 1 test). >>> This to imply the bug is a hard to find one. >>> >>> >>>> The repetitions have also been reduced from previous reports. Are these >>>> the same or different tests ? >>>> >>>> >>>> >>> Number of repetitions depends on runtime. The regression started later >>> thus run less iterations. 
>>> I run the "same" tests ("same" means same code not same random sequence). >>> >>> >>>> -- Hal >>>> >>>> >>>> _______________________________________________ >>>> openib-general mailing list >>>> openib-general at openib.org >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>>> >>>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Dec 14 12:40:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 12:40:52 -0800 Subject: [openib-general] userspace git trees In-Reply-To: <20061214195034.GA7838@sashak.voltaire.com> (Sasha Khapyorsky's message of "Thu, 14 Dec 2006 21:50:34 +0200") References: <20061214181259.GE28849@sashak.voltaire.com> <20061214184015.GE12781@mellanox.co.il> <20061214195034.GA7838@sashak.voltaire.com> Message-ID: > I've used stgit couple of months ago, but switched to core git, today it > does everything what stgit did. Don't think however that this is better > for publishing - git-rebase and git-reset produce non-linear history. How do you get the equivalent of stg pop edit patch stg refresh stg push with core git? - R. 
From sashak at voltaire.com Thu Dec 14 12:51:09 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 14 Dec 2006 22:51:09 +0200 Subject: [openib-general] userspace git trees In-Reply-To: References: <20061214181259.GE28849@sashak.voltaire.com> <20061214184015.GE12781@mellanox.co.il> <20061214195034.GA7838@sashak.voltaire.com> Message-ID: <20061214205109.GE7838@sashak.voltaire.com> On 12:40 Thu 14 Dec , Roland Dreier wrote: > > I've used stgit couple of months ago, but switched to core git, today it > > does everything what stgit did. Don't think however that this is better > > for publishing - git-rebase and git-reset produce non-linear history. > > How do you get the equivalent of > > stg pop > edit patch > stg refresh > stg push > > with core git? git-reset HEAD^ edit patch git-commit -c ORIG_HEAD I think there is also 'git-commit --amend', but didn't use it yet. Sasha From tziporet at dev.mellanox.co.il Thu Dec 14 12:46:43 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 14 Dec 2006 22:46:43 +0200 Subject: [openib-general] reminder: OFED 1.2 meeting on Monday 18-Dec at 9am PST Message-ID: <4581B833.7060600@dev.mellanox.co.il> Hi All, I wish to remind all that we are going to have the OFED 1.2 coordination meeting next Monday at 9am PST. Bridge info same as all meetings (sent by Jeff) Meeting agenda: 1. Status review of the features we agreed upon to make sure all code will be ready by end of January for the alpha release. 2. Feedback on the daily build Reminder: release plan on the Wiki: https://openib.org/tiki/tiki-index.php?page=OFED+1.2+release+plan+and+features Please plan to attend since our next meeting is only at 15-January 07 due to new year holiday. Tziporet From rdreier at cisco.com Thu Dec 14 12:46:38 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 12:46:38 -0800 Subject: [openib-general] SA MADs and Cisco SM In-Reply-To: <4581AEE6.8060905@systemfabricworks.com> (Raleigh F. 
Rinehart's message of "Thu, 14 Dec 2006 14:07:02 -0600") References: <4581AEE6.8060905@systemfabricworks.com> Message-ID: > I am developing an API that uses SA MADs to create and get > ServiceRecords. I am using the OFED1.1 mad and umad libraries. > Everything works flawlessly with OpenSM. However if I run in a fabric > that is using the SM embedded in a Cisco/Topspin switch, queries > (SubnAdmGetTable) for ServiceRecords fail with a status 110 (ETIMEDOUT). > Since I don't have direct access to the switch and logs I can't tell > what is going on from the switch, I am working on getting those logs > asap. However I was wondering if there were any known issues with > interoperability, or functionality with OpenIB and the Cisco SM? I don't know of any issues with the Cisco SM, and I do most of my development using the Cisco SM running on Cisco switches. However, since you are seeing a problem, it would probably make sense to work with Cisco support to figure out if there is an issue with the embedded SM on Cisco switches. In this case, since the error is ETIMEDOUT, it might make sense to try your query with a longer timeout; it could just be that returning a big table is taking longer than the timeout you set. Perhaps opensm works because it's running on a fast server CPU, while the switch SM is running out of gas on the embedded CPU in the switch. - R. From rdreier at cisco.com Thu Dec 14 12:50:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 12:50:03 -0800 Subject: [openib-general] userspace git trees In-Reply-To: <20061214205109.GE7838@sashak.voltaire.com> (Sasha Khapyorsky's message of "Thu, 14 Dec 2006 22:51:09 +0200") References: <20061214181259.GE28849@sashak.voltaire.com> <20061214184015.GE12781@mellanox.co.il> <20061214195034.GA7838@sashak.voltaire.com> <20061214205109.GE7838@sashak.voltaire.com> Message-ID: > > How do you get the equivalent of > > > > stg pop > > edit patch > > stg refresh > > stg push > > > > with core git? 
> > git-reset HEAD^ > edit patch > git-commit -c ORIG_HEAD > > I think there is also 'git-commit --amend', but didn't use it yet. I don't think either of those is really equivalent. You can edit the commit at the end of your current branch, but there's no convenient analog of stg pop/stg push. Of course stgit is implemented on top of core git so you can reimplement it by hand, but I do think there is value in the stgit porcelain. - R. From halr at voltaire.com Thu Dec 14 12:48:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 14 Dec 2006 15:48:11 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion In-Reply-To: <4581B2FA.7090602@mellanox.co.il> References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com> <1166098306.28709.122104.camel@hal.voltaire.com> <4581525C.9060104@mellanox.co.il> <4581ABD0.7050509@mellanox.co.il> <1166127430.28709.140858.camel@hal.voltaire.com> <4581B2FA.7090602@mellanox.co.il> Message-ID: <1166129270.28709.141866.camel@hal.voltaire.com> On Thu, 2006-12-14 at 15:24, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote: > > > >> Update on analysis of failures: > >> > >> Eitan Zahavi wrote: > >> > >>> Hal Rosenstock wrote: > >>> > >>> > >>>> Hi Eitan, > >>>> > >>>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote: > >>>> > >>>> > >>>> > >>>>> OSM Simulation Regression Summary > >>>>> OpenSM rev = ____ > >>>>> ibutils rev = ____ > >>>>> Total=264 Pass=261 Fail=3 > >>>>> > >>>>> Pass: > >>>>> 36 Stability IS1-16.topo > >>>>> 36 Pkey IS1-16.topo > >>>>> 36 Multicast IS1-16.topo > >>>>> 36 LidMgr IS1-16.topo > >>>>> 35 OsmStress IS1-16.topo > >>>>> 12 Stability IS3-loop.topo > >>>>> 12 Stability IS3-128.topo > >>>>> 12 Pkey IS3-128.topo > >>>>> 12 OsmStress IS3-128.topo > >>>>> 12 Multicast IS3-loop.topo > >>>>> 11 Multicast IS3-128.topo > >>>>> 11 LidMgr IS3-128.topo > >>>>> > >>>>> Failures: > >>>>> 1 OsmStress IS1-16.topo > >>>>> > >>>>> > >> Job was killed in the 
middle. Just an accident. > >> > > > > Is that always the case ? This one has been consistently failing. > > I think you had written something about this failure back in July. I can > > dig it out if you want. > > > > > >>>>> 1 Multicast IS3-128.topo > >>>>> > >>>>> > >> A single packet was dropped on the way to the SM. Still not clear where. > >> However, I have seen a perfectly good link reported by the drop manager > >> as missing. > >> > > > > I think I may have seen this as well on some rare occasions. I could > > never figure out why this happened. > > > > > >> I will rerun some tests with valgrind as I think this might be a memory > >> corruption issue. > >> > > > > OK. > > > > > >>>>> 1 LidMgr IS3-128.topo > >>>>> > >>>>> > >> Seems like the last sweep started before the last change in LID was > >> made. So it missed one of the nodes. > >> Additional sweep was enforced at the end of the test - just to make sure > >> all changes are handled. > >> > > > > So is this being reported as a failure improperly then ? > > > Well the test failed. The fix was committed. Which fix ? Are you referring to the one Yevgeny just sent ? -- Hal > We will see in the next few > days if it is really a false alarm. > > -- Hal > > > > > >>>>> > >>>>> > >>>>> > >>>> There are now 2 more failures. You had previously explained OsmStress > >>>> failure as needing more investigation. Now there is a Multicast and > >>>> LidMgr failure yet nothing really changed since the previous run the > >>>> night before. Are these new tests ? What were the failures ? > >>>> > >>>> > >>>> > >>> The tests use random seeds and thus can catch other bugs in each run. > >>> I am investigating these failures. Some might be due to bugs in the > >>> checker code too. > >>> > >>> Please pay attention the failure rate is low (LidMgr pass 36+11 runs > >>> failed 1 test). > >>> This to imply the bug is a hard to find one. > >>> > >>> > >>>> The repetitions have also been reduced from previous reports. 
Are these > >>>> the same or different tests ? > >>>> > >>>> > >>>> > >>> Number of repetitions depends on runtime. The regression started later > >>> thus run less iterations. > >>> I run the "same" tests ("same" means same code not same random sequence). > >>> > >>> > >>>> -- Hal > >>>> > >>>> > >>>> _______________________________________________ > >>>> openib-general mailing list > >>>> openib-general at openib.org > >>>> http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> > >>>> > >>> _______________________________________________ > >>> openib-general mailing list > >>> openib-general at openib.org > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >>> > >>> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From or.gerlitz at gmail.com Thu Dec 14 12:51:17 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 14 Dec 2006 22:51:17 +0200 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <45819093.3090405@ichips.intel.com> References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com> Message-ID: <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> On 12/14/06, Sean Hefty wrote: > > What about the rdma_cm_get_option() and rdma_cm_set_option() exposed by > > librdmacm? is it something which is on its way out? > > I did not expose those to userspace at this time. I believe what was there > needed to be reworked. 
For example, the timeout could be generic, rather than > IB specific, and the option to get a list of path records should be eliminated. I see. I understand that there is some code which is part of OFED (udapl) that uses this api, what were you thinking to suggest them to do in the spirit of this code you have posted being the basis for OFED 1.2 ? Or. From rdreier at cisco.com Thu Dec 14 13:00:34 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 13:00:34 -0800 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: <20061214185210.GH12781@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 14 Dec 2006 20:52:10 +0200") References: <20061214170455.GA12781@mellanox.co.il> <20061214185210.GH12781@mellanox.co.il> Message-ID: > > With current code firmware might be doing WRITE_MTT while CPU is writing to the > > same cache line, and I expect this might confuse things, but it seems that with > > my fmr/mr merge patch, we never have both CPU and firmware write to the same > > MTTs entries. > > > > So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code > > sufficient? Yes, assuming that the CPU is the only entity ever writing to the MTT table, then doing pci_dma_sync_sg_for_cpu() before writing and pci_dma_sync_sg_for_device() afterwards should be OK. I think. > Documentation/DMA-mapping.txt actually says: > > > Without that, you'd see cacheline > > sharing problems (data corruption) on CPUs with DMA-incoherent caches. > > (The CPU could write to one word, DMA would write to a different one > > in the same cache line, and one of them could be overwritten.) Not sure what the relevance of that is -- it's kind of making the opposite point, that you need to make sure the CPU never touches a cacheline that might be DMAed at the same point. The part you snipped mentions alignment problems. 
What saves us for the MTT table is that with your patch the device never writes to the MTT table at all. - R. From or.gerlitz at gmail.com Thu Dec 14 13:07:50 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Thu, 14 Dec 2006 23:07:50 +0200 Subject: [openib-general] (no subject) In-Reply-To: References: <457FB82B.4090902@voltaire.com> <45810901.3090209@voltaire.com> Message-ID: <15ddcffd0612141307r24c95f6ag7bf75482705fa125@mail.gmail.com> On 12/14/06, Roland Dreier wrote: >> mmm, I understand all the comments raised during the review were fixed >> in the V3 post below, and now you say its both wrong and ugly... for >> example what's wrong here? > I take back the wrong statement, I misread the patch just now. good, we are making some progress... > But if you don't think the patch is ugly then I don't think we're looking at > the same thing. For example >> +static int __devinit mthca_check_profile_value(int* pval, int pval_default){ > and so on... I see. Being less familiar with __devinit and friends, will have to educate myself a little to see why the current patch is ugly... anyway, thanks for agreeing to fix it yourself. Or. From sashak at voltaire.com Thu Dec 14 13:14:02 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 14 Dec 2006 23:14:02 +0200 Subject: [openib-general] userspace git trees In-Reply-To: References: <20061214181259.GE28849@sashak.voltaire.com> <20061214184015.GE12781@mellanox.co.il> <20061214195034.GA7838@sashak.voltaire.com> <20061214205109.GE7838@sashak.voltaire.com> Message-ID: <20061214211402.GF7838@sashak.voltaire.com> On 12:50 Thu 14 Dec , Roland Dreier wrote: > > > How do you get the equivalent of > > > > > > stg pop > > > edit patch > > > stg refresh > > > stg push > > > > > > with core git? > > > > git-reset HEAD^ > > edit patch > > git-commit -c ORIG_HEAD > > > > I think there is also 'git-commit --amend', but didn't use it yet. > > I don't think either of those is really equivalent. 
You can edit the > commit at the end of your current branch, but there's no convenient > analog of stg pop/stg push. In the "worst" case - git-format-patch/git-am always help. > Of course stgit is implemented on top of core git so you can > reimplement it by hand, but I do think there is value in the stgit porcelain. Sure. I have nothing against stgit, it is nice tool and I used this successfully couple of months (and switched not because stgit was bad but because was needed to deal with core git for other stuff anyway). Sasha From rdreier at cisco.com Thu Dec 14 13:12:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 13:12:36 -0800 Subject: [openib-general] (no subject) In-Reply-To: <15ddcffd0612141307r24c95f6ag7bf75482705fa125@mail.gmail.com> (Or Gerlitz's message of "Thu, 14 Dec 2006 23:07:50 +0200") References: <457FB82B.4090902@voltaire.com> <45810901.3090209@voltaire.com> <15ddcffd0612141307r24c95f6ag7bf75482705fa125@mail.gmail.com> Message-ID: > I see. Being less familiar with __devinit and friends, will have to > educate myself a little to see why the current patch is ugly... > anyway, thanks for agreeing to fix it yourself. >> +static int __devinit mthca_check_profile_value(int* pval, int pval_default){ No, not the __devinit part -- I meant whitespace in "pval_default){". There's crazy indentation all over, whitespace breakage like > + if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) { And the macro +#define mthca_check_profile_and_warn(name, var, defval) \ + if (mthca_check_profile_value(&var, defval)) \ + mthca_warn(mdev, "invalid %s passed. changed to %d.\n", #name, var); is a little crazy -- why can't that if () statement be part of the function too? Anyway... - R. - R. 
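The last point above, folding the warning into the checking function instead of wrapping the call in a macro, might look roughly like the following. This is a minimal userspace sketch, not the mthca code: check_profile_value is a hypothetical name, fprintf stands in for mthca_warn(mdev, ...), and the rule that a value is invalid when it is not positive is an assumption, since the body of mthca_check_profile_value() is not quoted in this thread.

```c
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for the helper under review.
 *
 * Assumption: a profile value counts as invalid when it is not positive.
 * The real validity rule used by the patch is not shown in the thread.
 *
 * Returns 1 if *pval was replaced by the default, 0 if it was left alone.
 */
static int check_profile_value(const char *name, int *pval, int pval_default)
{
	if (*pval > 0)
		return 0;

	*pval = pval_default;
	/* In the driver this line would be a mthca_warn(mdev, ...) call. */
	fprintf(stderr, "invalid %s passed. changed to %d.\n", name, *pval);
	return 1;
}
```

With this shape a caller passes the parameter name as an ordinary string, once per module parameter, so neither the if () wrapper macro nor the #name stringification is needed.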
From swise at opengridcomputing.com Thu Dec 14 13:28:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 14 Dec 2006 15:28:47 -0600 Subject: [openib-general] librdmacm git repos needs config dir Message-ID: <1166131727.12420.9.camel@stevo-desktop> Sean, The librdmacm git repository needs a config dir or autoconf changes to make that dir as part of config. I'm not a autoconf wiz, so I just created the config dir and put a hidden file named .gitignore in it for libamso. That way its created when folks clone it. Dunno if that's the best way, but it worked... Steve. From akepner at sgi.com Thu Dec 14 13:08:26 2006 From: akepner at sgi.com (akepner at sgi.com) Date: Thu, 14 Dec 2006 13:08:26 -0800 (PST) Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race In-Reply-To: References: Message-ID: On Wed, 13 Dec 2006, Roland Dreier wrote: > Are there other possible ordering problems involving user memory (not > in a CQ or QP)? Something like a CPU on node A writing to memory on > node B and then posting a work request that makes the HCA DMA from > that memory on node B, and having the work request doorbell reach the > HCA before the write to node B actually happens, so the HCA DMAs the > old contents of node B's memory? Well, this case could be handled with mb() operations (if I understand you correctly). The type of race I had in mind is between DMA operations and updates to data structures shared between the host and HCA. But, yes, the example I used was only one of the possiblilities of this type of race. > > I guess the only feasible solution to the problem you're pointing out > is to have libmthca use some special mmap()-based allocator for queues > so that the kernel can give it memory that has the special > dma_map_consistent treatment. That's an excellent idea. (And, now that you've mentioned it, it's "obvious" ;-) I'll see what I can come up with using this approach. > > Ugh. > Well stated. 
-- Arthur From mshefty at ichips.intel.com Thu Dec 14 13:40:05 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 14 Dec 2006 13:40:05 -0800 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com> <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> Message-ID: <4581C4B5.5020702@ichips.intel.com> > I see. I understand that there is some code which is part of OFED > (udapl) that uses this api, what were you thinking to suggest them to > do in the spirit of this code you have posted being the basis for OFED > 1.2 ? DAPL has been updated to remove its use of these calls. The rdma cm timeout is essentially 1 minute now. If needed a kernel fix can be applied to send an MRA to increase the timeout, but I'm holding off on doing that unless it's really needed. - Sean From mst at mellanox.co.il Thu Dec 14 13:40:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 14 Dec 2006 23:40:21 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: References: <20061214170455.GA12781@mellanox.co.il> <20061214185210.GH12781@mellanox.co.il> Message-ID: <20061214214021.GB19449@mellanox.co.il> > What saves us for the MTT table is that with your patch the device > never writes to the MTT table at all. Except for the reserved MTTs. -- MST From sweitzen at cisco.com Thu Dec 14 13:47:14 2006 From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen)) Date: Thu, 14 Dec 2006 13:47:14 -0800 Subject: [openib-general] Cisco OFED 1.1 now available Message-ID: Cisco OFED 1.1 includes OFED 1.1 source code (same source code as that on openfabrics.org), binary RPMS for RHEL4 and SLES10, firmware for the tvflash utility, and Cisco documentation. 
Anyone who registers at cisco.com can download it, but you need a Cisco support contract to get technical support from Cisco. http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems From rdreier at cisco.com Thu Dec 14 13:54:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 14 Dec 2006 13:54:36 -0800 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: <20061214214021.GB19449@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 14 Dec 2006 23:40:21 +0200") References: <20061214170455.GA12781@mellanox.co.il> <20061214185210.GH12781@mellanox.co.il> <20061214214021.GB19449@mellanox.co.il> Message-ID: > > What saves us for the MTT table is that with your patch the device > > never writes to the MTT table at all. > > Except for the reserved MTTs. Good point. So I guess we need a patch that makes sure all reserved MTTs are given their own ICM chunk (which doesn't need to be in lowmem) to fix things. - R. From bugzilla-daemon at openib.org Thu Dec 14 14:45:05 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 14 Dec 2006 14:45:05 -0800 (PST) Subject: [openib-general] [Bug 172] Need an interface to load alternate path to RC QP Message-ID: <20061214224505.30D002283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=172 sean.hefty at intel.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from sean.hefty at intel.com 2006-12-14 14:45 ------- ib_cm_init_qp_attr() was expanded to handle setting the QP attributes for an alternate path. ib_cm_establish() was renamed to ib_cm_notify() to allow the user to signal to the CM that failover has occurred on a connection.
------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Thu Dec 14 14:46:06 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 14 Dec 2006 14:46:06 -0800 (PST) Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL Message-ID: <20061214224606.ADAD42283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=160 sean.hefty at intel.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from sean.hefty at intel.com 2006-12-14 14:46 ------- Fixed applied to upstream version of ib_cm. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at openib.org Thu Dec 14 15:08:07 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 14 Dec 2006 15:08:07 -0800 (PST) Subject: [openib-general] [Bug 159] OFED1.0: Missing interfaces Message-ID: <20061214230807.94EE02283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=159 sean.hefty at intel.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |ASSIGNED ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From kliteyn at dev.mellanox.co.il Thu Dec 14 15:27:43 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 15 Dec 2006 01:27:43 +0200 Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [1/2] Message-ID: <4581DDEF.7000206@dev.mellanox.co.il> Hi Hal This patch (1/2) adds Fat Tree routing engine to OpenSM. 
-- Yevgeny Signed-off-by: Yevgeny Kliteynik --- osm/opensm/Makefile.am | 2 +- osm/opensm/main.c | 3 ++- osm/opensm/osm_opensm.c | 2 ++ 3 files changed, 5 insertions(+), 2 deletions(-) diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index b273eca..64b984b 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -87,7 +87,7 @@ opensm_SOURCES = main.c osm_console.c os osm_sw_info_rcv_ctrl.c osm_switch.c \ osm_prtn.c osm_prtn_config.c osm_qos.c \ osm_trap_rcv.c osm_trap_rcv_ctrl.c \ - osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \ + osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c osm_ucast_ftree.c \ osm_vl15intf.c osm_vl_arb_rcv.c \ osm_vl_arb_rcv_ctrl.c st.c if OSMV_OPENIB diff --git a/osm/opensm/main.c b/osm/opensm/main.c index ca9a749..7b1c325 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -172,7 +172,8 @@ show_usage(void) printf( "-R\n" "--routing_engine \n" " This option chooses routing engine instead of Min Hop\n" - " algorithm (default). Supported engines: updn, file\n\n"); + " algorithm (default).\n" + " Supported engines: updn, file, ftree.\n\n"); printf( "-M\n" "--lid_matrix_file \n" " This option specifies the name of the lid matrix dump file\n" diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 52ae75a..9cac636 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -74,6 +74,7 @@ struct routing_engine_module { extern int osm_ucast_updn_setup(osm_opensm_t *p_osm); extern int osm_ucast_file_setup(osm_opensm_t *p_osm); +extern int osm_ucast_ftree_setup(osm_opensm_t *p_osm); static int osm_ucast_null_setup(osm_opensm_t *p_osm); @@ -81,6 +82,7 @@ const static struct routing_engine_modul { "null", osm_ucast_null_setup }, { "updn", osm_ucast_updn_setup }, { "file", osm_ucast_file_setup }, + { "ftree", osm_ucast_ftree_setup }, { NULL, NULL } }; -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Thu Dec 14 15:27:59 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 15 Dec 
2006 01:27:59 +0200 Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [2/2] Message-ID: <4581DDFF.2000903@dev.mellanox.co.il> Hi Hal This patch (2/2) adds Fat Tree routing engine to OpenSM. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 2936 ++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 2936 insertions(+), 0 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c new file mode 100644 index 0000000..15e4cd0 --- /dev/null +++ b/osm/opensm/osm_ucast_ftree.c @@ -0,0 +1,2936 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of OpenSM FatTree routing + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* This var is predefined and initialized */ +extern osm_opensm_t osm; + +/* + * FatTree rank is bounded between 2 and 8: + * - Tree of rank 1 has only trivial routing pathes, + * so no need to use FatTree routing. + * - Why maximum rank is 8: + * Each node (switch) is assigned a unique tuple. + * Switches are stored in two cl_qmaps - one is + * ordered by guid, and the other by a key that is + * generated from tuple. Since cl_qmap supports only + * a 64-bit key, the maximal tuple lenght is 8 bytes. + * which means that maximal tree rank is 8. + * Note that the above also implies that each switch + * can have at max 255 up/down ports. 
+ */ + +#define FAT_TREE_MIN_RANK 2 +#define FAT_TREE_MAX_RANK 8 + +typedef enum { + FTREE_DIRECTION_DOWN = -1, + FTREE_DIRECTION_SAME, + FTREE_DIRECTION_UP +} ftree_direction_t; + + +/*************************************************** + ** + ** Forward references + ** + ***************************************************/ + +struct ftree_sw_t_; +struct ftree_hca_t_; +struct ftree_port_t_; +struct ftree_port_group_t_; +struct ftree_fabric_t_; + +/*************************************************** + ** + ** ftree_tuple_t definition + ** + ***************************************************/ + +#define FTREE_TUPLE_BUFF_LEN 1024 +#define FTREE_TUPLE_LEN 8 + +typedef uint8_t ftree_tuple_t[FTREE_TUPLE_LEN]; +typedef uint64_t ftree_tuple_key_t; + +/*************************************************** + ** + ** ftree_sw_table_element_t definition + ** + ***************************************************/ + +typedef struct { + cl_map_item_t map_item; + struct ftree_sw_t_ * p_sw; +} ftree_sw_tbl_element_t; + +/*************************************************** + ** + ** ftree_fwd_tbl_t definition + ** + ***************************************************/ + +typedef uint8_t * ftree_fwd_tbl_t; +#define FTREE_FWD_TBL_LEN (IB_LID_UCAST_END_HO + 1) + +/*************************************************** + ** + ** ftree_port_t definition + ** + ***************************************************/ + +typedef struct ftree_port_t_ +{ + cl_map_item_t map_item; + uint16_t port_num; /* port number on the current node */ + uint16_t remote_port_num; /* port number on the remote node */ + uint32_t counter_up; /* number of allocated routs upwards */ + uint32_t counter_down; /* number of allocated routs downwards */ +} ftree_port_t; + +/*************************************************** + ** + ** ftree_port_group_t definition + ** + ***************************************************/ + +typedef struct ftree_port_group_t_ +{ + cl_map_item_t map_item; + ib_net16_t base_lid; /* base lid 
of the current node */ + uint8_t lmc; /* LMC of the current node */ + ib_net16_t remote_base_lid; /* base lid of the remote node */ + uint8_t remote_lmc; /* LMC of the remote node */ + ib_net64_t port_guid; /* port guid of this port */ + ib_net64_t remote_port_guid; /* port guid of the remote port */ + ib_net64_t remote_node_guid; /* node guid of the remote node */ + uint8_t remote_node_type; /* IB_NODE_TYPE_{CA,SWITCH,ROUTER,...} */ + union remote_hca_or_sw_ + { + struct ftree_hca_t_ * remote_hca; + struct ftree_sw_t_ * remote_sw; + } remote_hca_or_sw; /* pointer to remote hca/switch */ + cl_ptr_vector_t ports; /* vector of ports to the same lid */ +} ftree_port_group_t; + +/*************************************************** + ** + ** ftree_sw_t definition + ** + ***************************************************/ + +typedef struct ftree_sw_t_ +{ + cl_map_item_t map_item; + osm_switch_t * p_osm_sw; + uint8_t rank; + ftree_tuple_t tuple; + ib_net16_t base_lid; + uint8_t lmc; + ftree_port_group_t ** down_port_groups; + uint16_t down_port_groups_num; + ftree_port_group_t ** up_port_groups; + uint16_t up_port_groups_num; + ftree_fwd_tbl_t lft_buf; +} ftree_sw_t; + +/*************************************************** + ** + ** ftree_hca_t definition + ** + ***************************************************/ + +typedef struct ftree_hca_t_ { + cl_map_item_t map_item; + osm_node_t * p_osm_node; + ftree_port_group_t ** up_port_groups; + uint16_t up_port_groups_num; +} ftree_hca_t; + +/*************************************************** + ** + ** ftree_fabric_t definition + ** + ***************************************************/ + +typedef struct ftree_fabric_t_ +{ + cl_qmap_t hca_tbl; + cl_qmap_t sw_tbl; + cl_qmap_t sw_by_tuple_tbl; + uint32_t tree_rank; + ftree_sw_t ** leaf_switches; + uint32_t leaf_switches_num; + uint16_t max_hcas_per_leaf; + cl_pool_t sw_fwd_tbl_pool; +} ftree_fabric_t; + +/*************************************************** + ** + ** comparators 
+ ** + ***************************************************/ + +int +__osm_ftree_compare_switches_by_index( + IN const void * p1, + IN const void * p2) +{ + ftree_sw_t ** pp_sw1 = (ftree_sw_t **)p1; + ftree_sw_t ** pp_sw2 = (ftree_sw_t **)p2; + + uint16_t i; + for (i = 0; i < FTREE_TUPLE_LEN; i++) + { + if ((*pp_sw1)->tuple[i] > (*pp_sw2)->tuple[i]) + return 1; + if ((*pp_sw1)->tuple[i] < (*pp_sw2)->tuple[i]) + return -1; + } + return 0; +} + +/***************************************************/ + +int +__osm_ftree_compare_port_groups_by_remote_switch_index( + IN const void * p1, + IN const void * p2) +{ + ftree_port_group_t ** pp_g1 = (ftree_port_group_t **)p1; + ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2; + + return __osm_ftree_compare_switches_by_index( + &((*pp_g1)->remote_hca_or_sw.remote_sw), + &((*pp_g2)->remote_hca_or_sw.remote_sw) ); +} + +/***************************************************/ + +boolean_t +__osm_ftree_sw_less_by_index( + IN ftree_sw_t * p_sw1, + IN ftree_sw_t * p_sw2) +{ + if (__osm_ftree_compare_switches_by_index((void *)&p_sw1, + (void *)&p_sw2) < 0) + return TRUE; + return FALSE; +} + +/***************************************************/ + +boolean_t +__osm_ftree_sw_greater_by_index( + IN ftree_sw_t * p_sw1, + IN ftree_sw_t * p_sw2) +{ + if (__osm_ftree_compare_switches_by_index((void *)&p_sw1, + (void *)&p_sw2) > 0) + return TRUE; + return FALSE; +} + +/*************************************************** + ** + ** ftree_tuple_t functions + ** + ***************************************************/ + +static void +__osm_ftree_tuple_init( + IN ftree_tuple_t tuple) +{ + memset(tuple, 0xFF, FTREE_TUPLE_LEN); +} + +/***************************************************/ + +static inline boolean_t +__osm_ftree_tuple_assigned( + IN ftree_tuple_t tuple) +{ + return (tuple[0] != 0xFF); +} + +/***************************************************/ + +#define FTREE_TUPLE_BUFFERS_NUM 6 + +static char * +__osm_ftree_tuple_to_str( + IN 
ftree_tuple_t tuple) +{ + static char buffer[FTREE_TUPLE_BUFFERS_NUM][FTREE_TUPLE_BUFF_LEN]; + static uint8_t ind = 0; + char * ret_buffer; + uint32_t i; + + if (!__osm_ftree_tuple_assigned(tuple)) + return "INDEX.NOT.ASSIGNED"; + + buffer[ind][0] = '\0'; + + for(i = 0; (i < FTREE_TUPLE_LEN) && (tuple[i] != 0xFF); i++) + { + if ((strlen(buffer[ind]) + 10) > FTREE_TUPLE_BUFF_LEN) + return "INDEX.TOO.LONG"; + if (i != 0) + strcat(buffer[ind],"."); + sprintf(&buffer[ind][strlen(buffer[ind])], "%u", tuple[i]); + } + + ret_buffer = buffer[ind]; + ind = (ind + 1) % FTREE_TUPLE_BUFFERS_NUM; + return ret_buffer; +} /* __osm_ftree_tuple_to_str() */ + +/***************************************************/ + +static inline ftree_tuple_key_t +__osm_ftree_tuple_to_key( + IN ftree_tuple_t tuple) +{ + ftree_tuple_key_t key; + memcpy(&key, tuple, FTREE_TUPLE_LEN); + return key; +} + +/***************************************************/ + +static inline void +__osm_ftree_tuple_from_key( + IN ftree_tuple_t tuple, + IN ftree_tuple_key_t key) +{ + memcpy(tuple, &key, FTREE_TUPLE_LEN); +} + +/*************************************************** + ** + ** ftree_sw_tbl_element_t functions + ** + ***************************************************/ + +static ftree_sw_tbl_element_t * +__osm_ftree_sw_tbl_element_create( + IN ftree_sw_t * p_sw) +{ + ftree_sw_tbl_element_t * p_element = + (ftree_sw_tbl_element_t *) malloc(sizeof(ftree_sw_tbl_element_t)); + if (!p_element) + return NULL; + memset(p_element, 0,sizeof(ftree_sw_tbl_element_t)); + + if (p_element) + p_element->p_sw = p_sw; + return p_element; +} + +/***************************************************/ + +static void +__osm_ftree_sw_tbl_element_destroy( + IN ftree_sw_tbl_element_t * p_element) +{ + if (!p_element) + return; + free(p_element); +} + +/*************************************************** + ** + ** ftree_port_t functions + ** + ***************************************************/ + +static ftree_port_t * 
+__osm_ftree_port_create( + IN uint16_t port_num, + IN uint16_t remote_port_num) +{ + ftree_port_t * p_port = (ftree_port_t *)malloc(sizeof(ftree_port_t)); + if (!p_port) + return NULL; + memset(p_port,0,sizeof(ftree_port_t)); + + p_port->port_num = port_num; + p_port->remote_port_num = remote_port_num; + + return p_port; +} + +/***************************************************/ + +static void +__osm_ftree_port_destroy( + IN ftree_port_t * p_port) +{ + if(p_port) + free(p_port); +} + +/*************************************************** + ** + ** ftree_port_group_t functions + ** + ***************************************************/ + +static ftree_port_group_t * +__osm_ftree_port_group_create( + IN ib_net16_t base_lid, + IN uint8_t lmc, + IN ib_net16_t remote_base_lid, + IN uint8_t remote_lmc, + IN ib_net64_t * p_port_guid, + IN ib_net64_t * p_remote_port_guid, + IN ib_net64_t * p_remote_node_guid, + IN uint8_t remote_node_type, + IN void * p_remote_hca_or_sw) +{ + ftree_port_group_t * p_group = + (ftree_port_group_t *)malloc(sizeof(ftree_port_group_t)); + if (p_group == NULL) + return NULL; + memset(p_group, 0, sizeof(ftree_port_group_t)); + + p_group->base_lid = base_lid; + p_group->lmc = lmc; + p_group->remote_base_lid = remote_base_lid; + p_group->remote_lmc = remote_lmc; + memcpy(&p_group->port_guid, p_port_guid, sizeof(ib_net64_t)); + memcpy(&p_group->remote_port_guid, p_remote_port_guid, sizeof(ib_net64_t)); + memcpy(&p_group->remote_node_guid, p_remote_node_guid, sizeof(ib_net64_t)); + + p_group->remote_node_type = remote_node_type; + switch (remote_node_type) + { + case IB_NODE_TYPE_CA: + p_group->remote_hca_or_sw.remote_hca = (ftree_hca_t *)p_remote_hca_or_sw; + break; + case IB_NODE_TYPE_SWITCH: + p_group->remote_hca_or_sw.remote_sw = (ftree_sw_t *)p_remote_hca_or_sw; + break; + default: + /* we shouldn't get here - port is created only in hca or switch */ + CL_ASSERT(0); + } + + cl_ptr_vector_init(&p_group->ports, + 0, /* min size */ + 8); /* grow 
size */ + return p_group; +} /* __osm_ftree_port_group_create() */ + +/***************************************************/ + +static void +__osm_ftree_port_group_destroy( + IN ftree_port_group_t * p_group) +{ + uint32_t i; + uint32_t size; + ftree_port_t * p_port; + + if (!p_group) + return; + + /* remove all the elements of p_group->ports vector */ + size = cl_ptr_vector_get_size(&p_group->ports); + for (i = 0; i < size; i++) + { + cl_ptr_vector_at(&p_group->ports, i, (void **)&p_port); + __osm_ftree_port_destroy(p_port); + } + cl_ptr_vector_destroy(&p_group->ports); + free(p_group); +} /* __osm_ftree_port_group_destroy() */ + +/***************************************************/ + +static void +__osm_ftree_port_group_dump( + IN ftree_port_group_t * p_group, + IN ftree_direction_t direction) +{ + ftree_port_t * p_port; + uint32_t size; + uint32_t i; + char buff[10*1024]; + + if (!p_group) + return; + + if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + return; + + size = cl_ptr_vector_get_size(&p_group->ports); + buff[0] = '\0'; + + for (i = 0; i < size; i++) + { + cl_ptr_vector_at(&p_group->ports, i, (void **)&p_port); + CL_ASSERT(p_port); + + if (i != 0) + strcat(buff,", "); + sprintf(buff + strlen(buff), "%u", p_port->port_num); + } + + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_port_group_dump:" + " Port Group of size %u, port(s): %s, direction: %s\n" + " Local <--> Remote GUID (LID):" + "0x%016" PRIx64 " (0x%x) <--> 0x%016" PRIx64 " (0x%x)\n", + size, + buff, + (direction == FTREE_DIRECTION_DOWN)? 
"DOWN" : "UP", + cl_ntoh64(p_group->port_guid), + cl_ntoh16(p_group->base_lid), + cl_ntoh64(p_group->remote_port_guid), + cl_ntoh16(p_group->remote_base_lid)); + +} /* __osm_ftree_port_group_dump() */ + +/***************************************************/ + +static void +__osm_ftree_port_group_add_port( + IN ftree_port_group_t * p_group, + IN uint16_t port_num, + IN uint16_t remote_port_num) +{ + uint16_t i; + ftree_port_t * p_port; + + for (i = 0; i < cl_ptr_vector_get_size(&p_group->ports); i++) + { + cl_ptr_vector_at(&p_group->ports, i, (void **)&p_port); + if (p_port->port_num == port_num) + return; + } + + p_port = __osm_ftree_port_create(port_num,remote_port_num); + cl_ptr_vector_insert(&p_group->ports, p_port, NULL); +} + +/*************************************************** + ** + ** ftree_sw_t functions + ** + ***************************************************/ + +static ftree_sw_t * +__osm_ftree_sw_create( + IN ftree_fabric_t * p_ftree, + IN osm_switch_t * p_osm_sw) +{ + ftree_sw_t * p_sw; + uint8_t ports_num; + + /* make sure that the switch has ports */ + if (osm_switch_get_num_ports(p_osm_sw) == 1) + return NULL; + + p_sw = (ftree_sw_t *)malloc(sizeof(ftree_sw_t)); + if (p_sw == NULL) + return NULL; + memset(p_sw, 0, sizeof(ftree_sw_t)); + + p_sw->p_osm_sw = p_osm_sw; + p_sw->rank = 0xFF; + __osm_ftree_tuple_init(p_sw->tuple); + + p_sw->base_lid = osm_node_get_base_lid(osm_switch_get_node_ptr(p_sw->p_osm_sw),0); + + ports_num = osm_node_get_num_physp(osm_switch_get_node_ptr(p_sw->p_osm_sw)); + p_sw->down_port_groups = + (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *)); + p_sw->up_port_groups = + (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *)); + if (!p_sw->down_port_groups || !p_sw->up_port_groups) + return NULL; + p_sw->down_port_groups_num = 0; + p_sw->up_port_groups_num = 0; + + /* initialize lft buffer */ + p_sw->lft_buf = (ftree_fwd_tbl_t)cl_pool_get(&p_ftree->sw_fwd_tbl_pool); + 
memset(p_sw->lft_buf, OSM_NO_PATH, FTREE_FWD_TBL_LEN); + + return p_sw; +} /* __osm_ftree_sw_create() */ + +/***************************************************/ + +static void +__osm_ftree_sw_destroy( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw) +{ + uint8_t i; + + if (!p_sw) + return; + + for (i = 0; i < p_sw->down_port_groups_num; i++) + __osm_ftree_port_group_destroy(p_sw->down_port_groups[i]); + for (i = 0; i < p_sw->up_port_groups_num; i++) + __osm_ftree_port_group_destroy(p_sw->up_port_groups[i]); + if (p_sw->down_port_groups) + free(p_sw->down_port_groups); + if (p_sw->up_port_groups) + free(p_sw->up_port_groups); + + /* return switch fwd_tbl to pool */ + if (p_sw->lft_buf) + cl_pool_put(&p_ftree->sw_fwd_tbl_pool, (void *)p_sw->lft_buf); + + free(p_sw); +} /* __osm_ftree_sw_destroy() */ + +/***************************************************/ + +static void +__osm_ftree_sw_dump( + IN ftree_sw_t * p_sw) +{ + uint32_t i; + if (!p_sw) + return; + + if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + return; + + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_sw_dump: " + "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n", + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + p_sw->down_port_groups_num, + p_sw->up_port_groups_num); + + for( i = 0; i < p_sw->down_port_groups_num; i++ ) + __osm_ftree_port_group_dump(p_sw->down_port_groups[i], FTREE_DIRECTION_DOWN); + for( i = 0; i < p_sw->up_port_groups_num; i++ ) + __osm_ftree_port_group_dump(p_sw->up_port_groups[i], FTREE_DIRECTION_UP); + +} /* __osm_ftree_sw_dump() */ + +/***************************************************/ + +static boolean_t +__osm_ftree_sw_ranked( + IN ftree_sw_t * p_sw) +{ + return (p_sw->rank != 0xFF); +} + +/***************************************************/ + +static ftree_port_group_t * +__osm_ftree_sw_get_port_group_by_remote_lid( + IN ftree_sw_t * p_sw, + IN ib_net16_t remote_base_lid, + IN 
ftree_direction_t direction) +{ + uint32_t i; + uint32_t size; + ftree_port_group_t ** port_groups; + + if (direction == FTREE_DIRECTION_UP) + { + port_groups = p_sw->up_port_groups; + size = p_sw->up_port_groups_num; + } + else + { + port_groups = p_sw->down_port_groups; + size = p_sw->down_port_groups_num; + } + + for (i = 0; i < size; i++) + if (remote_base_lid == port_groups[i]->remote_base_lid) + return port_groups[i]; + + return NULL; +} /* __osm_ftree_sw_get_port_group_by_remote_lid() */ + +/***************************************************/ + +static void +__osm_ftree_sw_add_port( + IN ftree_sw_t * p_sw, + IN uint16_t port_num, + IN uint16_t remote_port_num, + IN ib_net16_t base_lid, + IN uint8_t lmc, + IN ib_net16_t remote_base_lid, + IN uint8_t remote_lmc, + IN ib_net64_t port_guid, + IN ib_net64_t remote_port_guid, + IN ib_net64_t remote_node_guid, + IN uint8_t remote_node_type, + IN void * p_remote_hca_or_sw, + IN ftree_direction_t direction) +{ + ftree_port_group_t * p_group = + __osm_ftree_sw_get_port_group_by_remote_lid(p_sw,remote_base_lid,direction); + + if (!p_group) + { + p_group = __osm_ftree_port_group_create( + base_lid, + lmc, + remote_base_lid, + remote_lmc, + &port_guid, + &remote_port_guid, + &remote_node_guid, + remote_node_type, + p_remote_hca_or_sw); + CL_ASSERT(p_group); + + if (direction == FTREE_DIRECTION_UP) + p_sw->up_port_groups[p_sw->up_port_groups_num++] = p_group; + else + p_sw->down_port_groups[p_sw->down_port_groups_num++] = p_group; + } + __osm_ftree_port_group_add_port(p_group,port_num,remote_port_num); +} /* __osm_ftree_sw_add_port() */ + +/***************************************************/ + +static void +__osm_ftree_sw_set_fwd_table_block( + IN ftree_sw_t * p_sw, + IN uint16_t lid_ho, + IN uint8_t port_num) +{ + p_sw->lft_buf[lid_ho] = port_num; +} + +/***************************************************/ + +static uint8_t +__osm_ftree_sw_get_fwd_table_block( + IN ftree_sw_t * p_sw, + IN uint16_t lid_ho) +{ + return 
p_sw->lft_buf[lid_ho]; +} + +/*************************************************** + ** + ** ftree_hca_t functions + ** + ***************************************************/ + +static ftree_hca_t * +__osm_ftree_hca_create( + IN osm_node_t * p_osm_node) +{ + ftree_hca_t * p_hca = (ftree_hca_t *)malloc(sizeof(ftree_hca_t)); + if (p_hca == NULL) + return NULL; + memset(p_hca,0,sizeof(ftree_hca_t)); + + p_hca->p_osm_node = p_osm_node; + p_hca->up_port_groups = (ftree_port_group_t **) + malloc(osm_node_get_num_physp(p_hca->p_osm_node) * sizeof (ftree_port_group_t *)); + if (!p_hca->up_port_groups) + return NULL; + p_hca->up_port_groups_num = 0; + return p_hca; +} + +/***************************************************/ + +static void +__osm_ftree_hca_destroy( + IN ftree_hca_t * p_hca) +{ + uint32_t i; + + if (!p_hca) + return; + + for (i = 0; i < p_hca->up_port_groups_num; i++) + __osm_ftree_port_group_destroy(p_hca->up_port_groups[i]); + + if (p_hca->up_port_groups) + free(p_hca->up_port_groups); + + free(p_hca); +} + +/***************************************************/ + +static void +__osm_ftree_hca_dump( + IN ftree_hca_t * p_hca) +{ + uint32_t i; + if (!p_hca) + return; + + if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + return; + + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_hca_dump: " + "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", + cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), + p_hca->up_port_groups_num); + + for( i = 0; i < p_hca->up_port_groups_num; i++ ) + __osm_ftree_port_group_dump(p_hca->up_port_groups[i],FTREE_DIRECTION_UP); +} + +/***************************************************/ + +static ftree_port_group_t * +__osm_ftree_hca_get_port_group_by_remote_lid( + IN ftree_hca_t * p_hca, + IN ib_net16_t remote_base_lid) +{ + uint32_t i; + for (i = 0; i < p_hca->up_port_groups_num; i++) + if (remote_base_lid == p_hca->up_port_groups[i]->remote_base_lid) + return p_hca->up_port_groups[i]; + + return NULL; +} + 
+/***************************************************/ + +static void +__osm_ftree_hca_add_port( + IN ftree_hca_t * p_hca, + IN uint16_t port_num, + IN uint16_t remote_port_num, + IN ib_net16_t base_lid, + IN uint8_t lmc, + IN ib_net16_t remote_base_lid, + IN uint8_t remote_lmc, + IN ib_net64_t port_guid, + IN ib_net64_t remote_port_guid, + IN ib_net64_t remote_node_guid, + IN uint8_t remote_node_type, + IN void * p_remote_hca_or_sw) +{ + ftree_port_group_t * p_group; + + /* this function is supposed to be called only for adding ports + in hca's that lead to switches */ + CL_ASSERT(remote_node_type == IB_NODE_TYPE_SWITCH); + + p_group = __osm_ftree_hca_get_port_group_by_remote_lid(p_hca,remote_base_lid); + + if (!p_group) + { + p_group = __osm_ftree_port_group_create( + base_lid, + lmc, + remote_base_lid, + remote_lmc, + &port_guid, + &remote_port_guid, + &remote_node_guid, + remote_node_type, + p_remote_hca_or_sw); + p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group; + } + __osm_ftree_port_group_add_port(p_group, port_num, remote_port_num); + +} /* __osm_ftree_hca_add_port() */ + +/*************************************************** + ** + ** ftree_fabric_t functions + ** + ***************************************************/ + +static ftree_fabric_t * +__osm_ftree_fabric_create() +{ + cl_status_t status; + ftree_fabric_t * p_ftree = (ftree_fabric_t *)malloc(sizeof(ftree_fabric_t)); + if (p_ftree == NULL) + return NULL; + + memset(p_ftree,0,sizeof(ftree_fabric_t)); + + cl_qmap_init(&p_ftree->hca_tbl); + cl_qmap_init(&p_ftree->sw_tbl); + cl_qmap_init(&p_ftree->sw_by_tuple_tbl); + + status = cl_pool_init( &p_ftree->sw_fwd_tbl_pool, + 8, /* min pool size */ + 0, /* max pool size - unlimited */ + 8, /* grow size */ + FTREE_FWD_TBL_LEN, /* object_size */ + NULL, /* object initializer */ + NULL, /* object destructor */ + NULL ); /* context */ + if (status != CL_SUCCESS) + return NULL; + + p_ftree->tree_rank = 1; + return p_ftree; +} + 
+/***************************************************/ + +static void +__osm_ftree_fabric_clear(ftree_fabric_t * p_ftree) +{ + ftree_hca_t * p_hca; + ftree_hca_t * p_next_hca; + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + ftree_sw_tbl_element_t * p_element; + ftree_sw_tbl_element_t * p_next_element; + + if (!p_ftree) + return; + + /* remove all the elements of hca_tbl */ + + p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) ) + { + p_hca = p_next_hca; + p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item ); + __osm_ftree_hca_destroy(p_hca); + } + cl_qmap_remove_all(&p_ftree->hca_tbl); + + /* remove all the elements of sw_tbl */ + + p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) ) + { + p_sw = p_next_sw; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + __osm_ftree_sw_destroy(p_ftree,p_sw); + } + cl_qmap_remove_all(&p_ftree->sw_tbl); + + /* remove all the elements of sw_by_tuple_tbl */ + + p_next_element = + (ftree_sw_tbl_element_t *)cl_qmap_head(&p_ftree->sw_by_tuple_tbl); + while( p_next_element != + (ftree_sw_tbl_element_t *)cl_qmap_end( &p_ftree->sw_by_tuple_tbl ) ) + { + p_element = p_next_element; + p_next_element = + (ftree_sw_tbl_element_t *)cl_qmap_next(&p_element->map_item); + __osm_ftree_sw_tbl_element_destroy(p_element); + } + cl_qmap_remove_all(&p_ftree->sw_by_tuple_tbl); + + /* free the leaf switches array */ + if ((p_ftree->leaf_switches_num > 0) && (p_ftree->leaf_switches)) + free(p_ftree->leaf_switches); + + p_ftree->leaf_switches_num = 0; + p_ftree->leaf_switches = NULL; + +} /* __osm_ftree_fabric_clear() */ + +/***************************************************/ + +static void +__osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree) +{ + if (!p_ftree) + return; + __osm_ftree_fabric_clear(p_ftree); + cl_pool_destroy(&p_ftree->sw_fwd_tbl_pool); + free(p_ftree); +} + 
+/***************************************************/ + +static void +__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint16_t rank) +{ + if (rank > p_ftree->tree_rank) + p_ftree->tree_rank = rank; +} + +/***************************************************/ + +static uint16_t +__osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree) +{ + return p_ftree->tree_rank; +} + +/***************************************************/ + +static void +__osm_ftree_fabric_add_hca(ftree_fabric_t * p_ftree, osm_node_t * p_osm_node) +{ + ftree_hca_t * p_hca = __osm_ftree_hca_create(p_osm_node); + + CL_ASSERT(osm_node_get_type(p_osm_node) == IB_NODE_TYPE_CA); + + cl_qmap_insert(&p_ftree->hca_tbl, + p_osm_node->node_info.node_guid, + &p_hca->map_item); +} + +/***************************************************/ + +static void +__osm_ftree_fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw) +{ + ftree_sw_t * p_sw = __osm_ftree_sw_create(p_ftree,p_osm_sw); + + CL_ASSERT(osm_node_get_type(p_osm_sw->p_node) == IB_NODE_TYPE_SWITCH); + + cl_qmap_insert(&p_ftree->sw_tbl, + p_osm_sw->p_node->node_info.node_guid, + &p_sw->map_item); +} + +/***************************************************/ + +static void +__osm_ftree_fabric_add_sw_by_tuple( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw) +{ + CL_ASSERT(__osm_ftree_tuple_assigned(p_sw->tuple)); + + cl_qmap_insert(&p_ftree->sw_by_tuple_tbl, + __osm_ftree_tuple_to_key(p_sw->tuple), + &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); +} + +/***************************************************/ + +static ftree_sw_t * +__osm_ftree_fabric_get_sw_by_tuple( + IN ftree_fabric_t * p_ftree, + IN ftree_tuple_t tuple) +{ + ftree_sw_tbl_element_t * p_element; + + CL_ASSERT(__osm_ftree_tuple_assigned(tuple)); + + p_element = (ftree_sw_tbl_element_t * )cl_qmap_get(&p_ftree->sw_by_tuple_tbl, + __osm_ftree_tuple_to_key(tuple)); + if (p_element == (ftree_sw_tbl_element_t * 
)cl_qmap_end(&p_ftree->sw_by_tuple_tbl)) + return NULL; + + return p_element->p_sw; +} + +/***************************************************/ + +static void +__osm_ftree_fabric_dump(ftree_fabric_t * p_ftree) +{ + uint32_t i; + ftree_hca_t * p_hca; + ftree_sw_t * p_sw; + + if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + return; + + osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n" + " |-------------------------------|\n" + " |- Full fabric topology dump -|\n" + " |-------------------------------|\n\n"); + + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_dump: -- HCAs:\n"); + + for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl); + p_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item) ) + { + __osm_ftree_hca_dump(p_hca); + } + + for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++) + { + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_dump: -- Rank %u switches\n", i); + for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl); + p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) ) + { + if (p_sw->rank == i) + __osm_ftree_sw_dump(p_sw); + } + } + + osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n" + " |---------------------------------------|\n" + " |- Full fabric topology dump completed -|\n" + " |---------------------------------------|\n\n"); +} /* __osm_ftree_fabric_dump() */ + +/***************************************************/ + +static void +__osm_ftree_fabric_dump_general_info( + IN ftree_fabric_t * p_ftree) +{ + uint32_t i,j; + ftree_sw_t * p_sw; + char * addition_str; + + osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info:\n"); + osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + "General fabric topology info\n"); + osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + "============================\n"); + + osm_log(&osm.log, 
OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + " - FatTree rank (switches only): %u\n", + p_ftree->tree_rank); + osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + " - Fabric has %u HCAs, %u switches\n", + cl_qmap_count(&p_ftree->hca_tbl), + cl_qmap_count(&p_ftree->sw_tbl)); + + for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++) + { + j = 0; + for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl); + p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) ) + { + if (p_sw->rank == i) + j++; + } + if (i == 0) + addition_str = " (root) "; + else + if (i == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + addition_str = " (leaf) "; + else + addition_str = " "; + osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + " - Fabric has %u rank %u%sswitches\n",j,i,addition_str); + } + + if (osm_log_is_active(&osm.log,OSM_LOG_VERBOSE)) + { + osm_log(&osm.log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_dump_general_info: " + " - Root switches:\n"); + for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl); + p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) ) + { + if (p_sw->rank == 0) + osm_log(&osm.log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_dump_general_info: " + " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n", + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + cl_ntoh16(p_sw->base_lid), + __osm_ftree_tuple_to_str(p_sw->tuple)); + } + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_dump_general_info: " + " - Leaf switches (sorted by index):\n"); + for (i = 0; i < p_ftree->leaf_switches_num; i++) + { + osm_log(&osm.log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_dump_general_info: " + " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n", + cl_ntoh64(osm_node_get_node_guid( + osm_switch_get_node_ptr(p_ftree->leaf_switches[i]->p_osm_sw))), + 
cl_ntoh16(p_ftree->leaf_switches[i]->base_lid), + __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple)); + } + } +} /* __osm_ftree_fabric_dump_general_info() */ + +/***************************************************/ + +static void +__osm_ftree_fabric_dump_hca_ordering( + IN ftree_fabric_t * p_ftree) +{ + ftree_hca_t * p_hca; + ftree_sw_t * p_sw; + ftree_port_group_t * p_group; + uint32_t i; + uint32_t j; + + char desc[IB_NODE_DESCRIPTION_SIZE + 1]; + char path[1024]; + FILE * p_hca_ordering_file; + char * filename = "osm-ftree-ca-order.dump"; + + snprintf(path, sizeof(path), "%s/%s", + osm.subn.opt.dump_files_dir, filename); + p_hca_ordering_file = fopen(path, "w"); + if (!p_hca_ordering_file) + { + osm_log(&osm.log, OSM_LOG_ERROR, + "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: " + "cannot open file \'%s\': %s\n", + filename, strerror(errno)); + OSM_LOG_EXIT(&(osm.log)); + return; + } + + /* for each leaf switch (in indexing order) */ + for(i = 0; i < p_ftree->leaf_switches_num; i++) + { + p_sw = p_ftree->leaf_switches[i]; + /* for each real HCA connected to this switch */ + for (j = 0; j < p_sw->down_port_groups_num; j++) + { + p_group = p_sw->down_port_groups[j]; + p_hca = p_group->remote_hca_or_sw.remote_hca; + memcpy(desc,p_hca->p_osm_node->node_desc.description,IB_NODE_DESCRIPTION_SIZE); + desc[IB_NODE_DESCRIPTION_SIZE] = '\0'; + + fprintf(p_hca_ordering_file,"0x%x\t%s\n", + cl_ntoh16(p_group->remote_base_lid), desc); + } + + /* now print dummy HCAs */ + for (j = p_sw->down_port_groups_num; j < p_ftree->max_hcas_per_leaf; j++) + { + fprintf(p_hca_ordering_file,"0xFFFF\tDUMMY\n"); + } + + } + /* done going through all the leaf switches */ + + fclose(p_hca_ordering_file); +} /* __osm_ftree_fabric_dump_hca_ordering() */ + +/***************************************************/ + +static void +__osm_ftree_fabric_assign_tuple( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw, + IN ftree_tuple_t new_tuple) +{ + memcpy(p_sw->tuple, new_tuple, 
FTREE_TUPLE_LEN); + __osm_ftree_fabric_add_sw_by_tuple(p_ftree,p_sw); +} + +/***************************************************/ + +static void +__osm_ftree_fabric_assign_first_tuple( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw) +{ + uint8_t i; + ftree_tuple_t new_tuple; + + __osm_ftree_tuple_init(new_tuple); + new_tuple[0] = p_sw->rank; + for (i = 1; i <= p_sw->rank; i++) + new_tuple[i] = 0; + + __osm_ftree_fabric_assign_tuple(p_ftree,p_sw,new_tuple); +} + +/***************************************************/ + +static void +__osm_ftree_fabric_get_new_tuple( + IN ftree_fabric_t * p_ftree, + OUT ftree_tuple_t new_tuple, + IN ftree_tuple_t from_tuple, + IN ftree_direction_t direction) +{ + ftree_sw_t * p_sw; + ftree_tuple_t temp_tuple; + uint8_t var_index; + uint8_t i; + + __osm_ftree_tuple_init(new_tuple); + memcpy(temp_tuple, from_tuple, FTREE_TUPLE_LEN); + + if (direction == FTREE_DIRECTION_DOWN) + { + temp_tuple[0] ++; + var_index = from_tuple[0] + 1; + } + else + { + temp_tuple[0] --; + var_index = from_tuple[0]; + } + + for (i = 0; i < 0xFF; i++) + { + temp_tuple[var_index] = i; + p_sw = __osm_ftree_fabric_get_sw_by_tuple(p_ftree,temp_tuple); + if (p_sw == NULL) /* found free tuple */ + break; + } + + if (i == 0xFF) + { + /* new tuple not found - there are more than 255 ports in one direction */ + return; + } + memcpy(new_tuple, temp_tuple, FTREE_TUPLE_LEN); + +} /* __osm_ftree_fabric_get_new_tuple() */ + +/***************************************************/ + +static void +__osm_ftree_fabric_calculate_rank( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + uint16_t max_rank = 0; + + /* go over all the switches and find maximal switch rank */ + + p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) + { + p_sw = p_next_sw; + if(p_sw->rank > max_rank) + max_rank = p_sw->rank; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + } + + /* 
set FatTree rank */ + __osm_ftree_fabric_set_rank(p_ftree, max_rank + 1); +} + +/***************************************************/ + +static void +__osm_ftree_fabric_make_indexing( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_remote_sw; + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + ftree_tuple_t new_tuple; + uint32_t i; + cl_list_t bfs_list; + ftree_sw_tbl_element_t * p_sw_tbl_element; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_make_indexing); + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: " + "Starting FatTree indexing\n"); + + /* create array of leaf switches */ + p_ftree->leaf_switches = (ftree_sw_t **) + malloc(cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *)); + + /* Looking for a leaf switch - the one that has rank equal to (tree_rank - 1). + This switch will be used as a starting point for indexing algorithm. */ + + p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) ) + { + p_sw = p_next_sw; + if(p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + break; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + } + + CL_ASSERT(p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)); + + /* Assign the first tuple to the switch that is used as BFS starting point. + The tuple will be as follows: [rank].0.0.0... + This function also adds the switch into the switch_by_tuple table. 
*/ + __osm_ftree_fabric_assign_first_tuple(p_ftree,p_sw); + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: " + "Indexing starting point:\n" + " - Switch rank : %u\n" + " - Switch index : %s\n" + " - Node LID : 0x%x\n" + " - Node GUID : 0x%016" PRIx64 "\n", + p_sw->rank, + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ntoh16(p_sw->base_lid), + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw)))); + + /* + * Now run BFS and assign indexes to all switches + * Pseudo code of the algorithm is as follows: + * + * * Add first switch to BFS queue + * * While (BFS queue not empty) + * - Pop the switch from the head of the queue + * - Scan all the downward and upward ports + * - For each port + * + Get the remote switch + * + Assign index to the remote switch + * + Add remote switch to the BFS queue + */ + + cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); + cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_sw)->map_item); + + while (!cl_is_list_empty(&bfs_list)) + { + p_sw_tbl_element = (ftree_sw_tbl_element_t *)cl_list_remove_head(&bfs_list); + p_sw = p_sw_tbl_element->p_sw; + __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element); + + /* Discover all the nodes from ports that are pointing down */ + + if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + { + /* add switch to leaf switches array */ + p_ftree->leaf_switches[p_ftree->leaf_switches_num++] = p_sw; + /* update the max_hcas_per_leaf value */ + if (p_sw->down_port_groups_num > p_ftree->max_hcas_per_leaf) + p_ftree->max_hcas_per_leaf = p_sw->down_port_groups_num; + } + else + { + /* This is not the leaf switch, which means that all the + ports that point down are taking us to another switches. 
+ No need to assign indexing to HCAs */ + for( i = 0; i < p_sw->down_port_groups_num; i++ ) + { + p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.remote_sw; + if (__osm_ftree_tuple_assigned(p_remote_sw->tuple)) + { + /* this switch has been already indexed */ + continue; + } + /* allocate new tuple */ + __osm_ftree_fabric_get_new_tuple(p_ftree, + new_tuple, + p_sw->tuple, + FTREE_DIRECTION_DOWN); + /* Assign the new tuple to the remote switch. + This fuction also adds the switch into the switch_by_tuple table. */ + __osm_ftree_fabric_assign_tuple(p_ftree, + p_remote_sw, + new_tuple); + + /* add the newly discovered switch to the BFS queue */ + cl_list_insert_tail(&bfs_list, + &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); + } + /* Done assigning indexes to all the remote switches + that are pointed by the downgoing ports. + Now sort port groups according to remote index. */ + qsort(p_sw->down_port_groups, /* array */ + p_sw->down_port_groups_num, /* number of elements */ + sizeof(ftree_port_group_t *), /* size of each element */ + __osm_ftree_compare_port_groups_by_remote_switch_index); /* comparator */ + } + + /* Done indexing switches from ports that go down. + Now do the same with ports that are pointing up. */ + + if (p_sw->rank != 0) + { + /* This is not the root switch, which means that all the ports + that are pointing up are taking us to another switches. */ + for( i = 0; i < p_sw->up_port_groups_num; i++ ) + { + p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.remote_sw; + if (__osm_ftree_tuple_assigned(p_remote_sw->tuple)) + continue; + /* allocate new tuple */ + __osm_ftree_fabric_get_new_tuple(p_ftree, + new_tuple, + p_sw->tuple, + FTREE_DIRECTION_UP); + /* Assign the new tuple to the remote switch. + This fuction also adds the switch to the + switch_by_tuple table. 
*/ + __osm_ftree_fabric_assign_tuple(p_ftree, + p_remote_sw, + new_tuple); + /* add the newly discovered switch to the BFS queue */ + cl_list_insert_tail(&bfs_list, + &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); + } + /* Done assigning indexes to all the remote switches + that are pointed by the upgoing ports. + Now sort port groups according to remote index. */ + qsort(p_sw->up_port_groups, /* array */ + p_sw->up_port_groups_num, /* number of elements */ + sizeof(ftree_port_group_t *), /* size of each element */ + __osm_ftree_compare_port_groups_by_remote_switch_index); /* comparator */ + } + /* Done assigning indexes to all the switches that are directly connected + to the current switch - go to the next switch in the BFS queue */ + } + + /* sort array of leaf switches by index */ + qsort(p_ftree->leaf_switches, /* array */ + p_ftree->leaf_switches_num, /* number of elements */ + sizeof(ftree_sw_t *), /* size of each element */ + __osm_ftree_compare_switches_by_index); /* comparator */ + + OSM_LOG_EXIT(&(osm.log)); +} /* __osm_ftree_fabric_make_indexing() */ + +/***************************************************/ + +static boolean_t +__osm_ftree_fabric_validate_topology( + IN ftree_fabric_t * p_ftree) +{ + ftree_port_group_t * p_group; + ftree_port_group_t * p_ref_group; + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + ftree_sw_t ** reference_sw_arr; + uint16_t tree_rank = __osm_ftree_fabric_get_rank(p_ftree); + boolean_t res = TRUE; + uint8_t i; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_validate_topology); + + osm_log(&osm.log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_validate_topology: " + "Validating fabric topology\n"); + + reference_sw_arr = (ftree_sw_t **)malloc(tree_rank * sizeof(ftree_sw_t *)); + if ( reference_sw_arr == NULL ) + { + osm_log(&osm.log, OSM_LOG_SYS,"Fat-tree routing: Memory allocation failed\n"); + return FALSE; + } + memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *)); + + p_next_sw = (ftree_sw_t 
*)cl_qmap_head(&p_ftree->sw_tbl); + while( res && + p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) ) + { + p_sw = p_next_sw; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + + if (!reference_sw_arr[p_sw->rank]) + { + /* This is the first switch in the current level that + we're checking - use it as a reference */ + reference_sw_arr[p_sw->rank] = p_sw; + } + else + { + /* compare this switch properties to the reference switch */ + + if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != p_sw->up_port_groups_num ) + { + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + "ERR AB09: Different number of upward port groups on switches:\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n", + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))), + cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), + __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), + reference_sw_arr[p_sw->rank]->up_port_groups_num, + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + cl_ntoh16(p_sw->base_lid), + __osm_ftree_tuple_to_str(p_sw->tuple), + p_sw->up_port_groups_num); + res = FALSE; + break; + } + + if ( p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1) && + reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num ) + { + /* we're allowing some hca's to be missing */ + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + "ERR AB0A: Different number of downward port groups on switches:\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n", + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))), + cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), + __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), + 
reference_sw_arr[p_sw->rank]->down_port_groups_num, + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + cl_ntoh16(p_sw->base_lid), + __osm_ftree_tuple_to_str(p_sw->tuple), + p_sw->down_port_groups_num); + res = FALSE; + break; + } + + if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != 0 ) + { + p_ref_group = reference_sw_arr[p_sw->rank]->up_port_groups[0]; + for (i = 0; i < p_sw->up_port_groups_num; i++) + { + p_group = p_sw->up_port_groups[i]; + if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports)) + { + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + "ERR AB0B: Different number of ports in an upward port group on switches:\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))), + cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), + __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), + cl_ptr_vector_get_size(&p_ref_group->ports), + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + cl_ntoh16(p_sw->base_lid), + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ptr_vector_get_size(&p_group->ports)); + res = FALSE; + break; + } + } + } + if ( reference_sw_arr[p_sw->rank]->down_port_groups_num != 0 && + p_sw->rank != (tree_rank - 1) ) + { + /* we're allowing some hca's to be missing */ + p_ref_group = reference_sw_arr[p_sw->rank]->down_port_groups[0]; + for (i = 0; i < p_sw->down_port_groups_num; i++) + { + p_group = p_sw->down_port_groups[i]; + if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports)) + { + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + "ERR AB0C: Different number of ports in a downward port group on switches:\n" + " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" + " GUID 0x%016" 
PRIx64 ", LID 0x%x, Index %s - %u ports\n", + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))), + cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid), + __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple), + cl_ptr_vector_get_size(&p_ref_group->ports), + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + cl_ntoh16(p_sw->base_lid), + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ptr_vector_get_size(&p_group->ports)); + res = FALSE; + break; + } + } + } + } /* end of else */ + } /* end of while */ + + if (res == TRUE) + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_validate_topology: " + "Fabric topology has been identified as FatTree\n"); + else + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + "ERR AB0D: Fabric topology hasn't been identified as FatTree\n"); + + free(reference_sw_arr); + OSM_LOG_EXIT(&(osm.log)); + return res; +} /* __osm_ftree_fabric_validate_topology() */ + +/*************************************************** + ***************************************************/ + +static void +__osm_ftree_set_sw_fwd_table( + IN cl_map_item_t* const p_map_item, + IN void *context) +{ + ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item; + memcpy(osm.sm.ucast_mgr.lft_buf, p_sw->lft_buf, FTREE_FWD_TBL_LEN); + osm_ucast_mgr_set_fwd_table(&osm.sm.ucast_mgr,p_sw->p_osm_sw); +} + +/*************************************************** + ***************************************************/ + +/* + * Function: assign-up-going-port-by-descending-down + * Given : a switch and a LID + * Pseudo code: + * foreach down-going-port-group (in indexing order) + * skip this group if the LFT(LID) port is part of this group + * find the least loaded port of the group (scan in indexing order) + * r-port is the remote port connected to it + * assign the remote switch node LFT(LID) to r-port + * increase r-port usage counter + * assign-up-going-port-by-descending-down 
to r-port node (recursion) + */ + +static void +__osm_ftree_fabric_route_upgoing_by_going_down( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw, + IN ftree_sw_t * p_prev_sw, + IN ib_net16_t target_lid, + IN boolean_t is_real_lid, + IN boolean_t is_main_path) +{ + ftree_sw_t * p_remote_sw; + uint16_t ports_num; + ftree_port_group_t * p_group; + ftree_port_t * p_port; + ftree_port_t * p_min_port; + uint16_t i; + uint16_t j; + + /* we shouldn't enter here if both real_lid and main_path are false */ + CL_ASSERT(is_real_lid || is_main_path); + + /* can't be here for leaf switch, */ + CL_ASSERT(p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)); + + /* if there is no down-going ports */ + if (p_sw->down_port_groups_num == 0) + return; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_upgoing_by_going_down); + + /* foreach down-going port group (in indexing order) */ + for (i = 0; i < p_sw->down_port_groups_num; i++) + { + p_group = p_sw->down_port_groups[i]; + + if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) ) + { + /* This port group has a port that was used when we entered this switch, + which means that the current group points to the switch where we were + at the previous step of the algorithm (before going up). + Skipping this group. */ + continue; + } + + /* find the least loaded port of the group (in indexing order) */ + p_min_port = NULL; + ports_num = cl_ptr_vector_get_size(&p_group->ports); + /* ToDo: no need to select a least loaded port for non-main path. + Think about optimization. 
*/ + for (j = 0; j < ports_num; j++) + { + cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port); + if (!p_min_port) + { + /* first port that we're checking - set as port with the lowest load */ + p_min_port = p_port; + } + else if (p_port->counter_up < p_min_port->counter_up) + { + /* this port is less loaded - use it as min */ + p_min_port = p_port; + } + } + /* At this point we have selected a port in this group with the + lowest load of upgoing routes. + Set on the remote switch how to get to the target_lid - + set LFT(target_lid) on the remote switch to the remote port */ + p_remote_sw = p_group->remote_hca_or_sw.remote_sw; + + /* Four possible cases: + * + * 1. is_real_lid == TRUE && is_main_path == TRUE: + * - going DOWN(TRUE,TRUE) through ALL the groups + * + promoting port counter + * + setting path in remote switch fwd tbl + * + * 2. is_real_lid == TRUE && is_main_path == FALSE: + * - going DOWN(TRUE,FALSE) through ALL the groups but only if + * the remote (upper) switch hasn't been already configured + * for this target LID + * + NOT promoting port counter + * + setting path in remote switch fwd tbl if it hasn't been set yet + * + * 3. is_real_lid == FALSE && is_main_path == TRUE: + * - going DOWN(FALSE,TRUE) through ALL the groups + * + promoting port counter + * + NOT setting path in remote switch fwd tbl + * + * 4. 
is_real_lid == FALSE && is_main_path == FALSE: + * - illegal state - we shouldn't get here + */ + + /* second case: skip the port group if the remote (upper) + switch has been already configured for this target LID */ + if ( is_real_lid && !is_main_path && + __osm_ftree_sw_get_fwd_table_block(p_remote_sw, + cl_ntoh16(target_lid)) != OSM_NO_PATH ) + continue; + + /* setting fwd tbl port only if this is real LID */ + if (is_real_lid) + { + __osm_ftree_sw_set_fwd_table_block(p_remote_sw, + cl_ntoh16(target_lid), + p_min_port->remote_port_num); + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_upgoing_by_going_down: " + "Switch %s: set path to HCA LID 0x%x through port %u\n", + __osm_ftree_tuple_to_str(p_remote_sw->tuple), + cl_ntoh16(target_lid), + p_min_port->remote_port_num); + } + + /* The number of upgoing routes is tracked in the + p_port->counter_up counter of the port that belongs to + the upper side of the link (on switch with lower rank). + Counter is promoted only if we're routing LID on the main + path (whether it's a real LID or a dummy one). */ + if (is_main_path) + p_min_port->counter_up++; + + /* Recursion step: + Assign upgoing ports by stepping down, starting on REMOTE switch. + Recursion stop condition - if the REMOTE switch is a leaf switch. */ + if (p_remote_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + { + __osm_ftree_fabric_route_upgoing_by_going_down( + p_ftree, + p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ + NULL, /* prev. 
position - NULL to mark that we went down and not up */ + target_lid, /* LID that we're routing to */ + is_real_lid, /* whether the target LID is real or dummy */ + is_main_path); /* whether this is path to HCA that should by tracked by counters */ + } + } + /* done scanning all the down-going port groups */ + + OSM_LOG_EXIT(&(osm.log)); +} /* __osm_ftree_fabric_route_upgoing_by_going_down() */ + +/***************************************************/ + +/* + * Function: assign-down-going-port-by-descending-up + * Given : a switch and a LID + * Pseudo code: + * find the least loaded port of all the upgoing groups (scan in indexing order) + * assign the LFT(LID) of remote switch to that port + * track that port usage + * assign-up-going-port-by-descending-down on CURRENT switch + * assign-down-going-port-by-descending-up on REMOTE switch (recursion) + */ + +static void +__osm_ftree_fabric_route_downgoing_by_going_up( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw, + IN ftree_sw_t * p_prev_sw, + IN ib_net16_t target_lid, + IN boolean_t is_real_lid, + IN boolean_t is_main_path) +{ + ftree_sw_t * p_remote_sw; + uint16_t ports_num; + ftree_port_group_t * p_group; + ftree_port_t * p_port; + ftree_port_group_t * p_min_group; + ftree_port_t * p_min_port; + uint16_t i; + uint16_t j; + + /* we shouldn't enter here if both real_lid and main_path are false */ + CL_ASSERT(is_real_lid || is_main_path); + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_downgoing_by_going_up); + + /* If this switch isn't a leaf switch: + Assign upgoing ports by stepping down, starting on THIS switch. */ + if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + { + __osm_ftree_fabric_route_upgoing_by_going_down( + p_ftree, + p_sw, /* local switch - used as a route-upgoing alg. 
start point */ + p_prev_sw, /* switch that we went up from (NULL means that we went down) */ + target_lid, /* LID that we're routing to */ + is_real_lid, /* whether this target LID is real or dummy */ + is_main_path); /* whether this path to HCA should by tracked by counters */ + } + + /* recursion stop condition - if it's a root switch, */ + if (p_sw->rank == 0) + { + OSM_LOG_EXIT(&(osm.log)); + return; + } + + /* Find the least loaded port of all the upgoing port groups + (in indexing order of the remote switches). */ + p_min_group = NULL; + p_min_port = NULL; + for (i = 0; i < p_sw->up_port_groups_num; i++) + { + p_group = p_sw->up_port_groups[i]; + + ports_num = cl_ptr_vector_get_size(&p_group->ports); + for (j = 0; j < ports_num; j++) + { + cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port); + if (!p_min_group) + { + /* first port that we're checking - use + it as a port with the lowest load */ + p_min_group = p_group; + p_min_port = p_port; + } + else + { + if ( p_port->counter_down < p_min_port->counter_down ) + { + /* this port is less loaded - use it as min */ + p_min_group = p_group; + p_min_port = p_port; + } + } + } + } + + /* At this point we have selected a group and port with the + lowest load of downgoing routes. + Set on the remote switch how to get to the target_lid - + set LFT(target_lid) on the remote switch to the remote port */ + p_remote_sw = p_min_group->remote_hca_or_sw.remote_sw; + + /* Four possible cases: + * + * 1. is_real_lid == TRUE && is_main_path == TRUE: + * - going UP(TRUE,TRUE) on selected min_group and min_port + * + promoting port counter + * + setting path in remote switch fwd tbl + * - going UP(TRUE,FALSE) on rest of the groups, each time on port 0 + * + NOT promoting port counter + * + setting path in remote switch fwd tbl if it hasn't been set yet + * + * 2. 
is_real_lid == TRUE && is_main_path == FALSE: + * - going UP(TRUE,FALSE) on ALL the groups, each time on port 0, + * but only if the remote (upper) switch hasn't been already + * configured for this target LID + * + NOT promoting port counter + * + setting path in remote switch fwd tbl if it hasn't been set yet + * + * 3. is_real_lid == FALSE && is_main_path == TRUE: + * - going UP(FALSE,TRUE) ONLY on selected min_group and min_port + * + promoting port counter + * + NOT setting path in remote switch fwd tbl + * + * 4. is_real_lid == FALSE && is_main_path == FALSE: + * - illegal state - we shouldn't get here + */ + + /* covering first half of case 1, and case 3 */ + if (is_main_path) + { + if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + { + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_downgoing_by_going_up: " + " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n", + (is_real_lid)? "real" : "DUMMY", + cl_ntoh16(target_lid), + __osm_ftree_tuple_to_str(p_sw->tuple), + __osm_ftree_tuple_to_str(p_remote_sw->tuple)); + } + /* The number of downgoing routes is tracked in the + p_port->counter_down counter of the port that belongs to + the lower side of the link (on switch with higher rank) */ + p_min_port->counter_down++; + if (is_real_lid) + { + __osm_ftree_sw_set_fwd_table_block(p_remote_sw, + cl_ntoh16(target_lid), + p_min_port->remote_port_num); + p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = p_min_port->remote_port_num; + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_downgoing_by_going_up: " + "Switch %s: set path to HCA LID 0x%x through port %u\n", + __osm_ftree_tuple_to_str(p_remote_sw->tuple), + cl_ntoh16(target_lid),p_min_port->remote_port_num); + } + + /* Recursion step: + Assign downgoing ports by stepping up, starting on REMOTE switch. */ + __osm_ftree_fabric_route_downgoing_by_going_up( + p_ftree, + p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ + p_sw, /* this switch - prev. 
position switch for the function */ + target_lid, /* LID that we're routing to */ + is_real_lid, /* whether this target LID is real or dummy */ + is_main_path); /* whether this is path to HCA that should by tracked by counters */ + } + + /* we're done for the third case */ + if (!is_real_lid) + { + OSM_LOG_EXIT(&(osm.log)); + return; + } + + /* What's left to do at this point: + * + * 1. is_real_lid == TRUE && is_main_path == TRUE: + * - going UP(TRUE,FALSE) on rest of the groups, each time on port 0, + * but only if the remote (upper) switch hasn't been already + * configured for this target LID + * + NOT promoting port counter + * + setting path in remote switch fwd tbl if it hasn't been set yet + * + * 2. is_real_lid == TRUE && is_main_path == FALSE: + * - going UP(TRUE,FALSE) on ALL the groups, each time on port 0, + * but only if the remote (upper) switch hasn't been already + * configured for this target LID + * + NOT promoting port counter + * + setting path in remote switch fwd tbl if it hasn't been set yet + * + * These two rules can be rephrased this way: + * - foreach UP port group + * + if remote switch has been set with the target LID + * - skip this port group + * + else + * - select port 0 + * - do NOT promote port counter + * - set path in remote switch fwd tbl + * - go UP(TRUE,FALSE) to the remote switch + */ + + for (i = 0; i < p_sw->up_port_groups_num; i++) + { + p_group = p_sw->up_port_groups[i]; + p_remote_sw = p_group->remote_hca_or_sw.remote_sw; + + /* skip if target lid has been already set on remote switch fwd tbl */ + if (__osm_ftree_sw_get_fwd_table_block( + p_remote_sw,cl_ntoh16(target_lid)) != OSM_NO_PATH) + continue; + + if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) + { + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_downgoing_by_going_up: " + " - Routing SECONDARY path for LID 0x%x: %s --> %s\n", + cl_ntoh16(target_lid), + __osm_ftree_tuple_to_str(p_sw->tuple), + 
__osm_ftree_tuple_to_str(p_remote_sw->tuple)); + } + + cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port); + __osm_ftree_sw_set_fwd_table_block(p_remote_sw, + cl_ntoh16(target_lid), + p_port->remote_port_num); + /* Recursion step: + Assign downgoing ports by stepping up, starting on REMOTE switch. */ + __osm_ftree_fabric_route_downgoing_by_going_up( + p_ftree, + p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */ + p_sw, /* this switch - prev. position switch for the function */ + target_lid, /* LID that we're routing to */ + TRUE, /* whether the target LID is real or dummy */ + FALSE); /* whether this is path to HCA that should by tracked by counters */ + } + + OSM_LOG_EXIT(&(osm.log)); +} /* ftree_fabric_route_downgoing_by_going_up() */ + +/***************************************************/ + +/* + * Pseudo code: + * foreach leaf switch (in indexing order) + * for each compute node (in indexing order) + * obtain the LID of the compute node + * set local LFT(LID) of the port connecting to compute node + * call assign-down-going-port-by-descending-up(TRUE,TRUE) on CURRENT switch + * for each MISSING compute node + * call assign-down-going-port-by-descending-up(FALSE,TRUE) on CURRENT switch + */ + +static void +__osm_ftree_fabric_route_to_hcas( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_sw; + ftree_port_group_t * p_group; + ftree_port_t * p_port; + uint32_t i; + uint32_t j; + ib_net16_t remote_lid; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_hcas); + + /* for each leaf switch (in indexing order) */ + for(i = 0; i < p_ftree->leaf_switches_num; i++) + { + p_sw = p_ftree->leaf_switches[i]; + + /* for each HCA connected to this switch */ + for (j = 0; j < p_sw->down_port_groups_num; j++) + { + /* obtain the LID of HCA port */ + p_group = p_sw->down_port_groups[j]; + remote_lid = p_group->remote_base_lid; + + /* set local LFT(LID) to the port that is connected to HCA */ + cl_ptr_vector_at(&p_group->ports, 0, (void 
**)&p_port); + __osm_ftree_sw_set_fwd_table_block(p_sw, + cl_ntoh16(remote_lid), + p_port->port_num); + osm_log(&osm.log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_to_hcas: " + "Switch %s: set path to HCA LID 0x%x through port %u\n", + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ntoh16(remote_lid), + p_port->port_num); + + /* assign downgoing ports by stepping up */ + __osm_ftree_fabric_route_downgoing_by_going_up( + p_ftree, + p_sw, /* local switch - used as a route-downgoing alg. start point */ + NULL, /* prev. position switch */ + remote_lid, /* LID that we're routing to */ + TRUE, /* whether this HCA LID is real or dummy */ + TRUE); /* whether this path to HCA should by tracked by counters */ + } + + /* We're done with the real HCAs. Now route the dummy HCAs that are missing.*/ + + if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num) + { + osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " + "Routing %u dummy HCAs\n", + p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); + for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++) + { + /* assign downgoing ports by stepping up */ + __osm_ftree_fabric_route_downgoing_by_going_up( + p_ftree, + p_sw, /* local switch - used as a route-downgoing alg. start point */ + NULL, /* prev. 
position switch */ + 0, /* LID that we're routing to - ignored for dummy HCA */ + FALSE, /* whether this HCA LID is real or dummy */ + TRUE); /* whether this path to HCA should be tracked by counters */ + } + } + } + /* done going through all the leaf switches */ + OSM_LOG_EXIT(&(osm.log)); +} /* __osm_ftree_fabric_route_to_hcas() */ + +/***************************************************/ + +/* + * Pseudo code: + * foreach switch in fabric + * obtain its LID + * set local LFT(LID) to port 0 + * call assign-down-going-port-by-descending-up(TRUE,FALSE) on CURRENT switch + * + * Routing to switch is similar to routing a REAL hca lid on SECONDARY path: + * - we should set fwd tables + * - we should NOT update port counters + */ + +static void +__osm_ftree_fabric_route_to_switches( + IN ftree_fabric_t * p_ftree) +{ + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_switches); + + p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) + { + p_sw = p_next_sw; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + + /* set local LFT(LID) to 0 (route to itself) */ + __osm_ftree_sw_set_fwd_table_block(p_sw, + cl_ntoh16(p_sw->base_lid), + 0); + + osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_switches: " + "Switch %s (LID 0x%x): routing switch-to-switch paths\n", + __osm_ftree_tuple_to_str(p_sw->tuple), + cl_ntoh16(p_sw->base_lid)); + + __osm_ftree_fabric_route_downgoing_by_going_up( + p_ftree, + p_sw, /* local switch - used as a route-downgoing alg. start point */ + NULL, /* prev. 
position switch */ + p_sw->base_lid, /* LID that we're routing to */ + TRUE, /* whether the target LID is real or dummy */ + FALSE); /* whether this path should be tracked by counters */ + } + + OSM_LOG_EXIT(&(osm.log)); +} /* __osm_ftree_fabric_route_to_switches() */ + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_fabric_populate_switches( + IN ftree_fabric_t * p_ftree) +{ + osm_switch_t * p_osm_sw; + osm_switch_t * p_next_osm_sw; + osm_opensm_t * p_osm = &osm; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_switches); + + p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_osm->subn.sw_guid_tbl); + while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl) ) + { + p_osm_sw = p_next_osm_sw; + p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item ); + __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw); + } + OSM_LOG_EXIT(&(osm.log)); + return 0; +} /* __osm_ftree_fabric_populate_switches() */ + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_fabric_populate_hcas( + IN ftree_fabric_t * p_ftree) +{ + osm_node_t * p_osm_node; + osm_node_t * p_next_osm_node; + osm_opensm_t * p_osm = &osm; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_hcas); + + p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_osm->subn.node_guid_tbl); + while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_osm->subn.node_guid_tbl) ) + { + p_osm_node = p_next_osm_node; + p_next_osm_node = (osm_node_t *)cl_qmap_next(&p_osm_node->map_item); + switch (osm_node_get_type(p_osm_node)) + { + case IB_NODE_TYPE_CA: + __osm_ftree_fabric_add_hca(p_ftree,p_osm_node); + break; + case IB_NODE_TYPE_ROUTER: + break; + case IB_NODE_TYPE_SWITCH: + /* all the switches are added separately */ + break; + default: + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_populate_hcas: ERR AB0E: " + "Node GUID 
0x%016" PRIx64 " - Unknown node type: %s\n", + cl_ntoh64(osm_node_get_node_guid(p_osm_node)), + ib_get_node_type_str(osm_node_get_type(p_osm_node))); + OSM_LOG_EXIT(&(osm.log)); + return -1; + } + } + + OSM_LOG_EXIT(&(osm.log)); + return 0; +} /* __osm_ftree_fabric_populate_hcas() */ + +/*************************************************** + ***************************************************/ + +static void +__osm_ftree_rank_from_switch( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_starting_sw) +{ + ftree_sw_t * p_sw; + ftree_sw_t * p_remote_sw; + osm_node_t * p_node; + osm_node_t * p_remote_node; + osm_physp_t * p_osm_port; + uint16_t i; + cl_list_t bfs_list; + ftree_sw_tbl_element_t * p_sw_tbl_element = NULL; + + p_starting_sw->rank = 0; + + /* Run BFS scan of the tree, starting from this switch */ + + cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl)); + cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_starting_sw)->map_item); + + while (!cl_is_list_empty(&bfs_list)) + { + p_sw_tbl_element = (ftree_sw_tbl_element_t *)cl_list_remove_head(&bfs_list); + p_sw = p_sw_tbl_element->p_sw; + __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element); + + p_node = osm_switch_get_node_ptr(p_sw->p_osm_sw); + + /* note: skipping port 0 on switches */ + for (i = 1; i < osm_node_get_num_physp(p_node); i++) + { + p_osm_port = osm_node_get_physp_ptr(p_node,i); + if (!osm_physp_is_valid(p_osm_port)) + continue; + if (!osm_link_is_healthy(p_osm_port)) + continue; + + p_remote_node = osm_node_get_remote_node(p_node,i,NULL); + if (!p_remote_node) + continue; + if (osm_node_get_type(p_remote_node) != IB_NODE_TYPE_SWITCH) + continue; + + p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl, + osm_node_get_node_guid(p_remote_node)); + if (p_remote_sw == (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)) + { + /* remote node is not a switch */ + continue; + } + if (__osm_ftree_sw_ranked(p_remote_sw) && p_remote_sw->rank <= (p_sw->rank + 1)) + continue; + + /* rank 
the remote switch and add it to the BFS list */ + p_remote_sw->rank = p_sw->rank + 1; + cl_list_insert_tail(&bfs_list, + &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item); + } + } +} /* __osm_ftree_rank_from_switch() */ + + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_rank_switches_from_hca( + IN ftree_fabric_t * p_ftree, + IN ftree_hca_t * p_hca) +{ + ftree_sw_t * p_sw; + osm_node_t * p_osm_node = p_hca->p_osm_node; + osm_node_t * p_remote_osm_node; + osm_physp_t * p_osm_port; + static uint16_t i = 0; + int res = 0; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_rank_switches_from_hca); + + for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++) + { + p_osm_port = osm_node_get_physp_ptr(p_osm_node,i); + if (!osm_physp_is_valid(p_osm_port)) + continue; + if (!osm_link_is_healthy(p_osm_port)) + continue; + + p_remote_osm_node = osm_node_get_remote_node(p_osm_node,i,NULL); + + switch (osm_node_get_type(p_remote_osm_node)) + { + case IB_NODE_TYPE_CA: + /* HCA connected directly to another HCA - not FatTree */ + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB0F: " + "HCA connected directly to another HCA: " + "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", + cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), + cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node))); + res = -1; + goto Exit; + + case IB_NODE_TYPE_ROUTER: + /* leaving this port - proceeding to the next one */ + continue; + + case IB_NODE_TYPE_SWITCH: + /* continue with this port */ + break; + + default: + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB10: " + "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", + cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)), + ib_get_node_type_str(osm_node_get_type(p_remote_osm_node))); + res = -1; + goto Exit; + } + + /* remote node is switch */ + + p_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl, + 
p_osm_port->p_remote_physp->p_node->node_info.node_guid); + + CL_ASSERT(p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl)); + + if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0) + continue; + + osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_rank_switches_from_hca: " + "Marking rank of switch that is directly connected to HCA:\n" + " - HCA guid : 0x%016" PRIx64 "\n" + " - Switch guid: 0x%016" PRIx64 "\n" + " - Switch LID : 0x%x\n", + cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), + cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), + cl_ntoh16(p_sw->base_lid)); + __osm_ftree_rank_from_switch(p_ftree, p_sw); + } + + Exit: + OSM_LOG_EXIT(&(osm.log)); + return res; +} /* __osm_ftree_rank_switches_from_hca() */ + +/***************************************************/ + +static void +__osm_ftree_sw_reverse_rank( + IN cl_map_item_t* const p_map_item, + IN void *context) +{ + ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; + ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item; + p_sw->rank = __osm_ftree_fabric_get_rank(p_ftree) - p_sw->rank - 1; +} + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_fabric_construct_hca_ports( + IN ftree_fabric_t * p_ftree, + IN ftree_hca_t * p_hca) +{ + ftree_sw_t * p_remote_sw; + osm_node_t * p_node = p_hca->p_osm_node; + osm_node_t * p_remote_node; + uint8_t remote_node_type; + ib_net64_t remote_node_guid; + osm_physp_t * p_remote_osm_port; + uint16_t i; + uint8_t remote_port_num; + int res = 0; + + for (i = 0; i < osm_node_get_num_physp(p_node); i++) + { + osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i); + + if (!osm_physp_is_valid(p_osm_port)) + continue; + if (!osm_link_is_healthy(p_osm_port)) + continue; + + p_remote_osm_port = osm_physp_get_remote(p_osm_port); + p_remote_node = osm_node_get_remote_node(p_node,i,&remote_port_num); + + if (!p_remote_osm_port) + continue; + + remote_node_type = 
osm_node_get_type(p_remote_node); + remote_node_guid = osm_node_get_node_guid(p_remote_node); + + switch (remote_node_type) + { + case IB_NODE_TYPE_ROUTER: + /* leaving this port - proceeding to the next one */ + continue; + + case IB_NODE_TYPE_CA: + /* HCA connected directly to another HCA - not FatTree */ + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB11: " + "HCA connected directly to another HCA: " + "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", + cl_ntoh64(osm_node_get_node_guid(p_node)), + cl_ntoh64(remote_node_guid)); + res = -1; + goto Exit; + + case IB_NODE_TYPE_SWITCH: + /* continue with this port */ + break; + + default: + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB12: " + "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", + cl_ntoh64(remote_node_guid), + ib_get_node_type_str(remote_node_type)); + res = -1; + goto Exit; + } + + /* remote node is switch */ + + p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid); + CL_ASSERT( p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ); + CL_ASSERT( (p_remote_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree) ); + + __osm_ftree_hca_add_port( + p_hca, /* local ftree_hca object */ + i, /* local port number */ + remote_port_num, /* remote port number */ + osm_node_get_base_lid(p_node, i), /* local lid */ + osm_node_get_lmc(p_node, i), /* local lmc */ + osm_node_get_base_lid(p_remote_node, 0), /* remote lid */ + osm_node_get_lmc(p_remote_node, 0), /* remote lmc */ + osm_physp_get_port_guid(p_osm_port), /* local port guid */ + osm_physp_get_port_guid(p_remote_osm_port),/* remote port guid */ + remote_node_guid, /* remote node guid */ + remote_node_type, /* remote node type */ + (void *) p_remote_sw); /* remote ftree_hca/sw object */ + } + + Exit: + return res; +} /* __osm_ftree_fabric_construct_hca_ports() */ + +/*************************************************** + 
***************************************************/ + +static int +__osm_ftree_fabric_construct_sw_ports( + IN ftree_fabric_t * p_ftree, + IN ftree_sw_t * p_sw) +{ + ftree_hca_t * p_remote_hca; + ftree_sw_t * p_remote_sw; + osm_node_t * p_node = osm_switch_get_node_ptr(p_sw->p_osm_sw); + osm_node_t * p_remote_node; + ib_net16_t remote_base_lid; + uint8_t remote_lmc; + uint8_t remote_node_type; + ib_net64_t remote_node_guid; + osm_physp_t * p_remote_osm_port; + ftree_direction_t direction; + void * p_remote_hca_or_sw; + uint16_t i; + uint8_t remote_port_num; + int res = 0; + + CL_ASSERT(osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH); + + for (i = 0; i < osm_node_get_num_physp(p_node); i++) + { + osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i); + + if (!osm_physp_is_valid(p_osm_port)) + continue; + if (!osm_link_is_healthy(p_osm_port)) + continue; + + p_remote_osm_port = osm_physp_get_remote(p_osm_port); + p_remote_node = osm_node_get_remote_node(p_node,i,&remote_port_num); + + if (!p_remote_osm_port) + continue; + + remote_node_type = osm_node_get_type(p_remote_node); + remote_node_guid = osm_node_get_node_guid(p_remote_node); + + switch (remote_node_type) + { + case IB_NODE_TYPE_ROUTER: + /* leaving this port - proceeding to the next one */ + continue; + + case IB_NODE_TYPE_CA: + /* switch connected to hca */ + + CL_ASSERT((p_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree)); + + p_remote_hca = (ftree_hca_t *)cl_qmap_get(&p_ftree->hca_tbl,remote_node_guid); + CL_ASSERT(p_remote_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl)); + + p_remote_hca_or_sw = (void *)p_remote_hca; + direction = FTREE_DIRECTION_DOWN; + + remote_base_lid = osm_physp_get_base_lid(p_remote_osm_port); + remote_lmc = osm_physp_get_lmc(p_remote_osm_port); + break; + + case IB_NODE_TYPE_SWITCH: + /* switch connected to another switch */ + + p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid); + CL_ASSERT(p_remote_sw != (ftree_sw_t 
*)cl_qmap_end(&p_ftree->sw_tbl)); + CL_ASSERT(abs(p_sw->rank - p_remote_sw->rank) == 1); + p_remote_hca_or_sw = (void *)p_remote_sw; + + if (p_sw->rank > p_remote_sw->rank) + direction = FTREE_DIRECTION_UP; + else + direction = FTREE_DIRECTION_DOWN; + + /* switch LID is only in port 0 port_info structure */ + remote_base_lid = osm_node_get_base_lid(p_remote_node, 0); + remote_lmc = osm_node_get_lmc(p_remote_node, 0); + + break; + + default: + osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_sw_ports: ERR AB13: " + "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", + cl_ntoh64(remote_node_guid), + ib_get_node_type_str(remote_node_type)); + res = -1; + goto Exit; + } + __osm_ftree_sw_add_port( + p_sw, /* local ftree_sw object */ + i, /* local port number */ + remote_port_num, /* remote port number */ + p_sw->base_lid, /* local lid */ + p_sw->lmc, /* local lmc */ + remote_base_lid, /* remote lid */ + remote_lmc, /* remote lmc */ + osm_physp_get_port_guid(p_osm_port), /* local port guid */ + osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */ + remote_node_guid, /* remote node guid */ + remote_node_type, /* remote node type */ + p_remote_hca_or_sw, /* remote ftree_hca/sw object */ + direction); /* port direction (up or down) */ + } + + Exit: + return res; +} /* __osm_ftree_fabric_construct_sw_ports() */ + +/*************************************************** + ***************************************************/ + +/* ToDo: improve ranking algorithm complexity + by propagating BFS from more nodes */ +static int +__osm_ftree_fabric_perform_ranking( + IN ftree_fabric_t * p_ftree) +{ + ftree_hca_t * p_hca; + ftree_hca_t * p_next_hca; + int res = 0; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_perform_ranking); + + /* Mark REVERSED rank of all the switches in the subnet. + Start from switches that are connected to hca's, and + scan all the switches in the subnet. 
*/ + p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) ) + { + p_hca = p_next_hca; + p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item ); + if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0) + { + res = -1; + osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB14: " + "Subnet ranking failed - subnet is not FatTree"); + goto Exit; + } + } + + /* calculate and set FatTree rank */ + __osm_ftree_fabric_calculate_rank(p_ftree); + osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_perform_ranking: " + "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree)); + + /* fix ranking of the switches by reversing the ranking direction */ + cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree); + + if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || + __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK ) + { + osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: " + "Tree rank is %u (should be between %u and %u)\n", + __osm_ftree_fabric_get_rank(p_ftree), + FAT_TREE_MIN_RANK, + FAT_TREE_MAX_RANK); + res = -1; + goto Exit; + } + + Exit: + OSM_LOG_EXIT(&(osm.log)); + return res; +} /* __osm_ftree_fabric_perform_ranking() */ + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_fabric_populate_ports( + IN ftree_fabric_t * p_ftree) +{ + ftree_hca_t * p_hca; + ftree_hca_t * p_next_hca; + ftree_sw_t * p_sw; + ftree_sw_t * p_next_sw; + int res = 0; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_ports); + + p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); + while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) ) + { + p_hca = p_next_hca; + p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item ); + if (__osm_ftree_fabric_construct_hca_ports(p_ftree,p_hca) != 0) + { + 
res = -1; + goto Exit; + } + } + + p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); + while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) ) + { + p_sw = p_next_sw; + p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item ); + if (__osm_ftree_fabric_construct_sw_ports(p_ftree,p_sw) != 0) + { + res = -1; + goto Exit; + } + } + Exit: + OSM_LOG_EXIT(&(osm.log)); + return res; +} /* __osm_ftree_fabric_populate_ports() */ + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_do_routing(void *context) +{ + ftree_fabric_t * p_ftree = context; + int status = 0; + + OSM_LOG_ENTER(&(osm.log), __osm_ftree_do_routing); + + if ( cl_qmap_count(&osm.subn.sw_guid_tbl) < 2 ) + { + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric has %u switches - topology is not fat-tree.\n" + "Falling back to default routing.\n", + cl_qmap_count(&osm.subn.sw_guid_tbl)); + status = -1; + goto Exit; + } + + if ( (cl_qmap_count(&osm.subn.node_guid_tbl) - + cl_qmap_count(&osm.subn.sw_guid_tbl)) < 2) + { + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric has %u nodes (%u switches) - topology is not fat-tree.\n" + "Falling back to default routing.\n", + cl_qmap_count(&osm.subn.node_guid_tbl), + cl_qmap_count(&osm.subn.sw_guid_tbl)); + status = -1; + goto Exit; + } + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n" + " |------------------------------|\n" + " |- Starting FatTree Routing -|\n" + " |------------------------------|\n\n"); + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Populating FatTree switch table\n"); + /* ToDo: now that the pointer from node to switch exists, + no need to fill the switch table in a separate loop */ + if (__osm_ftree_fabric_populate_switches(p_ftree) != 0) + { + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric topology is not fat-tree - " + "falling back to default routing\n"); + status = -1; + goto Exit; + } + + osm_log(&osm.log, 
OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Populating FatTree HCA table\n"); + if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0) + { + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric topology is not fat-tree - " + "falling back to default routing\n"); + status = -1; + goto Exit; + } + + if (cl_qmap_count(&p_ftree->hca_tbl) < 2) + { + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric has %u HCAs - topology is not fat-tree.\n" + "Falling back to default routing.\n", + cl_qmap_count(&p_ftree->hca_tbl)); + status = -1; + goto Exit; + } + + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Ranking FatTree\n"); + if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0) + { + if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK) + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric rank is %u (>%u) - " + "fat-tree routing falls back to default routing\n", + __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK); + status = -1; + goto Exit; + } + + /* For each hca and switch, construct array of ports. + This is done after the whole FatTree data structure is ready, because + we want the ports to have pointers to ftree_{sw,hca}_t objects.*/ + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Populating HCA & switch ports\n"); + if (__osm_ftree_fabric_populate_ports(p_ftree) != 0) + { + osm_log(&osm.log, OSM_LOG_SYS, + "Fabric topology is not a fat-tree - " + "routing falls back to default routing\n"); + status = -1; + goto Exit; + } + + /* Assign index to all the switches and hca's in the fabric. + This function also sorts all the port arrays of the switches + by the remote switch index, creates a leaf switch array + sorted by the switch index, and tracks the maximal number of + hcas per leaf switch. 
*/ + __osm_ftree_fabric_make_indexing(p_ftree); + + /* print general info about fabric topology */ + __osm_ftree_fabric_dump_general_info(p_ftree); + + /* dump full tree topology */ + if (osm_log_is_active(&osm.log, OSM_LOG_DEBUG)) + __osm_ftree_fabric_dump(p_ftree); + + if (! __osm_ftree_fabric_validate_topology(p_ftree)) + { + status = -1; + goto Exit; + } + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Filling switch forwarding tables for routes to HCAs\n"); + __osm_ftree_fabric_route_to_hcas(p_ftree); + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Filling switch forwarding tables for switch-to-switch paths\n"); + __osm_ftree_fabric_route_to_switches(p_ftree); + + /* for each switch, set its fwd table */ + cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, NULL); + + /* write out hca ordering file */ + __osm_ftree_fabric_dump_hca_ordering(p_ftree); + + Exit: + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Clearing FatTree Fabric data structures\n"); + __osm_ftree_fabric_clear(p_ftree); + + osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n" + " |---------------------------------------|\n" + " |- Done FatTree Routing (status = %d) -|\n" + " |---------------------------------------|\n\n", status); + + OSM_LOG_EXIT(&(osm.log)); + return status; +} + +/*************************************************** + ***************************************************/ + +static void +__osm_ftree_delete(void * context) +{ + ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; + if (!p_ftree) + return; + + __osm_ftree_fabric_destroy(p_ftree); + +} + +/*************************************************** + ***************************************************/ + +int osm_ucast_ftree_setup(osm_opensm_t * p_osm) +{ + ftree_fabric_t * p_ftree = __osm_ftree_fabric_create(); + if (!p_ftree) + return -1; + + p_osm->routing_engine.context = (void *)p_ftree; + p_osm->routing_engine.ucast_build_fwd_tables 
= __osm_ftree_do_routing;
+  p_osm->routing_engine.delete = __osm_ftree_delete;
+  /* ToDo: fat-tree routing doesn't use min_hop tables, so we
+     shouldn't fill them (p_osm->routing_engine.build_lid_matrices) */
+  return 0;
+}
+
+/***************************************************
+ ***************************************************/
+
-- 
1.4.4.1.GIT

From bugzilla-daemon at openib.org Thu Dec 14 15:59:45 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 15:59:45 -0800 (PST)
Subject: [openib-general] [Bug 172] Need an interface to load alternate path to RC QP
Message-ID: <20061214235945.302852283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=172

venkatesh.babu at 3leafnetworks.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Thu Dec 14 16:00:13 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 16:00:13 -0800 (PST)
Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails with -EINVAL
Message-ID: <20061215000013.BB99C2283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=160

venkatesh.babu at 3leafnetworks.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From sean.hefty at intel.com Thu Dec 14 16:18:55 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 16:18:55 -0800
Subject: [openib-general] [RFC] [PATCH 0/1] ib_sa: add InformInfo registration for Notice reports
Message-ID: <000501c71fde$9c1c7560$8698070a@amr.corp.intel.com>

The following patch adds support to the ib_sa to allow users to register for
asynchronous events (traps and reports) from the SA. The approach is similar
to that used by QLogic and suggested by Venkatesh, with the implementation
based on the approach used by multicast registration.

Users register to receive notices for a particular generic trap number. The
notice sub-module tracks the number of registration requests for a given trap
number. When necessary, un/registration requests are sent to the SA.

During initialization, the ib_sa module registers to receive unsolicited
notice reports. When a notice is received, it is given to the notice
sub-module for dispatching. The ib_sa generates a response to the notice
report.

This patch is also available from my rdma_dev git tree, under the informinfo
branch.

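The per-trap-number bookkeeping described above can be illustrated with a small userspace sketch in C. This is not code from the patch: every name in it (notice_table, trap_group, notice_register, notice_unregister, notice_dispatch_count, sa_request, MAX_GROUPS) is hypothetical. It assumes a fixed-size table instead of the patch's per-port rb-tree and leaves out the locking, work queue and asynchronous SA replies. It shows only the counting rule: the first registration for a trap number calls for a subscribe request to the SA, and the last unregistration calls for an unsubscribe request.

```c
/* Illustrative userspace sketch only; not taken from the patch.
 * All names below are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 16	/* arbitrary table size for the sketch */

/* What the caller would have to send to the SA after a call. */
enum sa_request {
	SA_REQ_ERROR = -1,	/* table full, or trap number not registered */
	SA_REQ_NONE,		/* only the local count changed */
	SA_REQ_SUBSCRIBE,	/* first user of this trap number */
	SA_REQ_UNSUBSCRIBE	/* last user of this trap number left */
};

struct trap_group {
	uint16_t trap_number;
	int members;		/* 0 means the slot is free */
};

struct notice_table {
	struct trap_group groups[MAX_GROUPS];
};

/* Return the in-use group for trap_number, or NULL if there is none. */
struct trap_group *find_group(struct notice_table *t, uint16_t trap_number)
{
	size_t i;

	for (i = 0; i < MAX_GROUPS; i++)
		if (t->groups[i].members > 0 &&
		    t->groups[i].trap_number == trap_number)
			return &t->groups[i];
	return NULL;
}

/* Add one user; report whether the SA needs a subscribe request. */
enum sa_request notice_register(struct notice_table *t, uint16_t trap_number)
{
	struct trap_group *group;
	size_t i;

	assert(t != NULL);
	group = find_group(t, trap_number);
	if (group) {
		group->members++;
		return SA_REQ_NONE;
	}
	for (i = 0; i < MAX_GROUPS; i++) {
		if (t->groups[i].members == 0) {
			t->groups[i].trap_number = trap_number;
			t->groups[i].members = 1;
			return SA_REQ_SUBSCRIBE;
		}
	}
	return SA_REQ_ERROR;
}

/* Drop one user; report whether the SA needs an unsubscribe request. */
enum sa_request notice_unregister(struct notice_table *t, uint16_t trap_number)
{
	struct trap_group *group;

	assert(t != NULL);
	group = find_group(t, trap_number);
	if (!group)
		return SA_REQ_ERROR;
	if (--group->members == 0)
		return SA_REQ_UNSUBSCRIBE;
	return SA_REQ_NONE;
}

/* Number of registered users a notice for trap_number would reach. */
int notice_dispatch_count(struct notice_table *t, uint16_t trap_number)
{
	struct trap_group *group;

	assert(t != NULL);
	group = find_group(t, trap_number);
	return group ? group->members : 0;
}
```

With a zero-initialized table, two notice_register() calls for the same trap number return SA_REQ_SUBSCRIBE and then SA_REQ_NONE. The matching notice_unregister() calls return SA_REQ_NONE and then SA_REQ_UNSUBSCRIBE, so the SA would see one request per transition rather than one per user.
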
Signed-off-by: Sean Hefty From sean.hefty at intel.com Thu Dec 14 16:20:20 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 14 Dec 2006 16:20:20 -0800 Subject: [openib-general] [RFC] [PATCH 1/1] ib_sa: add InformInfo registration for Notice reports In-Reply-To: <000501c71fde$9c1c7560$8698070a@amr.corp.intel.com> Message-ID: <000601c71fde$ce561fe0$8698070a@amr.corp.intel.com> diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index 189e5d4..2e9c4b2 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -12,7 +12,7 @@ ib_core-y := packer.o ud_header.o verb ib_mad-y := mad.o smi.o agent.o mad_rmpp.o -ib_sa-y := sa_query.o multicast.o +ib_sa-y := sa_query.o multicast.o notice.o ib_cm-y := cm.o diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c new file mode 100644 index 0000000..038878d --- /dev/null +++ b/drivers/infiniband/core/notice.c @@ -0,0 +1,750 @@ +/* + * Copyright (c) 2006 Intel Corporation.  All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "sa.h" + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling"); +MODULE_LICENSE("Dual BSD/GPL"); + +static void inform_add_one(struct ib_device *device); +static void inform_remove_one(struct ib_device *device); + +static struct ib_client inform_client = { + .name = "ib_notice", + .add = inform_add_one, + .remove = inform_remove_one +}; + +static struct ib_sa_client sa_client; +static struct ib_event_handler event_handler; +static struct workqueue_struct *inform_wq; + +struct inform_device; + +struct inform_port { + struct inform_device *dev; + spinlock_t lock; + struct rb_root table; + atomic_t refcount; + struct completion comp; + u8 port_num; +}; + +struct inform_device { + struct ib_device *device; + int start_port; + int end_port; + struct inform_port port[0]; +}; + +enum inform_state { + INFORM_IDLE, + INFORM_REGISTERING, + INFORM_MEMBER, + INFORM_BUSY, + INFORM_ERROR +}; + +struct inform_member; + +struct inform_group { + u16 trap_number; + struct rb_node node; + struct inform_port *port; + spinlock_t lock; + struct work_struct work; + struct list_head pending_list; + struct list_head active_list; + struct list_head notice_list; + struct inform_member *last_join; + int members; + enum inform_state join_state; /* State relative to SA */ + atomic_t refcount; + enum inform_state state; + struct ib_sa_query *query; + int query_id; +}; + 
+struct inform_member { + struct ib_inform_info info; + struct ib_sa_client *client; + struct inform_group *group; + struct list_head list; + enum inform_state state; + atomic_t refcount; + struct completion comp; +}; + +struct inform_notice { + struct list_head list; + struct ib_sa_notice notice; +}; + +static void reg_handler(int status, struct ib_sa_inform *inform, + void *context); +static void unreg_handler(int status, struct ib_sa_inform *inform, + void *context); + +static struct inform_group *inform_find(struct inform_port *port, + u16 trap_number) +{ + struct rb_node *node = port->table.rb_node; + struct inform_group *group; + + while (node) { + group = rb_entry(node, struct inform_group, node); + if (trap_number < group->trap_number) + node = node->rb_left; + else if (trap_number > group->trap_number) + node = node->rb_right; + else + return group; + } + return NULL; +} + +static struct inform_group *inform_insert(struct inform_port *port, + struct inform_group *group) +{ + struct rb_node **link = &port->table.rb_node; + struct rb_node *parent = NULL; + struct inform_group *cur_group; + + while (*link) { + parent = *link; + cur_group = rb_entry(parent, struct inform_group, node); + if (group->trap_number < cur_group->trap_number) + link = &(*link)->rb_left; + else if (group->trap_number > cur_group->trap_number) + link = &(*link)->rb_right; + else + return cur_group; + } + rb_link_node(&group->node, parent, link); + rb_insert_color(&group->node, &port->table); + return NULL; +} + +static void deref_port(struct inform_port *port) +{ + if (atomic_dec_and_test(&port->refcount)) + complete(&port->comp); +} + +static void release_group(struct inform_group *group) +{ + struct inform_port *port = group->port; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + if (atomic_dec_and_test(&group->refcount)) { + rb_erase(&group->node, &port->table); + spin_unlock_irqrestore(&port->lock, flags); + kfree(group); + deref_port(port); + } else + 
spin_unlock_irqrestore(&port->lock, flags); +} + +static void deref_member(struct inform_member *member) +{ + if (atomic_dec_and_test(&member->refcount)) + complete(&member->comp); +} + +static void queue_reg(struct inform_member *member) +{ + struct inform_group *group = member->group; + unsigned long flags; + + spin_lock_irqsave(&group->lock, flags); + list_add(&member->list, &group->pending_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + spin_unlock_irqrestore(&group->lock, flags); +} + +static int send_reg(struct inform_group *group, struct inform_member *member) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.subscribe = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number); + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + group->last_join = member; + ret = ib_sa_informinfo_query(&sa_client, port->dev->device, + port->port_num, IB_MGMT_METHOD_SET, &inform, + 0, 3000, GFP_KERNEL, reg_handler, group, + &group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static int send_unreg(struct inform_group *group) +{ + struct inform_port *port = group->port; + struct ib_sa_inform inform; + int ret; + + memset(&inform, 0, sizeof inform); + inform.lid_range_begin = cpu_to_be16(0xFFFF); + inform.is_generic = 1; + inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL); + inform.trap.generic.trap_num = cpu_to_be16(group->trap_number); + inform.trap.generic.qpn = IB_QP1; + inform.trap.generic.resp_time = 19; + inform.trap.generic.producer_type = + cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL); + + ret = ib_sa_informinfo_query(&sa_client, 
port->dev->device, + port->port_num, IB_MGMT_METHOD_SET, + &inform, 0, 3000, GFP_KERNEL, + unreg_handler, group, &group->query); + if (ret >= 0) { + group->query_id = ret; + ret = 0; + } + return ret; +} + +static void join_group(struct inform_group *group, struct inform_member *member) +{ + member->state = INFORM_MEMBER; + group->members++; + list_move(&member->list, &group->active_list); +} + +static int fail_join(struct inform_group *group, struct inform_member *member, + int status) +{ + spin_lock_irq(&group->lock); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + return member->info.callback(status, &member->info, NULL); +} + +static void process_group_error(struct inform_group *group) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + while (!list_empty(&group->active_list)) { + member = list_entry(group->active_list.next, + struct inform_member, list); + atomic_inc(&member->refcount); + list_del_init(&member->list); + group->members--; + member->state = INFORM_ERROR; + spin_unlock_irq(&group->lock); + + ret = member->info.callback(-ENETRESET, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + group->join_state = INFORM_IDLE; + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); +} + +/* + * Report a notice to all active subscribers. We use a temporary list to + * handle unsubscription requests while the notice is being reported, which + * avoids holding the group lock while in the user's callback. 
+ */ +static void process_notice(struct inform_group *group, + struct inform_notice *info_notice) +{ + struct inform_member *member; + struct list_head list; + int ret; + + INIT_LIST_HEAD(&list); + + spin_lock_irq(&group->lock); + list_splice_init(&group->active_list, &list); + while (!list_empty(&list)) { + + member = list_entry(list.next, struct inform_member, list); + atomic_inc(&member->refcount); + list_move(&member->list, &group->active_list); + spin_unlock_irq(&group->lock); + + ret = member->info.callback(0, &member->info, + &info_notice->notice); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + spin_unlock_irq(&group->lock); +} + +static void inform_work_handler(void *data) +{ + struct inform_group *group = data; + struct inform_member *member; + struct ib_inform_info *info; + struct inform_notice *info_notice; + int status, ret; + +retest: + spin_lock_irq(&group->lock); + while (!list_empty(&group->pending_list) || + !list_empty(&group->notice_list) || + (group->state == INFORM_ERROR)) { + + if (group->state == INFORM_ERROR) { + spin_unlock_irq(&group->lock); + process_group_error(group); + goto retest; + } + + if (!list_empty(&group->notice_list)) { + info_notice = list_entry(group->notice_list.next, + struct inform_notice, list); + list_del(&info_notice->list); + spin_unlock_irq(&group->lock); + process_notice(group, info_notice); + kfree(info_notice); + goto retest; + } + + member = list_entry(group->pending_list.next, + struct inform_member, list); + info = &member->info; + atomic_inc(&member->refcount); + + if (group->join_state == INFORM_MEMBER) { + join_group(group, member); + spin_unlock_irq(&group->lock); + ret = info->callback(0, info, NULL); + } else { + spin_unlock_irq(&group->lock); + status = send_reg(group, member); + if (!status) { + deref_member(member); + return; + } + ret = fail_join(group, member, status); + } + + deref_member(member); + if (ret) + 
ib_sa_unregister_inform_info(&member->info); + spin_lock_irq(&group->lock); + } + + if (!group->members && (group->join_state == INFORM_MEMBER)) { + group->join_state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + if (send_unreg(group)) + goto retest; + } else { + group->state = INFORM_IDLE; + spin_unlock_irq(&group->lock); + release_group(group); + } +} + +/* + * Fail a join request if it is still active - at the head of the pending queue. + */ +static void process_join_error(struct inform_group *group, int status) +{ + struct inform_member *member; + int ret; + + spin_lock_irq(&group->lock); + member = list_entry(group->pending_list.next, + struct inform_member, list); + if (group->last_join == member) { + atomic_inc(&member->refcount); + list_del_init(&member->list); + spin_unlock_irq(&group->lock); + ret = member->info.callback(status, &member->info, NULL); + deref_member(member); + if (ret) + ib_sa_unregister_inform_info(&member->info); + } else + spin_unlock_irq(&group->lock); +} + +static void reg_handler(int status, struct ib_sa_inform *inform, void *context) +{ + struct inform_group *group = context; + + if (status) + process_join_error(group, status); + else + group->join_state = INFORM_MEMBER; + + inform_work_handler(group); +} + +static void unreg_handler(int status, struct ib_sa_inform *rec, void *context) +{ + inform_work_handler(context); +} + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice) +{ + struct inform_device *dev; + struct inform_port *port; + struct inform_group *group; + struct inform_notice *info_notice; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return 0; /* No one to give notice to. */ + + port = &dev->port[port_num - dev->start_port]; + spin_lock_irq(&port->lock); + group = inform_find(port, __be16_to_cpu(notice->trap. 
+ generic.trap_num)); + if (!group) { + spin_unlock_irq(&port->lock); + return 0; + } + + atomic_inc(&group->refcount); + spin_unlock_irq(&port->lock); + + info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL); + if (!info_notice) { + release_group(group); + return -ENOMEM; + } + + info_notice->notice = *notice; + + spin_lock_irq(&group->lock); + list_add(&info_notice->list, &group->notice_list); + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + inform_work_handler(group); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + return 0; +} + +static struct inform_group *acquire_group(struct inform_port *port, + u16 trap_number, gfp_t gfp_mask) +{ + struct inform_group *group, *cur_group; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + group = inform_find(port, trap_number); + if (group) + goto found; + spin_unlock_irqrestore(&port->lock, flags); + + group = kzalloc(sizeof *group, gfp_mask); + if (!group) + return NULL; + + group->port = port; + group->trap_number = trap_number; + INIT_LIST_HEAD(&group->pending_list); + INIT_LIST_HEAD(&group->active_list); + INIT_LIST_HEAD(&group->notice_list); + INIT_WORK(&group->work, inform_work_handler, group); + spin_lock_init(&group->lock); + + spin_lock_irqsave(&port->lock, flags); + cur_group = inform_insert(port, group); + if (cur_group) { + kfree(group); + group = cur_group; + } else + atomic_inc(&port->refcount); +found: + atomic_inc(&group->refcount); + spin_unlock_irqrestore(&port->lock, flags); + return group; +} + +/* + * We serialize all join requests to a single group to make our lives much + * easier. Otherwise, two users could try to join the same group + * simultaneously, with different configurations, one could leave while the + * join is in progress, etc., which makes locking around error recovery + * difficult. 
+ */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context) +{ + struct inform_device *dev; + struct inform_member *member; + struct ib_inform_info *info; + int ret; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return ERR_PTR(-ENODEV); + + member = kzalloc(sizeof *member, gfp_mask); + if (!member) + return ERR_PTR(-ENOMEM); + + ib_sa_client_get(client); + member->client = client; + member->info.trap_number = trap_number; + member->info.callback = callback; + member->info.context = context; + init_completion(&member->comp); + atomic_set(&member->refcount, 1); + member->state = INFORM_REGISTERING; + + member->group = acquire_group(&dev->port[port_num - dev->start_port], + trap_number, gfp_mask); + if (!member->group) { + ret = -ENOMEM; + goto err; + } + + /* + * The user will get the info structure in their callback. They + * could then free the info structure before we can return from + * this routine. So we save the pointer to return before queuing + * any callback. 
+ */ + info = &member->info; + queue_reg(member); + return info; + +err: + ib_sa_client_put(member->client); + kfree(member); + return ERR_PTR(ret); +} +EXPORT_SYMBOL(ib_sa_register_inform_info); + +void ib_sa_unregister_inform_info(struct ib_inform_info *info) +{ + struct inform_member *member; + struct inform_group *group; + + member = container_of(info, struct inform_member, info); + group = member->group; + + spin_lock_irq(&group->lock); + if (member->state == INFORM_MEMBER) + group->members--; + + list_del_init(&member->list); + + if (group->state == INFORM_IDLE) { + group->state = INFORM_BUSY; + spin_unlock_irq(&group->lock); + /* Continue to hold reference on group until callback */ + queue_work(inform_wq, &group->work); + } else { + spin_unlock_irq(&group->lock); + release_group(group); + } + + deref_member(member); + wait_for_completion(&member->comp); + ib_sa_client_put(member->client); + kfree(member); +} +EXPORT_SYMBOL(ib_sa_unregister_inform_info); + +static void inform_groups_lost(struct inform_port *port) +{ + struct inform_group *group; + struct rb_node *node; + unsigned long flags; + + spin_lock_irqsave(&port->lock, flags); + for (node = rb_first(&port->table); node; node = rb_next(node)) { + group = rb_entry(node, struct inform_group, node); + spin_lock(&group->lock); + if (group->state == INFORM_IDLE) { + atomic_inc(&group->refcount); + queue_work(inform_wq, &group->work); + } + group->state = INFORM_ERROR; + spin_unlock(&group->lock); + } + spin_unlock_irqrestore(&port->lock, flags); +} + +static void inform_event_handler(struct ib_event_handler *handler, + struct ib_event *event) +{ + struct inform_device *dev; + + dev = ib_get_client_data(event->device, &inform_client); + if (!dev) + return; + + switch (event->event) { + case IB_EVENT_PORT_ERR: + case IB_EVENT_LID_CHANGE: + case IB_EVENT_SM_CHANGE: + case IB_EVENT_CLIENT_REREGISTER: + inform_groups_lost(&dev->port[event->element.port_num - + dev->start_port]); + break; + default: + break; + } 
+} + +static void inform_add_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) + return; + + dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port, + GFP_KERNEL); + if (!dev) + return; + + if (device->node_type == RDMA_NODE_IB_SWITCH) + dev->start_port = dev->end_port = 0; + else { + dev->start_port = 1; + dev->end_port = device->phys_port_cnt; + } + + for (i = 0; i <= dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + port->dev = dev; + port->port_num = dev->start_port + i; + spin_lock_init(&port->lock); + port->table = RB_ROOT; + init_completion(&port->comp); + atomic_set(&port->refcount, 1); + } + + dev->device = device; + ib_set_client_data(device, &inform_client, dev); + + INIT_IB_EVENT_HANDLER(&event_handler, device, inform_event_handler); + ib_register_event_handler(&event_handler); +} + +static void inform_remove_one(struct ib_device *device) +{ + struct inform_device *dev; + struct inform_port *port; + int i; + + dev = ib_get_client_data(device, &inform_client); + if (!dev) + return; + + ib_unregister_event_handler(&event_handler); + flush_workqueue(inform_wq); + + for (i = 0; i < dev->end_port - dev->start_port; i++) { + port = &dev->port[i]; + deref_port(port); + wait_for_completion(&port->comp); + } + + kfree(dev); +} + +int notice_init(void) +{ + int ret; + + inform_wq = create_singlethread_workqueue("ib_inform_wq"); + if (!inform_wq) + return -ENOMEM; + + ib_sa_register_client(&sa_client); + + ret = ib_register_client(&inform_client); + if (ret) + goto err; + return 0; + +err: + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); + return ret; +} + +void notice_cleanup(void) +{ + ib_unregister_client(&inform_client); + ib_sa_unregister_client(&sa_client); + destroy_workqueue(inform_wq); +} diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h index 24c93fd..31cde28 
100644 --- a/drivers/infiniband/core/sa.h +++ b/drivers/infiniband/core/sa.h @@ -63,4 +63,21 @@ int ib_sa_mcmember_rec_query(struct ib_s int mcast_init(void); void mcast_cleanup(void); +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_inform *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query); + +int notice_dispatch(struct ib_device *device, u8 port_num, + struct ib_sa_notice *notice); + +int notice_init(void); +void notice_cleanup(void); + #endif /* SA_H */ diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index ea78687..88c228c 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -61,10 +61,12 @@ struct ib_sa_sm_ah { struct ib_sa_port { struct ib_mad_agent *agent; + struct ib_mad_agent *notice_agent; struct ib_sa_sm_ah *sm_ah; struct work_struct update_task; spinlock_t ah_lock; u8 port_num; + struct ib_device *device; }; struct ib_sa_device { @@ -101,6 +103,12 @@ struct ib_sa_mcmember_query { struct ib_sa_query sa_query; }; +struct ib_sa_inform_query { + void (*callback)(int, struct ib_sa_inform *, void *); + void *context; + struct ib_sa_query sa_query; +}; + static void ib_sa_add_one(struct ib_device *device); static void ib_sa_remove_one(struct ib_device *device); @@ -352,6 +360,110 @@ static const struct ib_field service_rec .size_bits = 2*64 }, }; +#define INFORM_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_inform, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_inform *) 0)->field, \ + .field_name = "sa_inform:" #field + +static const struct ib_field inform_table[] = { + { INFORM_FIELD(gid), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 128 }, + { INFORM_FIELD(lid_range_begin), + .offset_words = 4, + .offset_bits = 0, + .size_bits = 16 
}, + { INFORM_FIELD(lid_range_end), + .offset_words = 4, + .offset_bits = 16, + .size_bits = 16 }, + { RESERVED, + .offset_words = 5, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(is_generic), + .offset_words = 5, + .offset_bits = 16, + .size_bits = 8 }, + { INFORM_FIELD(subscribe), + .offset_words = 5, + .offset_bits = 24, + .size_bits = 8 }, + { INFORM_FIELD(type), + .offset_words = 6, + .offset_bits = 0, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.trap_num), + .offset_words = 6, + .offset_bits = 16, + .size_bits = 16 }, + { INFORM_FIELD(trap.generic.qpn), + .offset_words = 7, + .offset_bits = 0, + .size_bits = 24 }, + { RESERVED, + .offset_words = 7, + .offset_bits = 24, + .size_bits = 3 }, + { INFORM_FIELD(trap.generic.resp_time), + .offset_words = 7, + .offset_bits = 27, + .size_bits = 5 }, + { RESERVED, + .offset_words = 8, + .offset_bits = 0, + .size_bits = 8 }, + { INFORM_FIELD(trap.generic.producer_type), + .offset_words = 8, + .offset_bits = 8, + .size_bits = 24 }, +}; + +#define NOTICE_FIELD(field) \ + .struct_offset_bytes = offsetof(struct ib_sa_notice, field), \ + .struct_size_bytes = sizeof ((struct ib_sa_notice *) 0)->field, \ + .field_name = "sa_notice:" #field + +static const struct ib_field notice_table[] = { + { NOTICE_FIELD(is_generic), + .offset_words = 0, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(type), + .offset_words = 0, + .offset_bits = 1, + .size_bits = 7 }, + { NOTICE_FIELD(trap.generic.producer_type), + .offset_words = 0, + .offset_bits = 8, + .size_bits = 24 }, + { NOTICE_FIELD(trap.generic.trap_num), + .offset_words = 1, + .offset_bits = 0, + .size_bits = 16 }, + { NOTICE_FIELD(issuer_lid), + .offset_words = 1, + .offset_bits = 16, + .size_bits = 16 }, + { NOTICE_FIELD(notice_toggle), + .offset_words = 2, + .offset_bits = 0, + .size_bits = 1 }, + { NOTICE_FIELD(notice_count), + .offset_words = 2, + .offset_bits = 1, + .size_bits = 15 }, + { NOTICE_FIELD(data_details), + .offset_words = 2, + 
.offset_bits = 16, + .size_bits = 432 }, + { NOTICE_FIELD(issuer_gid), + .offset_words = 16, + .offset_bits = 0, + .size_bits = 128 }, +}; + static void free_sm_ah(struct kref *kref) { struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref); @@ -890,6 +1002,156 @@ err1: return ret; } +static void ib_sa_inform_callback(struct ib_sa_query *sa_query, + int status, + struct ib_sa_mad *mad) +{ + struct ib_sa_inform_query *query = + container_of(sa_query, struct ib_sa_inform_query, sa_query); + + if (mad) { + struct ib_sa_inform rec; + + ib_unpack(inform_table, ARRAY_SIZE(inform_table), + mad->data, &rec); + query->callback(status, &rec, query->context); + } else + query->callback(status, NULL, query->context); +} + +static void ib_sa_inform_release(struct ib_sa_query *sa_query) +{ + kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query)); +} + +/** + * ib_sa_informinfo_query - Start an InformInfo registration. + * @client:SA client + * @device:device to send query on + * @port_num: port number to send query on + * @rec:Inform record to send in query + * @comp_mask:component mask to send in query + * @timeout_ms:time to wait for response + * @gfp_mask:GFP mask to use for internal allocations + * @callback:function called when notice handler registration completes, + * times out or is canceled + * @context:opaque user context passed to callback + * @sa_query:query context, used to cancel query + * + * This function sends inform info to register with SA to receive + * in-service notice. + * The callback function will be called when the query completes (or + * fails); status is 0 for a successful response, -EINTR if the query + * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error + * occurred sending the query. The resp parameter of the callback is + * only valid if status is 0. + * + * If the return value of ib_sa_informinfo_query() is negative, it is an + * error code. 
Otherwise it is a query ID that can be used to cancel + * the query. + */ +int ib_sa_informinfo_query(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, u8 method, + struct ib_sa_inform *rec, + ib_sa_comp_mask comp_mask, + int timeout_ms, gfp_t gfp_mask, + void (*callback)(int status, + struct ib_sa_inform *resp, + void *context), + void *context, + struct ib_sa_query **sa_query) +{ + struct ib_sa_inform_query *query; + struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client); + struct ib_sa_port *port; + struct ib_mad_agent *agent; + struct ib_sa_mad *mad; + int ret; + + if (!sa_dev) + return -ENODEV; + + port = &sa_dev->port[port_num - sa_dev->start_port]; + agent = port->agent; + + query = kmalloc(sizeof *query, gfp_mask); + if (!query) + return -ENOMEM; + + query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0, + 0, IB_MGMT_SA_HDR, + IB_MGMT_SA_DATA, gfp_mask); + if (!query->sa_query.mad_buf) { + ret = -ENOMEM; + goto err1; + } + + ib_sa_client_get(client); + query->sa_query.client = client; + query->callback = callback; + query->context = context; + + mad = query->sa_query.mad_buf->mad; + init_mad(mad, agent); + + query->sa_query.callback = callback ? 
ib_sa_inform_callback : NULL; + query->sa_query.release = ib_sa_inform_release; + query->sa_query.port = port; + mad->mad_hdr.method = method; + mad->mad_hdr.attr_id = cpu_to_be16(IB_SA_ATTR_INFORM_INFO); + mad->sa_hdr.comp_mask = comp_mask; + + ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data); + + *sa_query = &query->sa_query; + ret = send_mad(&query->sa_query, timeout_ms, gfp_mask); + if (ret < 0) + goto err2; + + return ret; + +err2: + *sa_query = NULL; + ib_sa_client_put(query->sa_query.client); + ib_free_send_mad(query->sa_query.mad_buf); +err1: + kfree(query); + return ret; +} + +static void ib_sa_notice_resp(struct ib_sa_port *port, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_mad_send_buf *mad_buf; + struct ib_sa_mad *mad; + int ret; + + mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0, + IB_MGMT_SA_HDR, IB_MGMT_SA_DATA, + GFP_KERNEL); + if (!mad_buf) + return; + + mad = mad_buf->mad; + memcpy(mad, &mad_recv_wc->recv_buf.mad, sizeof *mad); + mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP; + + spin_lock_irq(&port->ah_lock); + kref_get(&port->sm_ah->ref); + mad_buf->context[0] = &port->sm_ah->ref; + mad_buf->ah = port->sm_ah->ah; + spin_unlock_irq(&port->ah_lock); + + ret = ib_post_send_mad(mad_buf, NULL); + if (ret) + goto err; + + return; +err: + kref_put(mad_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_buf); +} + static void send_handler(struct ib_mad_agent *agent, struct ib_mad_send_wc *mad_send_wc) { @@ -944,9 +1206,36 @@ static void recv_handler(struct ib_mad_a ib_free_recv_mad(mad_recv_wc); } +static void notice_resp_handler(struct ib_mad_agent *agent, + struct ib_mad_send_wc *mad_send_wc) +{ + kref_put(mad_send_wc->send_buf->context[0], free_sm_ah); + ib_free_send_mad(mad_send_wc->send_buf); +} + +static void notice_handler(struct ib_mad_agent *mad_agent, + struct ib_mad_recv_wc *mad_recv_wc) +{ + struct ib_sa_port *port; + struct ib_sa_mad *mad; + struct ib_sa_notice notice; + + port = mad_agent->context; + 
mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad; + ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice); + + if (!notice_dispatch(port->device, port->port_num, &notice)) + ib_sa_notice_resp(port, mad_recv_wc); + ib_free_recv_mad(mad_recv_wc); +} + static void ib_sa_add_one(struct ib_device *device) { struct ib_sa_device *sa_dev; + struct ib_mad_reg_req reg_req = { + .mgmt_class = IB_MGMT_CLASS_SUBN_ADM, + .mgmt_class_version = 2 + }; int s, e, i; if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB) @@ -980,6 +1269,16 @@ static void ib_sa_add_one(struct ib_devi if (IS_ERR(sa_dev->port[i].agent)) goto err; + sa_dev->port[i].device = device; + set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask); + sa_dev->port[i].notice_agent = + ib_register_mad_agent(device, i + s, IB_QPT_GSI, + &reg_req, 0, notice_resp_handler, + notice_handler, &sa_dev->port[i]); + + if (IS_ERR(sa_dev->port[i].notice_agent)) + goto err; + INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah, &sa_dev->port[i]); } @@ -1003,8 +1302,14 @@ static void ib_sa_add_one(struct ib_devi return; err: - while (--i >= 0) - ib_unregister_mad_agent(sa_dev->port[i].agent); + while (--i >= 0) { + if (!IS_ERR(sa_dev->port[i].notice_agent)) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); + } + if (!IS_ERR(sa_dev->port[i].agent)) { + ib_unregister_mad_agent(sa_dev->port[i].agent); + } + } kfree(sa_dev); @@ -1024,6 +1329,7 @@ static void ib_sa_remove_one(struct ib_d flush_scheduled_work(); for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) { + ib_unregister_mad_agent(sa_dev->port[i].notice_agent); ib_unregister_mad_agent(sa_dev->port[i].agent); kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah); } @@ -1052,7 +1358,15 @@ static int __init ib_sa_init(void) goto err2; } + ret = notice_init(); + if (ret) { + printk(KERN_ERR "Couldn't initialize notice handling\n"); + goto err3; + } + return 0; +err3: + mcast_cleanup(); err2: ib_unregister_client(&sa_client); err1: @@ -1062,6 
+1376,7 @@ err1: static void __exit ib_sa_cleanup(void) { mcast_cleanup(); + notice_cleanup(); ib_unregister_client(&sa_client); idr_destroy(&query_idr); } diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h index 3b957e5..1bbf88a 100644 --- a/include/rdma/ib_sa.h +++ b/include/rdma/ib_sa.h @@ -254,6 +254,143 @@ struct ib_sa_service_rec { u64 data64[2]; }; +enum { + IB_SA_EVENT_TYPE_FATAL = 0x0, + IB_SA_EVENT_TYPE_URGENT = 0x1, + IB_SA_EVENT_TYPE_SECURITY = 0x2, + IB_SA_EVENT_TYPE_SM = 0x3, + IB_SA_EVENT_TYPE_INFO = 0x4, + IB_SA_EVENT_TYPE_EMPTY = 0x7F, + IB_SA_EVENT_TYPE_ALL = 0xFFFF +}; + +enum { + IB_SA_EVENT_PRODUCER_TYPE_CA = 0x1, + IB_SA_EVENT_PRODUCER_TYPE_SWITCH = 0x2, + IB_SA_EVENT_PRODUCER_TYPE_ROUTER = 0x3, + IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER = 0x4, + IB_SA_EVENT_PRODUCER_TYPE_ALL = 0xFFFFFF +}; + +enum { + IB_SA_SM_TRAP_GID_IN_SERVICE = 64, + IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65, + IB_SA_SM_TRAP_CREATE_MC_GROUP = 66, + IB_SA_SM_TRAP_DELETE_MC_GROUP = 67, + IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128, + IB_SA_SM_TRAP_LINK_INTEGRITY = 129, + IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130, + IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131, + IB_SA_SM_TRAP_BAD_M_KEY = 256, + IB_SA_SM_TRAP_BAD_P_KEY = 257, + IB_SA_SM_TRAP_BAD_Q_KEY = 258, + IB_SA_SM_TRAP_ALL = 0xFFFF +}; + +#define IB_SA_INFORM_GID IB_SA_COMP_MASK( 0) +#define IB_SA_INFORM_LID_RANGE_BEGIN IB_SA_COMP_MASK( 1) +#define IB_SA_INFORM_LID_RANGE_END IB_SA_COMP_MASK( 2) +/* reserved: 3 */ +#define IB_SA_INFORM_IS_GENERIC IB_SA_COMP_MASK( 4) +#define IB_SA_INFORM_SUBCRIBE IB_SA_COMP_MASK( 5) +#define IB_SA_INFORM_TYPE IB_SA_COMP_MASK( 6) + +#define IB_SA_INFORM_TRAP_NUMBER IB_SA_COMP_MASK( 7) +#define IB_SA_INFORM_DEVICE_ID IB_SA_COMP_MASK( 7) +#define IB_SA_INFORM_QPN IB_SA_COMP_MASK( 8) +/* reserved: 9 */ +#define IB_SA_INFORM_RESP_TIME IB_SA_COMP_MASK(10) +/* reserved: 11 */ +#define IB_SA_INFORM_PRODUCER_TYPE IB_SA_COMP_MASK(12) +#define IB_SA_INFORM_VENDOR_ID IB_SA_COMP_MASK(12) + +struct 
ib_sa_inform { + union ib_gid gid; + __be16 lid_range_begin; + __be16 lid_range_end; + u8 is_generic; + u8 subscribe; + __be16 type; + union { + struct { + __be16 trap_num; + __be32 qpn; + u8 resp_time; + __be32 producer_type; + } generic; + struct { + __be16 device_id; + __be32 qpn; + u8 resp_time; + __be32 vendor_id; + } vendor; + } trap; +}; + +struct ib_sa_notice { + u8 is_generic; + u8 type; + union { + struct { + __be32 producer_type; + __be16 trap_num; + } generic; + struct { + __be32 vendor_id; + __be16 device_id; + } vendor; + } trap; + __be16 issuer_lid; + __be16 notice_count; + u8 notice_toggle; + /* + * Align data 16 bits off 64 bit field to match InformInfo definition. + * Data contained within this field will then align properly. + * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1. + */ + u8 reserved[5]; + u8 data_details[54]; + union ib_gid issuer_gid; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_GID_IN_SERVICE = 64 + * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE = 65 + * IB_SA_SM_TRAP_CREATE_MC_GROUP = 66 + * IB_SA_SM_TRAP_DELETE_MC_GROUP = 67 + */ +struct ib_sa_notice_data_gid { + u8 reserved[6]; + u8 gid[16]; + u8 padding[32]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_PORT_CHANGE_STATE = 128 + */ +struct ib_sa_notice_data_port_change { + __be16 lid; + u8 padding[52]; +}; + +/* + * SM notice data details for: + * + * IB_SA_SM_TRAP_LINK_INTEGRITY = 129 + * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN = 130 + * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = 131 + */ +struct ib_sa_notice_data_port_error { + u8 reserved[2]; + __be16 lid; + u8 port_num; + u8 padding[49]; +}; + struct ib_sa_client { atomic_t users; struct completion comp; @@ -387,4 +524,54 @@ int ib_init_ah_from_path(struct ib_devic struct ib_sa_path_rec *rec, struct ib_ah_attr *ah_attr); +struct ib_inform_info { + void *context; + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice); + u16 trap_number; +}; + +/** + * 
ib_sa_register_inform_info - Registers to receive notice events. + * @device: Device associated with the registration. + * @port_num: Port on the specified device to associate with the registration. + * @trap_number: InformInfo trap number to register for. + * @gfp_mask: GFP mask for memory allocations. + * @callback: User callback invoked once the registration completes and to + * report noticed events. + * @context: User specified context stored with the ib_inform_info structure. + * + * This call initiates a registration request with the SA for the specified + * trap number. If the operation is started successfully, it returns + * an ib_inform_info structure that is used to track the registration operation. + * Users must free this structure by calling ib_sa_unregister_inform_info, + * even if the operation later fails. (The callback status is non-zero.) + * + * If the registration fails, the callback status will be non-zero. If the registration + * succeeds, the callback status will be zero, but the notice parameter will + * be NULL. If the notice parameter is not NULL, a trap or notice is being + * reported to the user. + * + * A status of -ENETRESET indicates that an error occurred which requires + * reregistration. + */ +struct ib_inform_info * +ib_sa_register_inform_info(struct ib_sa_client *client, + struct ib_device *device, u8 port_num, + u16 trap_number, gfp_t gfp_mask, + int (*callback)(int status, + struct ib_inform_info *info, + struct ib_sa_notice *notice), + void *context); + +/** + * ib_sa_unregister_inform_info - Releases an InformInfo registration. + * @info: InformInfo registration tracking structure. + * + * This call blocks until the registration request is destroyed. It may + * not be called from within the registration callback. 
+ */ +void ib_sa_unregister_inform_info(struct ib_inform_info *info); + #endif /* IB_SA_H */ From bugzilla-daemon at openib.org Thu Dec 14 16:25:43 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 14 Dec 2006 16:25:43 -0800 (PST) Subject: [openib-general] [Bug 159] OFED1.0: Missing interfaces Message-ID: <20061215002543.2EF132283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=159 ------- Comment #5 from sean.hefty at intel.com 2006-12-14 16:25 ------- A proposed interface and implementation to register for SA events is available at: git://staging.openfabrics.org/~shefty/rdma-dev.git under the branch informinfo. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From xma at us.ibm.com Thu Dec 14 16:53:26 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 14 Dec 2006 16:53:26 -0800 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061210134137.GL29174@mellanox.co.il> Message-ID: Hi, Michael, Tried this patch, it didn't work on ehca. I couldn't change the mode from datagram to connected from /sys/class. And when unloading ib_ipoib module, all the connections to that node were gone, and rmmod ib_ipoib hung. The kernel is 2.6.19. Thanks Shirley Ma From mshefty at ichips.intel.com Thu Dec 14 17:22:11 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 14 Dec 2006 17:22:11 -0800 Subject: [openib-general] librdmacm git repos needs config dir In-Reply-To: <1166131727.12420.9.camel@stevo-desktop> References: <1166131727.12420.9.camel@stevo-desktop> Message-ID: <4581F8C3.9040203@ichips.intel.com> > The librdmacm git repository needs a config dir or autoconf changes to > make that dir as part of config. I'm not an autoconf wiz, so I just > created the config dir and put a hidden file named .gitignore in it for > libamso. 
That way it's created when folks clone it. Dunno if that's the > best way, but it worked... I committed a patch to the master and rdma_ucm-abi3 branches that should auto-create the config directory if it's not there. I also updated libibcm. - Sean From mst at mellanox.co.il Thu Dec 14 21:07:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 15 Dec 2006 07:07:22 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: References: Message-ID: <20061215050722.GF19449@mellanox.co.il> > What saves us for the MTT table is that with your patch the device > never writes to the MTT table at all. Yes. -- MST From mst at mellanox.co.il Thu Dec 14 21:08:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 15 Dec 2006 07:08:16 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: References: Message-ID: <20061215050816.GG19449@mellanox.co.il> > > > What saves us for the MTT table is that with your patch the device > > > never writes to the MTT table at all. > > > > Except for the reserved MTTs. > > Good point. So I guess we need a patch that makes sure all reserved > MTTs are given their own ICM chunk (which doesn't need to be in > lowmem) to fix things. Or just round up the # of reserved MTTs to CPU cache line size. -- MST From mst at mellanox.co.il Thu Dec 14 21:14:38 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 15 Dec 2006 07:14:38 +0200 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: References: <20061210134137.GL29174@mellanox.co.il> Message-ID: <20061215051438.GH19449@mellanox.co.il> > Hi, Michael, > > Tried this patch, it didn't work on ehca. I couldn't change the mode from > datagram to connected from /sys/class. It's working as designed in that respect. 
ehca does not implement SRQ. Without SRQ, there is no way to prepost receive buffers for a reasonable number of connections without running out of memory, so it is falling back on datagram mode. Ask the ehca developers to implement SRQ and connected mode will be enabled. > And when unloading ib_ipoib module, all the connections to that node were gone, > and rmmod ib_ipoib hung. The kernel is 2.6.19. Probably a bug in error handling somewhere. Post the sysrq-t trace and I'll take a look. -- MST From kliteyn at dev.mellanox.co.il Thu Dec 14 21:51:09 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 15 Dec 2006 07:51:09 +0200 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <1166127103.28709.140656.camel@hal.voltaire.com> References: <4581ACE5.9000109@dev.mellanox.co.il> <1166127103.28709.140656.camel@hal.voltaire.com> Message-ID: <458237CD.80608@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: >> Hi Hal >> >> This patch fixes a bug that caused ucast manager to return >> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. >> Added a boolean flag that marks whether there was some change or not >> (in which case OSM_SIGNAL_DONE should be returned). > > Just wondering what is the test case for this? I found it while working on FatTree routing. The problem appears when a routing engine fills all the forwarding tables and osm_ucast_mgr_set_fwd_table() then decides that all the tables are identical to what is already set on the switches, so there is nothing to send. 
> > -- Hal > > From or.gerlitz at gmail.com Thu Dec 14 21:57:27 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 15 Dec 2006 07:57:27 +0200 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <4581C4B5.5020702@ichips.intel.com> References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com> <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> <4581C4B5.5020702@ichips.intel.com> Message-ID: <15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com> On 12/14/06, Sean Hefty wrote: > > I see. I understand that there is some code which is part of OFED > > (udapl) that uses this api, what were you thinking to suggest them to > > do in the spirit of this code you have posted being the basis for OFED > > 1.2 ? > > DAPL has been updated to remove its use of these calls. The rdma cm timeout is > essentially 1 minute now. Cool. Before sending the original email I looked at both Arlin's git tree at OFA staging and the svn, and the code that uses these calls is still there, so where are the updated udapl sources? Or. From mst at mellanox.co.il Thu Dec 14 22:31:27 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 15 Dec 2006 08:31:27 +0200 Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by filling MTTs directly In-Reply-To: References: Message-ID: <20061215063127.GB27865@mellanox.co.il> > > > With current code firmware might be doing WRITE_MTT while CPU is writing to the > > > same cache line, and I expect this might confuse things, but it seems that with > > > my fmr/mr merge patch, we never have both CPU and firmware write to the same > > > MTT entries. > > > > > > So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code > > > sufficient? 
> > Yes, assuming that the CPU is the only entity ever writing to the MTT > table, then doing pci_dma_sync_sg_for_cpu() before writing and > pci_dma_sync_sg_for_device() afterwards should be OK. I think. However, for MPTs it seems the best we can do is allocate them out of coherent memory. -- MST From philippe_bernadat at hp.com Thu Dec 14 23:58:32 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Fri, 15 Dec 2006 08:58:32 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire (lustre) In-Reply-To: <20061214173145.GC12781@mellanox.co.il> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538304@idaexc03.emea.cpqcorp.net> I have set tavor_quirk to 1 with no effect. Another thing I have tried is the same lustre LNET echo test with a single thread (vs 8) VIB: 400 MB/s OFED-1.1: 333 MB/s I am posting the live param values for all infiniband modules in case someone could identify some wrong setting: infiniband/core/ib_cm mra_timeout_limit 30000 infiniband/core/rdma_cm max_cm_retries 15 tavor_quirk 1 infiniband/hw/ipath/ib_ipath cfgports 0 debug 1 disable_sma 0 kpiobufs 0 lkey_table_size 12 max_ahs 65535 max_cqes 196607 max_cqs 131071 max_mcast_grps 16384 max_mcast_qp_attached 16 max_pds 65535 max_qps 16384 max_qp_wrs 16383 max_sges 96 max_srqs 1024 max_srq_sges 128 max_srq_wrs 131071 qp_table_size 251 infiniband/hw/mthca/ib_mthca catas_reset_disable 0 debug_level 0 fmr_reserved_mtts 262144 fw_cmd_doorbell 0 msi 0 msi_x 1 num_cq 65536 num_mcg 8192 num_mpt 131072 num_mtt 1048576 num_qp 65536 num_udav 32768 rdb_per_qp 4 tune_pci 1 infiniband/ulp/ipoib/ib_ipoib debug_level 0 mcast_debug_level 0 recv_queue_size 128 send_queue_size 64 Philippe > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Thursday, December 14, 2006 6:32 PM > To: Roland Dreier > Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; > openib-general at openib.org > Subject: Re: Performance Degradation with OFED v. 
Voltaire > > > > I think Eric described the major differences earlier on, > here it is, see > > > second half: > > > > OK, I forgot about that. > > > > I guess one last thing to check would be the MTU being used > for the RC > > connections. Since this is PCI-X HW then the MTU should be 1024 for > > best throughput (instead of the max MTU of 2048). > > The MTU issue is described in the OFED release notes. > You must turn the Tavor work-around for it on in opensm. > This was introduced late in release cycle to it was deemed safer > to make it off by default. > > By the way, Eitan, Hal, can we turn this on by default now? > This was we'll get more feedback from people, and we'll still have > time to turn it off before release if this unexpectedly > creates issues. > > -- > MST > From philippe_bernadat at hp.com Fri Dec 15 00:44:14 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Fri, 15 Dec 2006 09:44:14 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire (lustre) Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net> I also looked at the HCA counters, and I indeed think there is something wrong about the MTU: For the same test With VIB PortXmitData: 2684490382 PortRcvData: 1750145 PortXmitPkts: 10280007 PortRcvPkts: 49962 With OFED XmtBytes:........................2653730483 RcvBytes:........................1710541 XmtPkts:.........................5160009 RcvPkts:.........................50012 Which means we sent half less packets with OFED and if you do the math it is 2K packets with OFED (counters are 32bit units) and 1K packets with VIB. So fo some reason the tavor_quirk param is ignored/overwriten. Is there an interface to control this ? Philippe > -----Original Message----- > From: Bernadat, Philippe > Sent: Friday, December 15, 2006 8:59 AM > To: Michael S. 
Tsirkin; Roland Dreier > Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org > Subject: RE: Performance Degradation with OFED v. Voltaire (lustre) > > I have set tavor_quirk to 1 with no effect. > Another thing I have tried is the same lustre > LNET echo test with a single thread (vs 8) > > VIB: 400 MB/s > OFED-1.1: 333 MB/s > > I am posting the live param values for all infiniband > modules in case someone could identify some wrong setting: > > infiniband/core/ib_cm > > mra_timeout_limit 30000 > > infiniband/core/rdma_cm > > max_cm_retries 15 > tavor_quirk 1 > > infiniband/hw/ipath/ib_ipath > > cfgports 0 > debug 1 > disable_sma 0 > kpiobufs 0 > lkey_table_size 12 > max_ahs 65535 > max_cqes 196607 > max_cqs 131071 > max_mcast_grps 16384 > max_mcast_qp_attached 16 > max_pds 65535 > max_qps 16384 > max_qp_wrs 16383 > max_sges 96 > max_srqs 1024 > max_srq_sges 128 > max_srq_wrs 131071 > qp_table_size 251 > > infiniband/hw/mthca/ib_mthca > > catas_reset_disable 0 > debug_level 0 > fmr_reserved_mtts 262144 > fw_cmd_doorbell 0 > msi 0 > msi_x 1 > num_cq 65536 > num_mcg 8192 > num_mpt 131072 > num_mtt 1048576 > num_qp 65536 > num_udav 32768 > rdb_per_qp 4 > tune_pci 1 > > infiniband/ulp/ipoib/ib_ipoib > > debug_level 0 > mcast_debug_level 0 > recv_queue_size 128 > send_queue_size 64 > > Philippe > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Thursday, December 14, 2006 6:32 PM > > To: Roland Dreier > > Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; > > openib-general at openib.org > > Subject: Re: Performance Degradation with OFED v. Voltaire > > > > > > I think Eric described the major differences earlier on, > > here it is, see > > > > second half: > > > > > > OK, I forgot about that. > > > > > > I guess one last thing to check would be the MTU being used > > for the RC > > > connections. 
 Since this is PCI-X HW then the MTU should > be 1024 for > > > best throughput (instead of the max MTU of 2048). > > > > The MTU issue is described in the OFED release notes. > > You must turn the Tavor work-around for it on in opensm. > > This was introduced late in release cycle to it was deemed safer > > to make it off by default. > > > > By the way, Eitan, Hal, can we turn this on by default now? > > This was we'll get more feedback from people, and we'll still have > > time to turn it off before release if this unexpectedly > > creates issues. > > > > -- > > MST > >
From mlleinin at hpcn.ca.sandia.gov Fri Dec 15 02:19:28 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Fri, 15 Dec 2006 02:19:28 -0800 Subject: [openib-general] Performance Degradation with OFED v.
Voltaire (lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net> Message-ID: <1166177968.21763.116.camel@localhost> On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote: > I also looked at the HCA counters, and I indeed think > there is something wrong about the MTU: > > For the same test > > With VIB > > PortXmitData: 2684490382 > PortRcvData: 1750145 > PortXmitPkts: 10280007 > PortRcvPkts: 49962 > > With OFED > > XmtBytes:........................2653730483 > RcvBytes:........................1710541 > XmtPkts:.........................5160009 > RcvPkts:.........................50012 > > Which means we sent half less packets with OFED > and if you do the math it is 2K packets with OFED (counters are 32bit > units) > and 1K packets with VIB. > > So fo some reason the tavor_quirk param is ignored/overwriten. > Is there an interface to control this ? Michael said you have to turn on this feature in OpenSM. From the release notes I'm not sure how you turn it on in OpenSM. You did turn on the tavor mtu work around in the rdma_cm, but did you turn it on in OpenSM? Also what version of OpenSM are you running? Thanks, - Matt > > Philippe > > > -----Original Message----- > > From: Bernadat, Philippe > > Sent: Friday, December 15, 2006 8:59 AM > > To: Michael S. Tsirkin; Roland Dreier > > Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org > > Subject: RE: Performance Degradation with OFED v. Voltaire (lustre) > > > > I have set tavor_quirk to 1 with no effect. 
> > Another thing I have tried is the same lustre > > LNET echo test with a single thread (vs 8) > > > > VIB: 400 MB/s > > OFED-1.1: 333 MB/s > > > > I am posting the live param values for all infiniband > > modules in case someone could identify some wrong setting: > > > > infiniband/core/ib_cm > > > > mra_timeout_limit 30000 > > > > infiniband/core/rdma_cm > > > > max_cm_retries 15 > > tavor_quirk 1 > > > > infiniband/hw/ipath/ib_ipath > > > > cfgports 0 > > debug 1 > > disable_sma 0 > > kpiobufs 0 > > lkey_table_size 12 > > max_ahs 65535 > > max_cqes 196607 > > max_cqs 131071 > > max_mcast_grps 16384 > > max_mcast_qp_attached 16 > > max_pds 65535 > > max_qps 16384 > > max_qp_wrs 16383 > > max_sges 96 > > max_srqs 1024 > > max_srq_sges 128 > > max_srq_wrs 131071 > > qp_table_size 251 > > > > infiniband/hw/mthca/ib_mthca > > > > catas_reset_disable 0 > > debug_level 0 > > fmr_reserved_mtts 262144 > > fw_cmd_doorbell 0 > > msi 0 > > msi_x 1 > > num_cq 65536 > > num_mcg 8192 > > num_mpt 131072 > > num_mtt 1048576 > > num_qp 65536 > > num_udav 32768 > > rdb_per_qp 4 > > tune_pci 1 > > > > infiniband/ulp/ipoib/ib_ipoib > > > > debug_level 0 > > mcast_debug_level 0 > > recv_queue_size 128 > > send_queue_size 64 > > > > Philippe > > > > > -----Original Message----- > > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > > Sent: Thursday, December 14, 2006 6:32 PM > > > To: Roland Dreier > > > Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; > > > openib-general at openib.org > > > Subject: Re: Performance Degradation with OFED v. Voltaire > > > > > > > > I think Eric described the major differences earlier on, > > > here it is, see > > > > > second half: > > > > > > > > OK, I forgot about that. > > > > > > > > I guess one last thing to check would be the MTU being used > > > for the RC > > > > connections. Since this is PCI-X HW then the MTU should > > be 1024 for > > > > best throughput (instead of the max MTU of 2048). 
> > > > > > The MTU issue is described in the OFED release notes. > > > You must turn the Tavor work-around for it on in opensm. > > > This was introduced late in release cycle to it was deemed safer > > > to make it off by default. > > > > > > By the way, Eitan, Hal, can we turn this on by default now? > > > This was we'll get more feedback from people, and we'll still have > > > time to turn it off before release if this unexpectedly > > > creates issues. > > > > > > -- > > > MST > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jsquyres at cisco.com Fri Dec 15 05:17:26 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Fri, 15 Dec 2006 08:17:26 -0500 Subject: [openib-general] .openfabrics.org names In-Reply-To: <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> Message-ID: <8916AC51-131D-4AE5-A630-E72E5E3A90C1@cisco.com> These names still don't appear to exist. Do we know when they'll be created? On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote: > Who controls the DNS for openfabrics.org? Could we get these names > created? Or -- are there any objections to creating / using such > names? > > Thanks! > > > On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote: > >> The name "staging.openfabrics.org" was really intended to be >> temporary until the old openfabrics.org was taken offline and >> replaced with the new one. >> >> My $0.02 is that we should stop using staging.openfabrics.org as >> soon as possible and create / start using some new names for the >> server to allow for potential transparent service relocation someday. 
>> >> Here are some new name suggestions that could be done immediately >> (with appropriate changes to DNS, apache config, ...and potentially >> others): >> >> * git.openfabrics.org: for all git activity >> * wiki.openfabrics.org: a top-level name for the wiki rather than >> burying it under several layers of links on the web site >> * trac.openfabrics.org: if someone creates this name, I volunteer >> to finally get off my butt and install trac to see if people like it >> >> These are the old names and would need to be changed in DNS only >> when the old server is taken offline / we're ready to move to the >> new server: >> >> * openfabrics.org: redirect to www.openfabrics.org, and for mail >> traffic >> * www.openfabrics.org: main web site >> >> -- >> Jeff Squyres >> Server Virtualization Business Unit >> Cisco Systems >> >> > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From halr at voltaire.com Fri Dec 15 06:33:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 09:33:15 -0500 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <4581ACE5.9000109@dev.mellanox.co.il> References: <4581ACE5.9000109@dev.mellanox.co.il> Message-ID: <1166193153.28709.186595.camel@hal.voltaire.com> Hi again Yevgeny, On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: > Hi Hal > > This patch fixes a bug that caused ucast manager to return > OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. > Added a boolean flag that marks whether there was some change or not > (in which case OSM_SIGNAL_DONE should be returned). 
>
> --
> Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik

Good catch!

Thanks. Applied.

Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?

-- Hal

From halr at voltaire.com Fri Dec 15 07:28:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 10:28:27 -0500
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [1/2]
In-Reply-To: <4581DDEF.7000206@dev.mellanox.co.il>
References: <4581DDEF.7000206@dev.mellanox.co.il>
Message-ID: <1166196463.28709.188818.camel@hal.voltaire.com>

On Thu, 2006-12-14 at 18:27, Yevgeny Kliteynik wrote:
> Hi Hal
>
> This patch (1/2) adds Fat Tree routing engine to OpenSM.
>
> --
> Yevgeny
>
> Signed-off-by: Yevgeny Kliteynik
> ---
> osm/opensm/Makefile.am | 2 +-
> osm/opensm/main.c | 3 ++-
> osm/opensm/osm_opensm.c | 2 ++
> 3 files changed, 5 insertions(+), 2 deletions(-)

Thanks. Applied.

Note that these patches were in the reverse order.

-- Hal

From halr at voltaire.com Fri Dec 15 07:36:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 10:36:10 -0500
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [2/2]
In-Reply-To: <4581DDFF.2000903@dev.mellanox.co.il>
References: <4581DDFF.2000903@dev.mellanox.co.il>
Message-ID: <1166196836.28709.188922.camel@hal.voltaire.com>

Hi Yevgeny,

On Thu, 2006-12-14 at 18:27, Yevgeny Kliteynik wrote:
> Hi Hal
>
> This patch (2/2) adds Fat Tree routing engine to OpenSM.

Thanks! Applied. I played with it a little and will look more at it
going forward.

A couple of questions:
Is this algorithm currently considered experimental ?
Are there any simulator tests/regressions for this ?

Also, could you or Eitan update doc/current-routing.txt with a
description of the fat tree algorithm and send that patch to me ?

-- Hal

From dotanb at dev.mellanox.co.il Fri Dec 15 08:25:20 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Fri, 15 Dec 2006 18:25:20 +0200 (IST)
Subject: [openib-general] can i use the multicast module in user level?
Message-ID: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il>

Hi Sean.

I would like to use the multicast module in user level tests (in order to
send a join message to the multicast groups that I'm using).

Can I use the multicast module in user level?
(if the answer is yes, is there any code reference that I can use?)

thanks
Dotan

From eitan at mellanox.co.il Fri Dec 15 09:04:08 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 15 Dec 2006 19:04:08 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang'
In-Reply-To: <1166193153.28709.186595.camel@hal.voltaire.com>
References: <4581ACE5.9000109@dev.mellanox.co.il> <1166193153.28709.186595.camel@hal.voltaire.com>
Message-ID: <4582D588.2070506@mellanox.co.il>

Hal Rosenstock wrote:
> Hi again Yevgeny,
>
> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
>
>> Hi Hal
>>
>> This patch fixes a bug that caused ucast manager to return
>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
>> Added a boolean flag that marks whether there was some change or not
>> (in which case OSM_SIGNAL_DONE should be returned).
>>
>> --
>> Yevgeny
>>
>> Signed-off-by: Yevgeny Kliteynik
>>
>
> Good catch!
>
> Thanks. Applied.
>
> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?
>
I think OFED 1.1 does not have the "incremental" routing patch.
So it does not have this bug.
EZ > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From xma at us.ibm.com Fri Dec 15 09:06:03 2006 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 15 Dec 2006 09:06:03 -0800 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061215051438.GH19449@mellanox.co.il> Message-ID: "Michael S. Tsirkin" wrote on 12/14/2006 09:14:38 PM: > > Hi, Michael, > > > > Tried this patch, it didn't work on ehca. I couldn't change the mode from > > datagram to connected from /sys/class. > > It's wroking as designed in that respect. ehca does not implement > srq - without > srq, there is no way to prepost receive buffers for a resonable number of > connections without running out of memory. > > So it is falling back on datagram mode. > Talk to ehca guys to implement srq and connected mode will be enabled. Don't remember SRQ is a MUST for UC mode. Does this patch support devices with SRQ in RC mode? > > And when unloading ib_ipoib module, all the connections to that node gone, > > rmmod ib_ipoib hung. The kernel is 2.6.19. > > Probably a bug in error handling somewhere. > Post the sysrq t trace and I'll take a look. I will recreate the problem and post stack trace later. Thanks Shirley Ma -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mlleinin at hpcn.ca.sandia.gov Fri Dec 15 09:15:24 2006 From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger) Date: Fri, 15 Dec 2006 09:15:24 -0800 Subject: [openib-general] .openfabrics.org names In-Reply-To: <8916AC51-131D-4AE5-A630-E72E5E3A90C1@cisco.com> References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com> <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com> <8916AC51-131D-4AE5-A630-E72E5E3A90C1@cisco.com> Message-ID: <1166202924.21763.124.camel@localhost> On Fri, 2006-12-15 at 08:17 -0500, Jeff Squyres wrote: > These names still don't appear to exist. Do we know when they'll be > created? Intel controls the openfabrics.org domain name. I think Jim or Michael can make this happen. - Matt > > > On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote: > > > Who controls the DNS for openfabrics.org? Could we get these names > > created? Or -- are there any objections to creating / using such > > names? > > > > Thanks! > > > > > > On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote: > > > >> The name "staging.openfabrics.org" was really intended to be > >> temporary until the old openfabrics.org was taken offline and > >> replaced with the new one. > >> > >> My $0.02 is that we should stop using staging.openfabrics.org as > >> soon as possible and create / start using some new names for the > >> server to allow for potential transparent service relocation someday. 
> >>
> >> Here are some new name suggestions that could be done immediately
> >> (with appropriate changes to DNS, apache config, ...and potentially
> >> others):
> >>
> >> * git.openfabrics.org: for all git activity
> >> * wiki.openfabrics.org: a top-level name for the wiki rather than
> >> burying it under several layers of links on the web site
> >> * trac.openfabrics.org: if someone creates this name, I volunteer
> >> to finally get off my butt and install trac to see if people like it
> >>
> >> These are the old names and would need to be changed in DNS only
> >> when the old server is taken offline / we're ready to move to the
> >> new server:
> >>
> >> * openfabrics.org: redirect to www.openfabrics.org, and for mail
> >> traffic
> >> * www.openfabrics.org: main web site
> >>
> >> --
> >> Jeff Squyres
> >> Server Virtualization Business Unit
> >> Cisco Systems
> >>
> >>
> >
> >
> > --
> > Jeff Squyres
> > Server Virtualization Business Unit
> > Cisco Systems
> >
> >
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/
> > openib-general
>
>

From eitan at sw053.yok.mtl.com Fri Dec 15 09:10:58 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Fri, 15 Dec 2006 19:10:58 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-15:normal completion
Message-ID: <200612151710.kBFHAw1V004597@sw053.yok.mtl.com>

OSM Simulation Regression Summary

OpenSM rev = ____
ibutils rev = ____

Total=198 Pass=198 Fail=0

Pass:
27 Stability IS1-16.topo
27 Pkey IS1-16.topo
27 OsmStress IS1-16.topo
27 Multicast IS1-16.topo
27 LidMgr IS1-16.topo
9 Stability IS3-loop.topo
9 Stability IS3-128.topo
9 Pkey IS3-128.topo
9 OsmStress IS3-128.topo
9 Multicast IS3-loop.topo
9 Multicast IS3-128.topo
9 LidMgr IS3-128.topo

Failures:

From jim.ryan at intel.com Fri Dec 15 09:17:47 2006
From:
jim.ryan at intel.com (Ryan, Jim) Date: Fri, 15 Dec 2006 09:17:47 -0800 Subject: [openib-general] .openfabrics.org names Message-ID: <55CE0347B98FCA468923E5FBC25CB4DC4097DD@orsmsx413.amr.corp.intel.com> Michael has done this in the past but he's on sabbatical and unavailable for several weeks. Can someone else do this? Thanks, Jim -----Original Message----- From: Matt Leininger [mailto:mlleinin at hpcn.ca.sandia.gov] Sent: Friday, December 15, 2006 9:15 AM To: Jeff Squyres Cc: openib; Ryan, Jim; Oros, Michael Subject: Re: [openib-general] .openfabrics.org names On Fri, 2006-12-15 at 08:17 -0500, Jeff Squyres wrote: > These names still don't appear to exist. Do we know when they'll be > created? Intel controls the openfabrics.org domain name. I think Jim or Michael can make this happen. - Matt > > > On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote: > > > Who controls the DNS for openfabrics.org? Could we get these names > > created? Or -- are there any objections to creating / using such > > names? > > > > Thanks! > > > > > > On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote: > > > >> The name "staging.openfabrics.org" was really intended to be > >> temporary until the old openfabrics.org was taken offline and > >> replaced with the new one. > >> > >> My $0.02 is that we should stop using staging.openfabrics.org as > >> soon as possible and create / start using some new names for the > >> server to allow for potential transparent service relocation someday. 
> >> > >> Here are some new name suggestions that could be done immediately > >> (with appropriate changes to DNS, apache config, ...and potentially > >> others): > >> > >> * git.openfabrics.org: for all git activity > >> * wiki.openfabrics.org: a top-level name for the wiki rather than > >> burying it under several layers of links on the web site > >> * trac.openfabrics.org: if someone creates this name, I volunteer > >> to finally get off my butt and install trac to see if people like it > >> > >> These are the old names and would need to be changed in DNS only > >> when the old server is taken offline / we're ready to move to the > >> new server: > >> > >> * openfabrics.org: redirect to www.openfabrics.org, and for mail > >> traffic > >> * www.openfabrics.org: main web site > >> > >> -- > >> Jeff Squyres > >> Server Virtualization Business Unit > >> Cisco Systems > >> > >> > > > > > > -- > > Jeff Squyres > > Server Virtualization Business Unit > > Cisco Systems > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > > openib-general > > From eitan at mellanox.co.il Fri Dec 15 09:20:03 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 15 Dec 2006 19:20:03 +0200 Subject: [openib-general] Performance Degradation with OFED v. 
 Voltaire (lustre)
In-Reply-To: <1166177968.21763.116.camel@localhost>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net> <1166177968.21763.116.camel@localhost>
Message-ID: <4582D943.2080403@mellanox.co.il>

Matt Leininger wrote:
> On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
>
>> I also looked at the HCA counters, and I indeed think
>> there is something wrong about the MTU:
>>
>> For the same test
>>
>> With VIB
>>
>> PortXmitData: 2684490382
>> PortRcvData: 1750145
>> PortXmitPkts: 10280007
>> PortRcvPkts: 49962
>>
>> With OFED
>>
>> XmtBytes:........................2653730483
>> RcvBytes:........................1710541
>> XmtPkts:.........................5160009
>> RcvPkts:.........................50012
>>
>> Which means we sent half less packets with OFED
>> and if you do the math it is 2K packets with OFED (counters are 32bit
>> units)
>> and 1K packets with VIB.
>>
>> So fo some reason the tavor_quirk param is ignored/overwriten.
>> Is there an interface to control this ?
>>
>
> Michael said you have to turn on this feature in OpenSM. From the
> release notes I'm not sure how you turn it on in OpenSM. You did turn
> on the tavor mtu work around in the rdma_cm, but did you turn it on in
> OpenSM? Also what version of OpenSM are you running?
>
To turn this option on in opensm you need to:
1. Run: opensm -c -o
2. Modify the file /var/cache/osm/opensm.opts by changing the line below
   enable_quirks FALSE
   to
   enable_quirks TRUE
3. Run: opensm

> Thanks,
>
> - Matt
>
>
>> Philippe
>>
>>
>>> -----Original Message-----
>>> From: Bernadat, Philippe
>>> Sent: Friday, December 15, 2006 8:59 AM
>>> To: Michael S. Tsirkin; Roland Dreier
>>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org
>>> Subject: RE: Performance Degradation with OFED v. Voltaire (lustre)
>>>
>>> I have set tavor_quirk to 1 with no effect.
>>> Another thing I have tried is the same lustre >>> LNET echo test with a single thread (vs 8) >>> >>> VIB: 400 MB/s >>> OFED-1.1: 333 MB/s >>> >>> I am posting the live param values for all infiniband >>> modules in case someone could identify some wrong setting: >>> >>> infiniband/core/ib_cm >>> >>> mra_timeout_limit 30000 >>> >>> infiniband/core/rdma_cm >>> >>> max_cm_retries 15 >>> tavor_quirk 1 >>> >>> infiniband/hw/ipath/ib_ipath >>> >>> cfgports 0 >>> debug 1 >>> disable_sma 0 >>> kpiobufs 0 >>> lkey_table_size 12 >>> max_ahs 65535 >>> max_cqes 196607 >>> max_cqs 131071 >>> max_mcast_grps 16384 >>> max_mcast_qp_attached 16 >>> max_pds 65535 >>> max_qps 16384 >>> max_qp_wrs 16383 >>> max_sges 96 >>> max_srqs 1024 >>> max_srq_sges 128 >>> max_srq_wrs 131071 >>> qp_table_size 251 >>> >>> infiniband/hw/mthca/ib_mthca >>> >>> catas_reset_disable 0 >>> debug_level 0 >>> fmr_reserved_mtts 262144 >>> fw_cmd_doorbell 0 >>> msi 0 >>> msi_x 1 >>> num_cq 65536 >>> num_mcg 8192 >>> num_mpt 131072 >>> num_mtt 1048576 >>> num_qp 65536 >>> num_udav 32768 >>> rdb_per_qp 4 >>> tune_pci 1 >>> >>> infiniband/ulp/ipoib/ib_ipoib >>> >>> debug_level 0 >>> mcast_debug_level 0 >>> recv_queue_size 128 >>> send_queue_size 64 >>> >>> Philippe >>> >>> >>>> -----Original Message----- >>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] >>>> Sent: Thursday, December 14, 2006 6:32 PM >>>> To: Roland Dreier >>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; >>>> openib-general at openib.org >>>> Subject: Re: Performance Degradation with OFED v. Voltaire >>>> >>>> >>>>> > I think Eric described the major differences earlier on, >>>>> >>>> here it is, see >>>> >>>>> > second half: >>>>> >>>>> OK, I forgot about that. >>>>> >>>>> I guess one last thing to check would be the MTU being used >>>>> >>>> for the RC >>>> >>>>> connections. Since this is PCI-X HW then the MTU should >>>>> >>> be 1024 for >>> >>>>> best throughput (instead of the max MTU of 2048). 
>>>>> >>>> The MTU issue is described in the OFED release notes. >>>> You must turn the Tavor work-around for it on in opensm. >>>> This was introduced late in release cycle to it was deemed safer >>>> to make it off by default. >>>> >>>> By the way, Eitan, Hal, can we turn this on by default now? >>>> This was we'll get more feedback from people, and we'll still have >>>> time to turn it off before release if this unexpectedly >>>> creates issues. >>>> >>>> -- >>>> MST >>>> >>>> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Fri Dec 15 09:30:28 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 15 Dec 2006 19:30:28 +0200 Subject: [openib-general] libsdp: RFC changing libsdp.conf location In-Reply-To: <457D3269.3070401@mellanox.co.il> References: <457D2A21.9030804@mellanox.co.il> <20061211102222.GB5944@mellanox.co.il> <457D3269.3070401@mellanox.co.il> Message-ID: <4582DBB4.80605@mellanox.co.il> Hi Roland, Scott, Nimrod, MST, Thanks for your feedbacks on the issue over the last week. What I plan to do: 1. Move the default location to /etc/libsdp.conf 2. Mark the file with %config in so it is not overwritten by the RPM install 3. Change the "make install" to not overwrite the file but to create a file named /etc/libsdp.conf.example if a file exists Eitan Eitan Zahavi wrote: > Hi Michael, > > Thanks. This proposal is simple and clear to me. > Let's wait a day and see if anybody else have other ideas. > > Thanks > > Eitan > > Michael S. 
Tsirkin wrote: > >>> BTW: libsdp.conf used to be overwritten in previous install. >>> I have fixed the nakefile to avoid that and instead create a >>> new file with install date under the same directory. >>> >>> >> Here's a simple proposal that will address this issue: >> - Make libsdp behave sanely when not libsdp.conf file is present. >> Do not install anything in default location in make install. >> >> - in make install, copy the example configuration file into >> libsdp.conf.example. Add a line to the top of it saying >> "rename this file to libsdp.conf to make lbisdp use it". >> >> >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mshefty at ichips.intel.com Fri Dec 15 09:37:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 15 Dec 2006 09:37:59 -0800 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace In-Reply-To: <15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com> References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com> <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> <4581C4B5.5020702@ichips.intel.com> <15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com> Message-ID: <4582DD77.8090208@ichips.intel.com> > cool, before sending the orig email i was looking on both Arlin git > tree at ofa staging and the svn and the code that uses this calls are > still there, so were are the updated udapl sources? Arlin's DAPL tree has an rdma_ucm branch that should match. - Sean From mshefty at ichips.intel.com Fri Dec 15 09:43:41 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 15 Dec 2006 09:43:41 -0800 Subject: [openib-general] can i use the multicast module in user level? 
In-Reply-To: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il> References: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il> Message-ID: <4582DECD.70301@ichips.intel.com> > I would like to use the multicast module in user level tests (in order to > send a join message to the multicast groups that I'm using). > > Can I use the multicast module in user level? > (if the answer is yes, is there is any code reference that I can use?) Multicast support has only been exposed to userspace through the librdmacm. There's a mckey test app that shows how this can be used. I will be working on a raw IB multicast / InformInfo userspace support through January. There is an older userspace SA library that you might be able to play with as well, but you'd have to look back through the mail logs to find the patches. - Sean From halr at voltaire.com Fri Dec 15 10:47:26 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 13:47:26 -0500 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <4582D588.2070506@mellanox.co.il> References: <4581ACE5.9000109@dev.mellanox.co.il> <1166193153.28709.186595.camel@hal.voltaire.com> <4582D588.2070506@mellanox.co.il> Message-ID: <1166208365.28709.195843.camel@hal.voltaire.com> On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi again Yevgeny, > > > > On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: > > > >> Hi Hal > >> > >> This patch fixes a bug that caused ucast manager to return > >> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. > >> Added a boolean flag that marks whether there was some change or not > >> (in which case OSM_SIGNAL_DONE should be returned). > >> > >> -- > >> Yevgeny > >> > >> Signed-off-by: Yevgeny Kliteynik > >> > > > > Good catch! > > > > Thanks. Applied. > > > > Is this issue (and patch or a similar one) also applicable to OFED 1.1 ? > > > I think OFED 1.1 does not have the "incremental" routing patch. 
Right; it doesn't. > So it does not have this bug. Are you sure that the incremental routing caused this to be needed ? By any chance, are you confusing this with a different patch ? Just want to be clear on this... -- Hal From wombat2 at us.ibm.com Fri Dec 15 11:01:37 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Fri, 15 Dec 2006 14:01:37 -0500 Subject: [openib-general] Fw: openib-general Digest, Vol 30, Issue 135 Message-ID: > ----- Message from "Shirley Ma" on Fri, 15 Dec 2006 > 09:06:03 -0800 ----- > > To: > > "Michael S. Tsirkin" > > cc: > > openib-general at openib.org > > Subject: > > Re: [openib-general] [PATCHv2] IPoIB CM Experimental support > > "Michael S. Tsirkin" wrote on 12/14/2006 09:14:38 PM: > > > > Hi, Michael, > > > > > > Tried this patch, it didn't work on ehca. I couldn't change the mode from > > > datagram to connected from /sys/class. > > > > It's wroking as designed in that respect. ehca does not implement > > srq - without > > srq, there is no way to prepost receive buffers for a resonable number of > > connections without running out of memory. > > > > So it is falling back on datagram mode. > > Talk to ehca guys to implement srq and connected mode will be enabled. > Don't remember SRQ is a MUST for UC mode. Does this patch support > devices with SRQ in RC mode? I don't think the IB HCA Spec requires SRQ support for RC but is an optional feature. There are two adapters right now that don't support SRQ which means to use IPoIB-CM on them you should make the use of SRQ an option setting. I agree that if it is available it should be used for scaling issues probably if available automatically set.
But I would like to see us at least support the current hardware that meets the current SPEC. Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.j.woodruff at intel.com Fri Dec 15 11:06:43 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 15 Dec 2006 11:06:43 -0800 Subject: [openib-general] OpenSM core dump - file size exceeded Message-ID: My OpenSM, from the git tree pulled on 12/12/06 died with the following error, looks like the log file got > 2G and then it died. [root at iclust-2 RPMS]# ps -aux | grep opensm Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ root 20256 0.0 0.1 5408 656 pts/4 S+ 12:05 0:00 grep opensm [1]+ File size limit exceeded(core dumped) /usr/local/bin/opensm [root at iclust-2 RPMS]# ls -l /var/log/osm.log -rw-r--r-- 1 root root 2147483647 Dec 14 17:25 /var/log/osm.log [root at iclust-2 RPMS]# woody From halr at voltaire.com Fri Dec 15 11:14:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 14:14:58 -0500 Subject: [openib-general] Performance Degradation with OFED v. 
Voltaire (lustre) In-Reply-To: <4582D943.2080403@mellanox.co.il> References: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net> <1166177968.21763.116.camel@localhost> <4582D943.2080403@mellanox.co.il> Message-ID: <1166210069.28709.196688.camel@hal.voltaire.com> On Fri, 2006-12-15 at 12:20, Eitan Zahavi wrote: > Matt Leininger wrote: > > On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote: > > > >> I also looked at the HCA counters, and I indeed think > >> there is something wrong about the MTU: > >> > >> For the same test > >> > >> With VIB > >> > >> PortXmitData: 2684490382 > >> PortRcvData: 1750145 > >> PortXmitPkts: 10280007 > >> PortRcvPkts: 49962 > >> > >> With OFED > >> > >> XmtBytes:........................2653730483 > >> RcvBytes:........................1710541 > >> XmtPkts:.........................5160009 > >> RcvPkts:.........................50012 > >> > >> Which means we sent half less packets with OFED > >> and if you do the math it is 2K packets with OFED (counters are 32bit > >> units) > >> and 1K packets with VIB. > >> > >> So fo some reason the tavor_quirk param is ignored/overwriten. > >> Is there an interface to control this ? > >> > > > > Michael said you have to turn on this feature in OpenSM. From the > > release notes I'm not sure how you turn it on in OpenSM. You did turn > > on the tavor mtu work around in the rdma_cm, but did you turn it on in > > OpenSM? Also what version of OpenSM are you running? > > > To turn this option on in opensm you need to: > 1. Run: opensm -c -o If you already have an opensm.opts file then you can skip this step. -- Hal > 2. Modify the file /var/cache/osm/opensm.opts by changing the line below > enable_quirks FALSE > to > enable_quirks TRUE > > 3. Run: opensm From ardavis at ichips.intel.com Fri Dec 15 11:30:58 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Fri, 15 Dec 2006 11:30:58 -0800 Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to
userspace In-Reply-To: <4581C4B5.5020702@ichips.intel.com> References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com> <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> <4581C4B5.5020702@ichips.intel.com> Message-ID: <4582F7F2.8040305@ichips.intel.com> Sean Hefty wrote: >>I see. I understand that there is some code which is part of OFED >>(udapl) that uses this api, what were you thinking to suggest them to >>do in the spirit of this code you have posted being the basis for OFED >>1.2 ? >> >> > >DAPL has been updated to remove its use of these calls. The rdma cm timeout is >essentially 1 minute now. If needed a kernel fix can be applied to send an MRA >to increase the timeout, but I'm holding off on doing that unless it's really >needed. > > Not sure if one size fits all. Is one minute sufficient? Can you at least provide module parameters that can override your defaults when the driver loads. It would nice to have some control over extending the accept times if necessary. Maybe something at listen time that could indicate the need to send the MRA with a backoff time? -arlin > > From mshefty at ichips.intel.com Fri Dec 15 11:46:31 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 15 Dec 2006 11:46:31 -0800 Subject: [openib-general] OpenSM core dump - file size exceeded In-Reply-To: References: Message-ID: <4582FB97.6010304@ichips.intel.com> > [root at iclust-2 RPMS]# ps -aux | grep opensm > Warning: bad syntax, perhaps a bogus '-'? 
See > /usr/share/doc/procps-3.2.3/FAQ > root 20256 0.0 0.1 5408 656 pts/4 S+ 12:05 0:00 grep > opensm > [1]+ File size limit exceeded(core dumped) /usr/local/bin/opensm > [root at iclust-2 RPMS]# ls -l /var/log/osm.log > -rw-r--r-- 1 root root 2147483647 Dec 14 17:25 /var/log/osm.log > [root at iclust-2 RPMS]# Looking at the log file, the problem appears to be related to: http://openib.org/pipermail/openib-general/2006-December/029962.html I'm still trying to discover more details. - Sean From eitan at mellanox.co.il Fri Dec 15 12:03:42 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 15 Dec 2006 22:03:42 +0200 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <1166208365.28709.195843.camel@hal.voltaire.com> References: <4581ACE5.9000109@dev.mellanox.co.il> <1166193153.28709.186595.camel@hal.voltaire.com> <4582D588.2070506@mellanox.co.il> <1166208365.28709.195843.camel@hal.voltaire.com> Message-ID: <4582FF9E.3040901@mellanox.co.il> Hal Rosenstock wrote: > On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote: > >> Hal Rosenstock wrote: >> >>> Hi again Yevgeny, >>> >>> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: >>> >>> >>>> Hi Hal >>>> >>>> This patch fixes a bug that caused ucast manager to return >>>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. >>>> Added a boolean flag that marks whether there was some change or not >>>> (in which case OSM_SIGNAL_DONE should be returned). >>>> >>>> -- >>>> Yevgeny >>>> >>>> Signed-off-by: Yevgeny Kliteynik >>>> >>>> >>> Good catch! >>> >>> Thanks. Applied. >>> >>> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ? >>> >>> >> I think OFED 1.1 does not have the "incremental" routing patch. >> > > Right; it doesn't. > > >> So it does not have this bug. >> > > Are you sure that the incremental routing caused this to be needed ? By > any chance, are you confusing this with a different patch ? Just want to > be clear on this... 
> Yes I am sure. Without the new incremental feature every sweep all LFT tables were set. EZ From halr at voltaire.com Fri Dec 15 12:44:03 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 15:44:03 -0500 Subject: [openib-general] OpenSM core dump - file size exceeded In-Reply-To: References: Message-ID: <1166215361.28709.199852.camel@hal.voltaire.com> On Fri, 2006-12-15 at 14:06, Woodruff, Robert J wrote: > My OpenSM, from the git tree pulled on 12/12/06 died with the following > error, > looks like the log file got > 2G and then it died. > > > [root at iclust-2 RPMS]# ps -aux | grep opensm > Warning: bad syntax, perhaps a bogus '-'? See > /usr/share/doc/procps-3.2.3/FAQ > root 20256 0.0 0.1 5408 656 pts/4 S+ 12:05 0:00 grep > opensm > [1]+ File size limit exceeded(core dumped) /usr/local/bin/opensm > [root at iclust-2 RPMS]# ls -l /var/log/osm.log > -rw-r--r-- 1 root root 2147483647 Dec 14 17:25 /var/log/osm.log > [root at iclust-2 RPMS]# Any idea what filled up the log ? but that's a side issue. This has been discussed on the list before. This is one option which can help with this issue: -L, --log_limit This option defines maximal log file size in MB. When specified the log file will be truncated upon reaching this limit. Is this useful ? (It was put in the last time you reported this failure). Also, log rotation will be supported for OFED 1.2 but I've not had a chance to incorporate this yet.
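The size in the listing above, 2147483647 bytes, is 2^31 - 1, which points at a 2 GiB file-size ceiling rather than a full disk. The truncate-on-limit behaviour that the -L description mentions can be sketched roughly as below. This is only an illustration, not the OpenSM code: the capped_log_* names, the byte-based limit, and the reopen-to-truncate approach are all assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical capped log: none of these names come from OpenSM. */
typedef struct {
    FILE *fp;                /* open log stream, NULL if unusable */
    const char *path;        /* path used to reopen (truncate) the file */
    unsigned long max_bytes; /* size cap in bytes, 0 means no cap */
    unsigned long count;     /* bytes written since the last truncation */
} capped_log_t;

/* Open (and truncate) the log file; returns 0 on success, -1 on failure. */
static int capped_log_open(capped_log_t *log, const char *path,
                           unsigned long max_bytes)
{
    assert(log != NULL && path != NULL);
    log->fp = fopen(path, "w");
    log->path = path;
    log->max_bytes = max_bytes;
    log->count = 0;
    return log->fp ? 0 : -1;
}

/* Append msg; if it would push the file past the cap, truncate first. */
static int capped_log_write(capped_log_t *log, const char *msg)
{
    size_t len;

    assert(log != NULL && msg != NULL);
    if (log->fp == NULL)
        return -1;
    len = strlen(msg);
    if (log->max_bytes != 0 && log->count + len > log->max_bytes) {
        /* Reopening in "w" mode discards the old contents. */
        log->fp = freopen(log->path, "w", log->fp);
        if (log->fp == NULL)
            return -1;
        log->count = 0;
    }
    if (fputs(msg, log->fp) == EOF)
        return -1;
    fflush(log->fp);
    log->count += len;
    return 0;
}

/* Close the stream if it is still open. */
static void capped_log_close(capped_log_t *log)
{
    if (log->fp != NULL) {
        fclose(log->fp);
        log->fp = NULL;
    }
}
```

Counting bytes in the writer avoids a stat call per message, at the cost of being wrong if something else appends to the same file.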
-- Hal > woody From halr at voltaire.com Fri Dec 15 13:14:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 16:14:51 -0500 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <4582FF9E.3040901@mellanox.co.il> References: <4581ACE5.9000109@dev.mellanox.co.il> <1166193153.28709.186595.camel@hal.voltaire.com> <4582D588.2070506@mellanox.co.il> <1166208365.28709.195843.camel@hal.voltaire.com> <4582FF9E.3040901@mellanox.co.il> Message-ID: <1166217285.32666.579.camel@hal.voltaire.com> On Fri, 2006-12-15 at 15:03, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote: > > > >> Hal Rosenstock wrote: > >> > >>> Hi again Yevgeny, > >>> > >>> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: > >>> > >>> > >>>> Hi Hal > >>>> > >>>> This patch fixes a bug that caused ucast manager to return > >>>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. > >>>> Added a boolean flag that marks whether there was some change or not > >>>> (in which case OSM_SIGNAL_DONE should be returned). > >>>> > >>>> -- > >>>> Yevgeny > >>>> > >>>> Signed-off-by: Yevgeny Kliteynik > >>>> > >>>> > >>> Good catch! > >>> > >>> Thanks. Applied. > >>> > >>> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ? > >>> > >>> > >> I think OFED 1.1 does not have the "incremental" routing patch. > >> > > > > Right; it doesn't. > > > > > >> So it does not have this bug. > >> > > > > Are you sure that the incremental routing caused this to be needed ? By > > any chance, are you confusing this with a different patch ? Just want to > > be clear on this... > > > Yes I am sure. Without the new incremental feature every sweep all LFT > tables were set. That sounds like a different bug to me. Yevgeny's patch was for a hang which involved issuing OSM_SIGNAL_DONE_PENDING rather than OSM_SIGNAL_DONE. Is this related to incremental routing ? 
-- Hal From eitan at mellanox.co.il Fri Dec 15 13:26:23 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Fri, 15 Dec 2006 23:26:23 +0200 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to'hang' Message-ID: <6C2C79E72C305246B504CBA17B5500C980BADC@mtlexch01.mtl.com> Hi Hal, Every osm manager (a step in the algorithm) shall return OSM_SIGNAL_DONE_PENDING iff there are outstanding packets on the wire, and OSM_SIGNAL_DONE if there are none. The state manager uses these values to determine whether it needs to wait for all these SMPs to finish or can progress to the next step. This is a quote from osm_ucast_mgr.c: /* For now don't bother checking if the switch forwarding tables actually needed updating. The current code will always update them, and thus leave transactions pending on the wire. Therefore, return OSM_SIGNAL_DONE_PENDING. */ signal = OSM_SIGNAL_DONE_PENDING; This assumption was broken by the change that avoids sending Set(LFT) when the tables did not change. So the osm_state_mgr was stuck at the stage OSM_SM_STATE_SET_UCAST_TABLES_WAIT and never got an OSM_SIGNAL_NO_PENDING_TRANSACTIONS to exit it (since there are no outstanding SMPs).
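The rule described above (report "pending" only when something was actually put on the wire) can be sketched in C roughly as follows. This is a simplified illustration under stated assumptions, not the code from the patch: the sketch_* types and functions and the SIG_* values are hypothetical stand-ins for the OpenSM structures and signals, and the "send" is simulated by copying the new table into place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two OpenSM signal values named above. */
typedef enum { SIG_DONE, SIG_DONE_PENDING } sketch_signal_t;

#define SKETCH_LFT_SIZE 8

/* Hypothetical switch record: the table the switch holds now and the
 * table the routing step has just computed for it. */
typedef struct {
    unsigned char current_lft[SKETCH_LFT_SIZE];
    unsigned char new_lft[SKETCH_LFT_SIZE];
} sketch_switch_t;

/* Issue a (simulated) Set(LFT) only if the table changed.
 * Returns true if a request would now be outstanding. */
static bool sketch_set_lft_if_changed(sketch_switch_t *sw)
{
    if (memcmp(sw->current_lft, sw->new_lft, SKETCH_LFT_SIZE) == 0)
        return false; /* nothing sent, nothing to wait for */
    memcpy(sw->current_lft, sw->new_lft, SKETCH_LFT_SIZE);
    return true;
}

/* Process all switches and report whether the caller has to wait. */
static sketch_signal_t sketch_ucast_process(sketch_switch_t *switches,
                                            size_t count)
{
    bool any_sent = false; /* the boolean flag the patch description mentions */
    size_t i;

    assert(switches != NULL || count == 0);
    for (i = 0; i < count; i++)
        if (sketch_set_lft_if_changed(&switches[i]))
            any_sent = true;

    /* Returning SIG_DONE_PENDING unconditionally here is the hang:
     * the caller would wait for completions that never arrive. */
    return any_sent ? SIG_DONE_PENDING : SIG_DONE;
}
```

Calling sketch_ucast_process twice in a row on unchanged tables yields SIG_DONE the second time, which is the case that used to leave the state manager waiting.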
EZ From halr at voltaire.com Fri Dec 15 13:28:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 16:28:08 -0500 Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_subnet.c: Fix port_profile_switch_nodes comment in opensm.opts Message-ID: <1166218072.32666.1192.camel@hal.voltaire.com> OpenSM/osm_subnet.c: Fix port_profile_switch_nodes comment in opensm.opts Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index c218790..3db4612 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -1137,7 +1137,7 @@ osm_subn_write_conf_file( fprintf( opts_file, "#\n# ROUTING OPTIONS\n#\n" - "# If TRUE do not count switches as link subscriptions\n" + "# If TRUE count switches as link subscriptions\n" "port_profile_switch_nodes %s\n\n", p_opts->port_profile_switch_nodes ?
"TRUE" : "FALSE"); From halr at voltaire.com Fri Dec 15 13:31:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 16:31:57 -0500 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to'hang' In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BADC@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C980BADC@mtlexch01.mtl.com> Message-ID: <1166218316.32666.1349.camel@hal.voltaire.com> Hi Eitan, On Fri, 2006-12-15 at 16:26, Eitan Zahavi wrote: > Hi Hal, > > Every osm manager (step in the algorithm) shall return > OSM_SIGNAL_DONE_PENDING iff there are outstanding packets on the wire. > Or it should return OSM_SIGNAL_DONE if there are none. > The state manager uses there values to determine if it needs to wait for > all these SMPs to finish or > can progress to the next step. > > This is a quote from the osm_ucast_mgr.c: > /* > For now don't bother checking if the switch forwarding tables > actually needed updating. The current code will always update > them, and thus leave transactions pending on the wire. > Therefore, return OSM_SIGNAL_DONE_PENDING. > */ > signal = OSM_SIGNAL_DONE_PENDING; > > This assumption was broken by the change avoiding sending Set(LFT) if > they did not change. > > So the osm_state_mgr was stuck at the stage > OSM_SM_STATE_SET_UCAST_TABLES_WAIT > And never get a OSM_SIGNAL_NO_PENDING_TRANSACTIONS to exit it (since > there are no outstanding SMPs). Got it. Thanks. 
-- Hal From robert.j.woodruff at intel.com Fri Dec 15 14:05:14 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Fri, 15 Dec 2006 14:05:14 -0800 Subject: [openib-general] OpenSM core dump - file size exceeded Message-ID: Hal wrote, >Any idea what filled up the log ? but that's a side issue. Yes we were getting a bunch of multicast errors, Sean is investigating this. >This has been discussed on the list before. This is one option which can >help with this issue: > -L, --log_limit > This option defines maximal log file size in MB. When specified > the log file will be truncated upon reaching this limit. Ok, thanks. woody From swise at opengridcomputing.com Fri Dec 15 14:50:17 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 15 Dec 2006 16:50:17 -0600 Subject: [openib-general] [PATCH] rdma_cm iWARP connection setup timeouts reported as rejects. Message-ID: <20061215225017.22628.17881.stgit@dell3.ogc.int> The IWCM should report timeouts as event RDMA_CM_EVENT_UNREACHABLE, not event RDMA_CM_EVENT_REJECTED.
Signed-off-by: Steve Wise --- drivers/infiniband/core/cma.c | 17 ++++++++++++++--- 1 files changed, 14 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index afd9383..5fdb9df 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1088,10 +1088,21 @@ static int cma_iw_handler(struct iw_cm_i *sin = iw_event->local_addr; sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; *sin = iw_event->remote_addr; - if (iw_event->status) - event.event = RDMA_CM_EVENT_REJECTED; - else + switch (iw_event->status) { + case 0: event.event = RDMA_CM_EVENT_ESTABLISHED; + break; + case -ECONNRESET: + case -ECONNREFUSED: + event.event = RDMA_CM_EVENT_REJECTED; + break; + case -ETIMEDOUT: + event.event = RDMA_CM_EVENT_UNREACHABLE; + break; + default: + event.event = RDMA_CM_EVENT_CONNECT_ERROR; + break; + } break; case IW_CM_EVENT_ESTABLISHED: event.event = RDMA_CM_EVENT_ESTABLISHED; From swise at opengridcomputing.com Fri Dec 15 14:56:06 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 15 Dec 2006 16:56:06 -0600 Subject: [openib-general] [PATCH] librdmacm Pass back the status or errno in RDMA CM events. Message-ID: <20061215225606.22765.18276.stgit@dell3.ogc.int> The librdmacm code isn't passing back the errno in all events. For example, if a connection request times out the kernel CMA will pass up event RDMA_CM_EVENT_UNREACHABLE with the status set to -ETIMEDOUT. This errno isn't currently passed back to the librdmacm user in the event. 
Signed-off-by: Steve Wise --- src/cma.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/src/cma.c b/src/cma.c index 5cffa18..850f381 100644 --- a/src/cma.c +++ b/src/cma.c @@ -1302,6 +1302,7 @@ retry: default: evt->id_priv = (void *) (uintptr_t) resp->uid; evt->event.id = &evt->id_priv->id; + evt->event.status = resp->status; if (evt->id_priv->id.ps == RDMA_PS_TCP) ucma_copy_conn_event(evt, &resp->param.conn); else From sashak at voltaire.com Fri Dec 15 16:37:28 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 16 Dec 2006 02:37:28 +0200 Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to 'hang' In-Reply-To: <1166217285.32666.579.camel@hal.voltaire.com> References: <4581ACE5.9000109@dev.mellanox.co.il> <1166193153.28709.186595.camel@hal.voltaire.com> <4582D588.2070506@mellanox.co.il> <1166208365.28709.195843.camel@hal.voltaire.com> <4582FF9E.3040901@mellanox.co.il> <1166217285.32666.579.camel@hal.voltaire.com> Message-ID: <1166229448.14664.19.camel@localhost> On Fri, 2006-12-15 at 16:14 -0500, Hal Rosenstock wrote: > On Fri, 2006-12-15 at 15:03, Eitan Zahavi wrote: > > Hal Rosenstock wrote: > > > On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote: > > > > > >> Hal Rosenstock wrote: > > >> > > >>> Hi again Yevgeny, > > >>> > > >>> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote: > > >>> > > >>> > > >>>> Hi Hal > > >>>> > > >>>> This patch fixes a bug that caused ucast manager to return > > >>>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions. > > >>>> Added a boolean flag that marks whether there was some change or not > > >>>> (in which case OSM_SIGNAL_DONE should be returned). > > >>>> > > >>>> -- > > >>>> Yevgeny > > >>>> > > >>>> Signed-off-by: Yevgeny Kliteynik > > >>>> > > >>>> > > >>> Good catch! > > >>> > > >>> Thanks. Applied. > > >>> > > >>> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ? 
> > >>> > > >>> > > >> I think OFED 1.1 does not have the "incremental" routing patch. > > >> > > > > > > Right; it doesn't. > > > > > > > > >> So it does not have this bug. > > >> > > > > > > Are you sure that the incremental routing caused this to be needed ? By > > > any chance, are you confusing this with a different patch ? Just want to > > > be clear on this... > > > > > Yes I am sure. Without the new incremental feature every sweep all LFT > > tables were set. > > That sounds like a different bug to me. Yevgeny's patch was for a hang > which involved issuing OSM_SIGNAL_DONE_PENDING rather than > OSM_SIGNAL_DONE. Is this related to incremental routing ? Before this LFT update request was always sent. So yes, it is related. Sasha > > -- Hal > > > EZ > > > -- Hal > > > > > > > > >> EZ > > >> > > >>> -- Hal > > >>> > > >>> > > >>> _______________________________________________ > > >>> openib-general mailing list > > >>> openib-general at openib.org > > >>> http://openib.org/mailman/listinfo/openib-general > > >>> > > >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > >>> > > >>> > > > > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Fri Dec 15 17:32:32 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 15 Dec 2006 20:32:32 -0500 Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_subnet.c: Fix sminfo_polling_timeout comment in opensm.opts Message-ID: <1166232721.32666.12476.camel@hal.voltaire.com> 
OpenSM/osm_subnet.c: Fix sminfo_polling_timeout comment in opensm.opts sminfo_polling_timeout in msecs rather than secs Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c index 3db4612..da82471 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -1175,7 +1175,7 @@ osm_subn_write_conf_file( "sm_priority %u\n\n" "# If TRUE other SMs on the subnet should be ignored\n" "ignore_other_sm %s\n\n" - "# Timeout in [sec] between two polls of active master SM\n" + "# Timeout in [msec] between two polls of active master SM\n" "sminfo_polling_timeout %u\n\n" "# Number of failing polls of remote SM that declares it dead\n" "polling_retry_number %u\n\n" From rdreier at cisco.com Fri Dec 15 20:57:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 15 Dec 2006 20:57:29 -0800 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus A couple of fixes for semi-nasty bugs on 32-bit architectures, plus one small mthca driver update: Leonid Arsh (1): IB/mthca: Add HCA profile module parameters Roland Dreier (3): IB: Fix ib_dma_alloc_coherent() wrapper IB/srp: Fix FMR mapping for 32-bit kernels and addresses above 4G IB/mthca: Use DEFINE_MUTEX() instead of mutex_init() drivers/infiniband/hw/mthca/mthca_main.c | 113 +++++++++++++++++++++++++---- drivers/infiniband/ulp/srp/ib_srp.c | 2 +- drivers/infiniband/ulp/srp/ib_srp.h | 2 +- include/rdma/ib_verbs.h | 9 ++- 4 files changed, 107 insertions(+), 19 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 0491ec7..44bc6cc 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -80,24 +80,61 @@ static int tune_pci = 0; 
module_param(tune_pci, int, 0444); MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero"); -struct mutex mthca_device_mutex; +DEFINE_MUTEX(mthca_device_mutex); + +#define MTHCA_DEFAULT_NUM_QP (1 << 16) +#define MTHCA_DEFAULT_RDB_PER_QP (1 << 2) +#define MTHCA_DEFAULT_NUM_CQ (1 << 16) +#define MTHCA_DEFAULT_NUM_MCG (1 << 13) +#define MTHCA_DEFAULT_NUM_MPT (1 << 17) +#define MTHCA_DEFAULT_NUM_MTT (1 << 20) +#define MTHCA_DEFAULT_NUM_UDAV (1 << 15) +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18) +#define MTHCA_DEFAULT_NUM_UARC_SIZE (1 << 18) + +static struct mthca_profile hca_profile = { + .num_qp = MTHCA_DEFAULT_NUM_QP, + .rdb_per_qp = MTHCA_DEFAULT_RDB_PER_QP, + .num_cq = MTHCA_DEFAULT_NUM_CQ, + .num_mcg = MTHCA_DEFAULT_NUM_MCG, + .num_mpt = MTHCA_DEFAULT_NUM_MPT, + .num_mtt = MTHCA_DEFAULT_NUM_MTT, + .num_udav = MTHCA_DEFAULT_NUM_UDAV, /* Tavor only */ + .fmr_reserved_mtts = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */ + .uarc_size = MTHCA_DEFAULT_NUM_UARC_SIZE, /* Arbel only */ +}; + +module_param_named(num_qp, hca_profile.num_qp, int, 0444); +MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA"); + +module_param_named(rdb_per_qp, hca_profile.rdb_per_qp, int, 0444); +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); + +module_param_named(num_cq, hca_profile.num_cq, int, 0444); +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); + +module_param_named(num_mcg, hca_profile.num_mcg, int, 0444); +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); + +module_param_named(num_mpt, hca_profile.num_mpt, int, 0444); +MODULE_PARM_DESC(num_mpt, + "maximum number of memory protection table entries per HCA"); + +module_param_named(num_mtt, hca_profile.num_mtt, int, 0444); +MODULE_PARM_DESC(num_mtt, + "maximum number of memory translation table segments per HCA"); + +module_param_named(num_udav, hca_profile.num_udav, int, 0444); +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors 
per HCA"); + +module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444); +MODULE_PARM_DESC(fmr_reserved_mtts, + "number of memory translation table segments reserved for FMR"); static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; -static struct mthca_profile default_profile = { - .num_qp = 1 << 16, - .rdb_per_qp = 4, - .num_cq = 1 << 16, - .num_mcg = 1 << 13, - .num_mpt = 1 << 17, - .num_mtt = 1 << 20, - .num_udav = 1 << 15, /* Tavor only */ - .fmr_reserved_mtts = 1 << 18, /* Tavor only */ - .uarc_size = 1 << 18, /* Arbel only */ -}; - static int mthca_tune_pci(struct mthca_dev *mdev) { int cap; @@ -303,7 +340,7 @@ static int mthca_init_tavor(struct mthca_dev *mdev) goto err_disable; } - profile = default_profile; + profile = hca_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.uarc_size = 0; if (mdev->mthca_flags & MTHCA_FLAG_SRQ) @@ -621,7 +658,7 @@ static int mthca_init_arbel(struct mthca_dev *mdev) goto err_stop_fw; } - profile = default_profile; + profile = hca_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.num_udav = 0; if (mdev->mthca_flags & MTHCA_FLAG_SRQ) @@ -1278,11 +1315,55 @@ static struct pci_driver mthca_driver = { .remove = __devexit_p(mthca_remove_one) }; +static void __init __mthca_check_profile_val(const char *name, int *pval, + int pval_default) +{ + /* value must be positive and power of 2 */ + int old_pval = *pval; + + if (old_pval <= 0) + *pval = pval_default; + else + *pval = roundup_pow_of_two(old_pval); + + if (old_pval != *pval) { + printk(KERN_WARNING PFX "Invalid value %d for %s in module parameter.\n", + old_pval, name); + printk(KERN_WARNING PFX "Corrected %s to %d.\n", name, *pval); + } +} + +#define mthca_check_profile_val(name, default) \ + __mthca_check_profile_val(#name, &hca_profile.name, default) + +static void __init mthca_validate_profile(void) +{ + mthca_check_profile_val(num_qp, 
MTHCA_DEFAULT_NUM_QP); + mthca_check_profile_val(rdb_per_qp, MTHCA_DEFAULT_RDB_PER_QP); + mthca_check_profile_val(num_cq, MTHCA_DEFAULT_NUM_CQ); + mthca_check_profile_val(num_mcg, MTHCA_DEFAULT_NUM_MCG); + mthca_check_profile_val(num_mpt, MTHCA_DEFAULT_NUM_MPT); + mthca_check_profile_val(num_mtt, MTHCA_DEFAULT_NUM_MTT); + mthca_check_profile_val(num_udav, MTHCA_DEFAULT_NUM_UDAV); + mthca_check_profile_val(fmr_reserved_mtts, MTHCA_DEFAULT_NUM_RESERVED_MTTS); + + if (hca_profile.fmr_reserved_mtts >= hca_profile.num_mtt) { + printk(KERN_WARNING PFX "Invalid fmr_reserved_mtts module parameter %d.\n", + hca_profile.fmr_reserved_mtts); + printk(KERN_WARNING PFX "(Must be smaller than num_mtt %d)\n", + hca_profile.num_mtt); + hca_profile.fmr_reserved_mtts = hca_profile.num_mtt / 2; + printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n", + hca_profile.fmr_reserved_mtts); + } +} + static int __init mthca_init(void) { int ret; - mutex_init(&mthca_device_mutex); + mthca_validate_profile(); + ret = mthca_catas_init(); if (ret) return ret; diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index e9b6a6f..cdecbf5 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1898,7 +1898,7 @@ static void srp_add_one(struct ib_device *device) */ srp_dev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1); srp_dev->fmr_page_size = 1 << srp_dev->fmr_page_shift; - srp_dev->fmr_page_mask = ~((unsigned long) srp_dev->fmr_page_size - 1); + srp_dev->fmr_page_mask = ~((u64) srp_dev->fmr_page_size - 1); INIT_LIST_HEAD(&srp_dev->dev_list); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 868a540..c217723 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -87,7 +87,7 @@ struct srp_device { struct ib_fmr_pool *fmr_pool; int fmr_page_shift; int fmr_page_size; - unsigned long fmr_page_mask; + u64 fmr_page_mask; }; struct 
srp_host { diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 3c2e105..0bfa332 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1639,7 +1639,14 @@ static inline void *ib_dma_alloc_coherent(struct ib_device *dev, { if (dev->dma_ops) return dev->dma_ops->alloc_coherent(dev, size, dma_handle, flag); - return dma_alloc_coherent(dev->dma_device, size, dma_handle, flag); + else { + dma_addr_t handle; + void *ret; + + ret = dma_alloc_coherent(dev->dma_device, size, &handle, flag); + *dma_handle = handle; + return ret; + } } /** From rdreier at cisco.com Fri Dec 15 21:04:23 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 15 Dec 2006 21:04:23 -0800 Subject: [openib-general] [PATCH v3] IB_mthca HCA profile module parameters In-Reply-To: <457BF221.8080701@voltaire.com> (Moni Shoua's message of "Sun, 10 Dec 2006 13:40:17 +0200") References: <457BF221.8080701@voltaire.com> Message-ID: OK, the patch below is what I ended up committing. I am really not pleased with the patch you sent and expected me to include -- there are really obvious simple-to-fix things that it's just ridiculous for you to be sending, eg: > +MODULE_PARM_DESC(num_mpt, trailing whitespace -- please check that your patch applies with 'git apply --check --whitespace=error-all' > + "maximum number of memory protection pable entries per HCA"); umm, 'pable'?? and plenty of other things... For some reason I felt guilty about letting this patch hang for so long, and so I fixed it up, but after doing it this time, I'm not going to spend my time like that again. I have plenty of work to do without cleaning up other people's messes... IB/mthca: Add HCA profile module parameters Add module parameters that enable settting some of the HCA profile values, such as the number of QPs, CQs, etc. 
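For readers who want to try the parameter checking outside the kernel, the validation done by the patch (non-positive values fall back to the default, other values are rounded up to a power of two, and fmr_reserved_mtts must stay below num_mtt) can be approximated in userspace roughly as follows. The helper names here are made up for the sketch, and roundup_pow2() is only a simple stand-in for the kernel's roundup_pow_of_two().

```c
/*
 * Userspace stand-in for roundup_pow_of_two(); assumes 1 <= v <= 2^30
 * so that the shift cannot overflow an int.
 */
static int roundup_pow2(int v)
{
	int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Value must be positive and a power of 2, otherwise it is corrected. */
static int check_profile_val(int val, int val_default)
{
	if (val <= 0)
		return val_default;
	return roundup_pow2(val);
}

/* fmr_reserved_mtts must be smaller than num_mtt; if not, use half. */
static int clamp_fmr_reserved(int fmr_reserved_mtts, int num_mtt)
{
	if (fmr_reserved_mtts >= num_mtt)
		return num_mtt / 2;
	return fmr_reserved_mtts;
}
```

So a request such as num_qp=1000 would end up as 1024, and num_qp=0 would end up as the default.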
Signed-off-by: Leonid Arsh Signed-off-by: Moni Shoua Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c index 0491ec7..711c1b8 100644 --- a/drivers/infiniband/hw/mthca/mthca_main.c +++ b/drivers/infiniband/hw/mthca/mthca_main.c @@ -82,22 +82,59 @@ MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if n struct mutex mthca_device_mutex; +#define MTHCA_DEFAULT_NUM_QP (1 << 16) +#define MTHCA_DEFAULT_RDB_PER_QP (1 << 2) +#define MTHCA_DEFAULT_NUM_CQ (1 << 16) +#define MTHCA_DEFAULT_NUM_MCG (1 << 13) +#define MTHCA_DEFAULT_NUM_MPT (1 << 17) +#define MTHCA_DEFAULT_NUM_MTT (1 << 20) +#define MTHCA_DEFAULT_NUM_UDAV (1 << 15) +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18) +#define MTHCA_DEFAULT_NUM_UARC_SIZE (1 << 18) + +static struct mthca_profile hca_profile = { + .num_qp = MTHCA_DEFAULT_NUM_QP, + .rdb_per_qp = MTHCA_DEFAULT_RDB_PER_QP, + .num_cq = MTHCA_DEFAULT_NUM_CQ, + .num_mcg = MTHCA_DEFAULT_NUM_MCG, + .num_mpt = MTHCA_DEFAULT_NUM_MPT, + .num_mtt = MTHCA_DEFAULT_NUM_MTT, + .num_udav = MTHCA_DEFAULT_NUM_UDAV, /* Tavor only */ + .fmr_reserved_mtts = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */ + .uarc_size = MTHCA_DEFAULT_NUM_UARC_SIZE, /* Arbel only */ +}; + +module_param_named(num_qp, hca_profile.num_qp, int, 0444); +MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA"); + +module_param_named(rdb_per_qp, hca_profile.rdb_per_qp, int, 0444); +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); + +module_param_named(num_cq, hca_profile.num_cq, int, 0444); +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); + +module_param_named(num_mcg, hca_profile.num_mcg, int, 0444); +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); + +module_param_named(num_mpt, hca_profile.num_mpt, int, 0444); +MODULE_PARM_DESC(num_mpt, + "maximum number of memory protection table entries per HCA"); + +module_param_named(num_mtt, 
hca_profile.num_mtt, int, 0444); +MODULE_PARM_DESC(num_mtt, + "maximum number of memory translation table segments per HCA"); + +module_param_named(num_udav, hca_profile.num_udav, int, 0444); +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA"); + +module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444); +MODULE_PARM_DESC(fmr_reserved_mtts, + "number of memory translation table segments reserved for FMR"); + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; -static struct mthca_profile default_profile = { - .num_qp = 1 << 16, - .rdb_per_qp = 4, - .num_cq = 1 << 16, - .num_mcg = 1 << 13, - .num_mpt = 1 << 17, - .num_mtt = 1 << 20, - .num_udav = 1 << 15, /* Tavor only */ - .fmr_reserved_mtts = 1 << 18, /* Tavor only */ - .uarc_size = 1 << 18, /* Arbel only */ -}; - static int mthca_tune_pci(struct mthca_dev *mdev) { int cap; @@ -303,7 +340,7 @@ static int mthca_init_tavor(struct mthca_dev *mdev) goto err_disable; } - profile = default_profile; + profile = hca_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.uarc_size = 0; if (mdev->mthca_flags & MTHCA_FLAG_SRQ) @@ -621,7 +658,7 @@ static int mthca_init_arbel(struct mthca_dev *mdev) goto err_stop_fw; } - profile = default_profile; + profile = hca_profile; profile.num_uar = dev_lim.uar_size / PAGE_SIZE; profile.num_udav = 0; if (mdev->mthca_flags & MTHCA_FLAG_SRQ) @@ -1278,11 +1315,57 @@ static struct pci_driver mthca_driver = { .remove = __devexit_p(mthca_remove_one) }; +static void __init __mthca_check_profile_val(const char *name, int *pval, + int pval_default) +{ + /* value must be positive and power of 2 */ + int old_pval = *pval; + + if (old_pval <= 0) + *pval = pval_default; + else + *pval = roundup_pow_of_two(old_pval); + + if (old_pval != *pval) { + printk(KERN_WARNING PFX "Invalid value %d for %s in module parameter.\n", + old_pval, name); + printk(KERN_WARNING PFX 
"Corrected %s to %d.\n", name, *pval); + } +} + +#define mthca_check_profile_val(name, default) \ + __mthca_check_profile_val(#name, &hca_profile.name, default) + +static void __init mthca_validate_profile(void) +{ + mthca_check_profile_val(num_qp, MTHCA_DEFAULT_NUM_QP); + mthca_check_profile_val(rdb_per_qp, MTHCA_DEFAULT_RDB_PER_QP); + mthca_check_profile_val(num_cq, MTHCA_DEFAULT_NUM_CQ); + mthca_check_profile_val(num_mcg, MTHCA_DEFAULT_NUM_MCG); + mthca_check_profile_val(num_mpt, MTHCA_DEFAULT_NUM_MPT); + mthca_check_profile_val(num_mtt, MTHCA_DEFAULT_NUM_MTT); + mthca_check_profile_val(num_udav, MTHCA_DEFAULT_NUM_UDAV); + mthca_check_profile_val(fmr_reserved_mtts, MTHCA_DEFAULT_NUM_RESERVED_MTTS); + + if (hca_profile.fmr_reserved_mtts >= hca_profile.num_mtt) { + printk(KERN_WARNING PFX "Invalid fmr_reserved_mtts module parameter %d.\n", + hca_profile.fmr_reserved_mtts); + printk(KERN_WARNING PFX "(Must be smaller than num_mtt %d)\n", + hca_profile.num_mtt); + hca_profile.fmr_reserved_mtts = hca_profile.num_mtt / 2; + printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n", + hca_profile.fmr_reserved_mtts); + } +} + static int __init mthca_init(void) { int ret; mutex_init(&mthca_device_mutex); + + mthca_validate_profile(); + ret = mthca_catas_init(); if (ret) return ret; From jsquyres at cisco.com Fri Dec 15 21:30:55 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Sat, 16 Dec 2006 00:30:55 -0500 Subject: [openib-general] .openfabrics.org names In-Reply-To: <55CE0347B98FCA468923E5FBC25CB4DC4097DD@orsmsx413.amr.corp.intel.com> References: <55CE0347B98FCA468923E5FBC25CB4DC4097DD@orsmsx413.amr.corp.intel.com> Message-ID: I think the question is -- who has the godaddy password? If Michael is the only one who has it, can someone contact him to get it? Jim? On Dec 15, 2006, at 12:17 PM, Ryan, Jim wrote: > Michael has done this in the past but he's on sabbatical and > unavailable > for several weeks. Can someone else do this? 
> > Thanks, Jim > > -----Original Message----- > From: Matt Leininger [mailto:mlleinin at hpcn.ca.sandia.gov] > Sent: Friday, December 15, 2006 9:15 AM > To: Jeff Squyres > Cc: openib; Ryan, Jim; Oros, Michael > Subject: Re: [openib-general] .openfabrics.org names > > On Fri, 2006-12-15 at 08:17 -0500, Jeff Squyres wrote: >> These names still don't appear to exist. Do we know when they'll be >> created? > > Intel controls the openfabrics.org domain name. I think Jim or > Michael can make this happen. > > - Matt > >> >> >> On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote: >> >>> Who controls the DNS for openfabrics.org? Could we get these names >>> created? Or -- are there any objections to creating / using such >>> names? >>> >>> Thanks! >>> >>> >>> On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote: >>> >>>> The name "staging.openfabrics.org" was really intended to be >>>> temporary until the old openfabrics.org was taken offline and >>>> replaced with the new one. >>>> >>>> My $0.02 is that we should stop using staging.openfabrics.org as >>>> soon as possible and create / start using some new names for the >>>> server to allow for potential transparent service relocation > someday. 
>>>> >>>> Here are some new name suggestions that could be done immediately >>>> (with appropriate changes to DNS, apache config, ...and potentially >>>> others): >>>> >>>> * git.openfabrics.org: for all git activity >>>> * wiki.openfabrics.org: a top-level name for the wiki rather than >>>> burying it under several layers of links on the web site >>>> * trac.openfabrics.org: if someone creates this name, I volunteer >>>> to finally get off my butt and install trac to see if people like > it >>>> >>>> These are the old names and would need to be changed in DNS only >>>> when the old server is taken offline / we're ready to move to the >>>> new server: >>>> >>>> * openfabrics.org: redirect to www.openfabrics.org, and for mail >>>> traffic >>>> * www.openfabrics.org: main web site >>>> >>>> -- >>>> Jeff Squyres >>>> Server Virtualization Business Unit >>>> Cisco Systems >>>> >>>> >>> >>> >>> -- >>> Jeff Squyres >>> Server Virtualization Business Unit >>> Cisco Systems >>> >>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >>> openib-general >> >> -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From dotanb at dev.mellanox.co.il Fri Dec 15 23:29:36 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Sat, 16 Dec 2006 09:29:36 +0200 (IST) Subject: [openib-general] can i use the multicast module in user level? In-Reply-To: <4582DECD.70301@ichips.intel.com> References: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il> <4582DECD.70301@ichips.intel.com> Message-ID: <1539.85.65.224.140.1166254176.squirrel@dev.mellanox.co.il> >> I would like to use the multicast module in user level tests (in order >> to >> send a join message to the multicast groups that I'm using). >> >> Can I use the multicast module in user level? 
>> (if the answer is yes, is there any code reference that I can use?) > > Multicast support has only been exposed to userspace through the > librdmacm. > There's a mckey test app that shows how this can be used. > > I will be working on a raw IB multicast / InformInfo userspace support > through > January. There is an older userspace SA library that you might be able to > play > with as well, but you'd have to look back through the mail logs to find > the patches. Thank you very much. I think I will wait until the raw IB multicast support is ready; you have a waiting customer .. ;) thanks Dotan From eitan at mellanox.co.il Sat Dec 16 14:12:28 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 17 Dec 2006 00:12:28 +0200 Subject: [openib-general] [PATCH] osm: fix bugs related to not passing OSM_SIGNAL_DONE_PENDING Message-ID: <45846F4C.4080501@mellanox.co.il> Hi Hal This set of patches fixes cases where OSM_SIGNAL_DONE_PENDING was not passed back to the state manager, which breaks the state machine later in the sweep.
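The underlying idea, that a later stage returning "done" must not overwrite an earlier "done, but transactions are still pending", can be sketched independently of OpenSM as below. The enum and function names are hypothetical and only loosely modelled on OSM_SIGNAL_DONE and OSM_SIGNAL_DONE_PENDING.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two OpenSM signals discussed here. */
enum sm_signal {
	SM_SIGNAL_DONE,
	SM_SIGNAL_DONE_PENDING
};

/* Once any stage reports pending transactions, the result stays pending. */
static enum sm_signal merge_signal(enum sm_signal current,
				   enum sm_signal next)
{
	if (current == SM_SIGNAL_DONE_PENDING ||
	    next == SM_SIGNAL_DONE_PENDING)
		return SM_SIGNAL_DONE_PENDING;
	return SM_SIGNAL_DONE;
}

/* Fold the per-stage results of one sweep into a single signal. */
static enum sm_signal merge_all(const enum sm_signal *results, size_t n)
{
	enum sm_signal signal = SM_SIGNAL_DONE;
	size_t i;

	for (i = 0; i < n; i++)
		signal = merge_signal(signal, results[i]);
	return signal;
}
```

This is the same pattern the osm_state_mgr.c hunk uses with tmp_signal, where the result of osm_qos_setup() can only raise the signal to pending and never lower it.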
Eitan Signed-off-by: Eitan Zahavi osm/opensm/osm_pkey_mgr.c | 112 ++++++++++++++++++++++++++++++++------------ osm/opensm/osm_state_mgr.c | 11 +++-- osm/opensm/osm_ucast_mgr.c | 96 ++++++++++++++++++++++++-------------- 4 files changed, 179 insertions(+), 88 deletions(-) diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 48837bc..a33aec7 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry( /********************************************************************** **********************************************************************/ -static ib_api_status_t +static boolean_t pkey_mgr_enforce_partition( + IN osm_log_t *p_log, IN const osm_req_t *p_req, IN const osm_physp_t *p_physp, IN const boolean_t enforce) @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition( osm_madw_context_t context; uint8_t payload[IB_SMP_DATA_SIZE]; ib_port_info_t *p_pi; + ib_api_status_t status; if (!(p_pi = osm_physp_get_port_info_ptr( p_physp ))) - return IB_ERROR; + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_enforce_partition: ERR 0507: " + "No port info for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } - if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) - return IB_SUCCESS; + if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_enforce_partition: " + "No need to update PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } memset( payload, 0, IB_SMP_DATA_SIZE ); memcpy( payload, p_pi, sizeof(ib_port_info_t) ); @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition( context.pi_context.light_sweep = FALSE; context.pi_context.active_transition = FALSE; - return osm_req_set( p_req, 
osm_physp_get_dr_path_ptr( p_physp ), - payload, sizeof(payload), - IB_MAD_ATTR_PORT_INFO, - cl_hton32( osm_physp_get_port_num( p_physp ) ), - CL_DISP_MSGID_NONE, &context ); + status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), + payload, sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32( osm_physp_get_port_num( p_physp ) ), + CL_DISP_MSGID_NONE, &context ); + if (status != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_enforce_partition: ERR 0520: " + "Failed to set PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } + else + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_enforce_partition: " + "Set PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return TRUE; + } } /********************************************************************** @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port( status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index ); if (status == IB_SUCCESS) - ret_val = TRUE; + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_update_port: " + "Updated " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + ret_val = TRUE; + } else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0506: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p_physp ) ); + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( 
osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } } return ret_val; @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port( uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; + boolean_t port_info_set = FALSE; ib_pkey_table_t empty_block; - + memset(&empty_block, 0, sizeof(ib_pkey_table_t)); p_physp = osm_port_get_default_phys_ptr( p_port ); @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port( enforce = FALSE; } - if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) - { - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0507: " - "pkey_mgr_enforce_partition() failed to update " - "node 0x%016" PRIx64 " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( peer ) ); - } + if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce)) + port_info_set = TRUE; if (enforce == FALSE) - return FALSE; + return port_info_set; p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port( osm_physp_get_port_num( peer ) ); } + if (port_info_set) return TRUE; return ret_val; } @@ -541,10 +593,10 @@ osm_pkey_mgr_process( signal = OSM_SIGNAL_DONE_PENDING; p_node = osm_port_get_parent_node( p_port ); if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) - signal = OSM_SIGNAL_DONE_PENDING; + signal = OSM_SIGNAL_DONE_PENDING; } _err: diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 9eac038..4e61259 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1853,6 +1853,7 @@ osm_state_mgr_process( { ib_api_status_t status; osm_remote_sm_t *p_remote_sm; + osm_signal_t tmp_signal; CL_ASSERT( p_mgr ); @@ -2075,11 +2076,10 
@@ osm_state_mgr_process( case OSM_SIGNAL_CHANGE_DETECTED: /* * Nothing to do here. One subnet change typcially - * begets another.... + * begets another.... But needs to wait for all transactions */ signal = OSM_SIGNAL_NONE; - break; - + break; case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: /* * A change was detected on the subnet. @@ -2219,7 +2219,10 @@ osm_state_mgr_process( signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); /* the returned signal is always DONE */ - signal = osm_qos_setup(p_mgr->p_subn->p_osm); + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); + + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) + signal = OSM_SIGNAL_DONE_PENDING; /* try to restore SA DB (this should be before lid_mgr because we may want to disable clients reregistration diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index e977253..39973de 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table( ib_switch_info_t si; uint32_t block_id_ho = 0; uint8_t block[IB_SMP_DATA_SIZE]; + boolean_t set_swinfo_require = FALSE; + uint16_t lin_top; + uint8_t life_state; CL_ASSERT( p_mgr ); @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table( Set the top of the unicast forwarding table. */ si = *osm_switch_get_si_ptr( p_sw ); - si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); + lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); + if (si.lin_top != lin_top) + { + set_swinfo_require = TRUE; + si.lin_top = lin_top; + } /* check to see if the change state bit is on. If it is - then we need to clear it. 
*/ - if( ib_switch_info_get_state_change( &si ) ) - si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) - | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; + if ( ib_switch_info_get_state_change( &si ) ) + life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) + | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; else - si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; + life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (life_state != si.life_state) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_set_fwd_table: " - "Setting switch FT top to LID 0x%X\n", - osm_switch_get_max_lid_ho( p_sw ) ); + set_swinfo_require = TRUE; + si.life_state = life_state; } - - context.si_context.light_sweep = FALSE; - context.si_context.node_guid = osm_node_get_node_guid( p_node ); - context.si_context.set_method = TRUE; - - status = osm_req_set( p_mgr->p_req, - p_path, - (uint8_t*)&si, - sizeof(si), - IB_MAD_ATTR_SWITCH_INFO, - 0, - CL_DISP_MSGID_NONE, - &context ); - - if( status != IB_SUCCESS ) + + if ( set_swinfo_require ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_ucast_mgr_set_fwd_table: ERR 3A06: " - "Sending SwitchInfo attribute failed (%s)\n", - ib_get_err_str( status ) ); + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_set_fwd_table: " + "Setting switch FT top to LID 0x%X\n", + osm_switch_get_max_lid_ho( p_sw ) ); + } + + context.si_context.light_sweep = FALSE; + context.si_context.node_guid = osm_node_get_node_guid( p_node ); + context.si_context.set_method = TRUE; + + status = osm_req_set( p_mgr->p_req, + p_path, + (uint8_t*)&si, + sizeof(si), + IB_MAD_ATTR_SWITCH_INFO, + 0, + CL_DISP_MSGID_NONE, + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_ucast_mgr_set_fwd_table: ERR 3A06: " + "Sending SwitchInfo attribute failed (%s)\n", + ib_get_err_str( 
status ) ); + } + else + p_mgr->any_change = TRUE; } /* @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process( CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + p_mgr->any_change = FALSE; + /* If there are no switches in the subnet, we are done. */ if (cl_qmap_count( p_sw_guid_tbl ) == 0) goto Exit; - p_mgr->any_change = FALSE; cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL); if (!p_routing_eng->build_lid_matrices || @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process( if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) __osm_ucast_mgr_dump_tables( p_mgr ); - if (p_mgr->any_change) + if (p_mgr->any_change) + { signal = OSM_SIGNAL_DONE_PENDING; + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_ucast_mgr_process: " + "LFT Tables configured on all switches\n"); + } else + { + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_ucast_mgr_process: " + "No need to set any LFT Tables on all switches\n"); signal = OSM_SIGNAL_DONE; - - osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_ucast_mgr_process: " - "LFT Tables configured on all switches\n"); + } Exit: CL_PLOCK_RELEASE( p_mgr->p_lock ); From eitan at mellanox.co.il Sat Dec 16 10:56:39 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 16 Dec 2006 20:56:39 +0200 Subject: [openib-general] [PATCH] osm: fix a bug in ignroing pending transaction of Light Sweep Message-ID: <45844167.9060302@mellanox.co.il> Hi Hal This patch provides fixes an issue discovered by the nightly regression. OpenSM state machine got stack due to pending SwitchInfo transaction being ignored since one of the queries for SwitchInfo failed (due to bad-link). The patch below simply avoids aborting the wait for all SwitchInfo requests to return. I think this issue might have hurt us in other situations too sine it aborted the wait on "CHANGE DETECTED" too. CHANGE_DETECTED is fired on the first switch that reported "Change Bit". It is possible that the issue is showing up as we added incremental support (e.g. 
for routing) Since only of there are no other SMP's sent during the heavy sweep we will get the "NO_PENDING_TRANSACTIONS" signal caused by the SwitchInfo requests Eitan Signed-off-by: Eitan Zahavi osm/opensm/osm_state_mgr.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 9eac038..91d9dbd 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -2075,11 +2075,10 @@ osm_state_mgr_process( case OSM_SIGNAL_CHANGE_DETECTED: /* * Nothing to do here. One subnet change typcially - * begets another.... + * begets another.... But needs to wait for all transactions to + * complete */ - signal = OSM_SIGNAL_NONE; break; - case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: /* * A change was detected on the subnet. From mst at mellanox.co.il Sat Dec 16 09:03:28 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sat, 16 Dec 2006 19:03:28 +0200 Subject: [openib-general] Fw: openib-general Digest, Vol 30, Issue 135 In-Reply-To: References: Message-ID: <20061216170328.GB24716@mellanox.co.il> First, please avoid answering the digest. This breaks threading in most mailers. > > > > > > > > Tried this patch, it didn't work on ehca. I couldn't change the mode from > > > > datagram to connected from /sys/class. > > > > > > It's wroking as designed in that respect. ehca does not implement > > > srq - without > > > srq, there is no way to prepost receive buffers for a resonable number of > > > connections without running out of memory. > > > > > > So it is falling back on datagram mode. > > > Talk to ehca guys to implement srq and connected mode will be enabled. > > Don't remember SRQ is a MUST for UC mode. Does this patch support > > devices with SRQ in RC mode? > > I don't think the IB HCA Spec requires SRQ support for RC but is an optional > feature. 
There are two adapters right now that don't support SRQ which means to
> use IPoIB-CM on them you should make the use of SRQ an option setting.

No, adding such "drink up all memory on real clusters but run well on a
back to back benchmark platform" option does not seem like a good idea to me.
Rather, we should use UD mode to keep IPoIB scalable on all hardware.

> I agree
> that if it is available it should be used for scaling issues probably if
> available automatically set. But I would like to see us at least support the
> current hardware that meets the current SPEC.

SRQ support is clearly optional. But neither is IPoIB CM support a required
feature. Current code will fall back to datagram mode when SRQ is not
supported, and since UD support is not optional, all current hardware is
still supported with IPoIB - this patch does not break this.

-- 
MST

From mst at mellanox.co.il Sat Dec 16 08:47:09 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 16 Dec 2006 18:47:09 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To:
References: <20061215051438.GH19449@mellanox.co.il>
Message-ID: <20061216164709.GA24716@mellanox.co.il>

> > > Hi, Michael,
> > >
> > > Tried this patch, it didn't work on ehca. I couldn't change the mode from
> > > datagram to connected from /sys/class.
> >
> > It's wroking as designed in that respect. ehca does not implement
> > srq - without
> > srq, there is no way to prepost receive buffers for a resonable number of
> > connections without running out of memory.
> >
> > So it is falling back on datagram mode.
> > Talk to ehca guys to implement srq and connected mode will be enabled.
>
> Don't remember SRQ is a MUST for UC mode. Does this patch support devices with
> SRQ in RC mode?

Yes. Only RC mode is supported by this patch.

From what you say I am guessing that SRQ is supported by ehca HW but support
is currently lacking in the ehca driver?
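
[Editor's note] The memory argument for requiring SRQ can be made concrete with a little arithmetic. The sketch below is only an illustration: the 64 KB buffer size and the figure of 1000 peers are assumptions picked for the example, the queue depth of 128 merely echoes the ib_ipoib recv_queue_size value listed elsewhere in this archive, and the function names are hypothetical, not taken from the IPoIB CM patch.

```c
#include <assert.h>
#include <stdint.h>

/* Receive memory needed when every connection preposts its own buffers
 * (no SRQ). All three inputs are illustrative assumptions. */
static uint64_t recv_bytes_per_conn_queues(uint64_t connections,
                                           uint64_t depth,
                                           uint64_t buf_bytes)
{
    assert(depth > 0 && buf_bytes > 0);
    return connections * depth * buf_bytes;
}

/* With an SRQ, a single pool of buffers is shared by all connections,
 * so the cost no longer scales with the number of peers. */
static uint64_t recv_bytes_shared_queue(uint64_t depth, uint64_t buf_bytes)
{
    assert(depth > 0 && buf_bytes > 0);
    return depth * buf_bytes;
}
```

Under these assumed numbers, 1000 connections with 128 buffers of 64 KB each need roughly 8.4 GB of preposted receive memory, while a shared queue of the same depth needs about 8.4 MB.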
> > > And when unloading ib_ipoib module, all the connections to that node gone,
> > > rmmod ib_ipoib hung. The kernel is 2.6.19.
> > >
> > Probably a bug in error handling somewhere.
> > Post the sysrq t trace and I'll take a look.
>
> I will recreate the problem and post stack trace later.
>
> Thanks
> Shirley Ma

-- 
MST

From dotanb at dev.mellanox.co.il Sun Dec 17 02:03:40 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 17 Dec 2006 12:03:40 +0200
Subject: [openib-general] what should happen in a completion event channel is being destroyed when there are several CQs associated to it?
In-Reply-To: <21986.194.90.237.34.1163686615.squirrel@dev.mellanox.co.il>
References: <4553480F.80000@dev.mellanox.co.il> <21986.194.90.237.34.1163686615.squirrel@dev.mellanox.co.il>
Message-ID: <458515FC.5050900@dev.mellanox.co.il>

Hi Roland.

dotanb at dev.mellanox.co.il wrote:
> Hi roland.
>
>
>> > What should happen in a completion event channel is being destroyed
>> > when there are several CQs associated to it?
>> > Should this operation fail (return EBUSY)?
>>
>> I think that would be the most consistent thing, since we return EBUSY
>> for example if a CQ is destroyed with QPs still attached.
>>
>> > When i tried to do it and later on try to wait for a completion on
>> > this event channel i got seg fault...
>>
>> Does the destroy succeed?
>>
>> Anyway I'll look at this code to see if it seems OK.
>>
>> - R.
>>
>>
> I'm writing the man pages to this verb, so which behaviour should i write
> the current behaviour or the future behaviour?
>
> for now, i'm writing the current behaviour.
>
> thanks
> Dotan
>
Is there any update on this issue?
(if the answer is no, do you plan to change this behavior?)
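
[Editor's note] The behaviour Roland calls "the most consistent thing" (refuse to destroy the channel with EBUSY while CQs are still attached) can be modelled with a simple reference count. The sketch below is a toy model only: the toy_* types and functions are hypothetical stand-ins and are not the libibverbs structures or its actual implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the libibverbs structures. */
struct toy_channel {
    int attached_cqs;   /* number of CQs still using this channel */
    int destroyed;
};

struct toy_cq {
    struct toy_channel *channel;
};

/* Attach a new CQ to a live channel. */
static void toy_create_cq(struct toy_cq *cq, struct toy_channel *ch)
{
    assert(ch != NULL && !ch->destroyed);
    cq->channel = ch;
    ch->attached_cqs++;
}

/* Destroying a CQ drops its reference on the channel. */
static int toy_destroy_cq(struct toy_cq *cq)
{
    if (cq->channel != NULL) {
        cq->channel->attached_cqs--;
        cq->channel = NULL;
    }
    return 0;
}

/* Returns 0 on success, EBUSY while any CQ still references the channel. */
static int toy_destroy_channel(struct toy_channel *ch)
{
    if (ch->attached_cqs > 0)
        return EBUSY;
    ch->destroyed = 1;
    return 0;
}
```

With such a check in place, a consumer that destroys the channel too early gets an error back instead of a later segfault while waiting on a dangling channel.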
thanks
Dotan

From tziporet at dev.mellanox.co.il Sun Dec 17 02:50:55 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 17 Dec 2006 12:50:55 +0200
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm interface to userspace
In-Reply-To: <4582DD77.8090208@ichips.intel.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> <45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com> <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com> <4581C4B5.5020702@ichips.intel.com> <15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com> <4582DD77.8090208@ichips.intel.com>
Message-ID: <4585210F.1050106@dev.mellanox.co.il>

Sean Hefty wrote:
>> cool, before sending the orig email i was looking on both Arlin git
>> tree at ofa staging and the svn and the code that uses this calls are
>> still there, so were are the updated udapl sources?
>>
>
> Arlin's DAPL tree has an rdma_ucm branch that should match.
>
> - Sean
>
>
>
Arlin and Sean,

Can you make sure that the code going into OFED 1.2 will be in the place
where we take our daily build from:
librdmacm_git="git://staging.openfabrics.org/~shefty/librdmacm.git"
dapl_git="git://staging.openfabrics.org/~ardavis/dapl.git"

Thanks,
Tziporet

From dotanb at dev.mellanox.co.il Sun Dec 17 04:24:42 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 17 Dec 2006 14:24:42 +0200
Subject: [openib-general] Different low level drivers returns different return values incase of an error
Message-ID: <4585370A.5000707@dev.mellanox.co.il>

Hi.

I noticed that low level drivers from different vendors don't act the
same when there is an error.

For example:
when ibv_post_send fails, libmthca returns -1
when ibv_post_send fails, libehca returns -(errno value), such as:
-EINVAL, -ENOMEM
(I didn't check the code of ipath)

I wrote the man pages for libibverbs (which I hope will be committed
soon), and tried to describe the return values of the verbs.
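
[Editor's note] Until the drivers agree on one convention, an application has to cope with both of the styles described above. The helper below is a minimal sketch of that, not part of libibverbs: verb_error_code() is a hypothetical name, and it assumes that a driver returning -1 leaves the reason in errno, which the thread does not confirm. The fallback to EIO when no reason is available is an arbitrary choice for the example.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map a verb's int return value to a positive errno-style code.
 * Handles the two conventions reported in this thread:
 *   -1 (reason, if any, assumed to be in errno), and
 *   -(errno value) returned directly.
 * saved_errno is the value of errno captured right after the call.
 */
static int verb_error_code(int ret, int saved_errno)
{
    assert(saved_errno >= 0);
    if (ret == 0)
        return 0;                               /* success */
    if (ret == -1)
        return saved_errno ? saved_errno : EIO; /* no detail available */
    return ret < 0 ? -ret : ret;                /* -(errno) or +errno */
}
```

Note that a bare -1 cannot be told apart from -EPERM on systems where EPERM is 1, which is one more reason to settle on a single convention in the library itself.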
I don't think that the description (behavior) of a verb should depend on
the HW which is being used.

If we are going to change the return values to a common behavior, I
suggest using a way which will give more information to the user of the
verbs (create IB oriented errno values(?)), or another method which will
give the user a hint of the problem. For example: when the user tries to
modify a QP with a bad value, there is an EINVAL return value for all of
the values that he tries to modify ...

What do you think?
Dotan

From sashak at voltaire.com Sun Dec 17 04:50:52 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 17 Dec 2006 14:50:52 +0200
Subject: [openib-general] [PATCH TRIVIAL] opensm: better log message.
Message-ID: <20061217125052.GA2521@sashak.voltaire.com>

Better log message for mcrecord dumping in __osm_mcmr_rcv_leave_mgrp().

Signed-off-by: Sasha Khapyorsky
---
 osm/opensm/osm_sa_mcmember_record.c | 5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 31d1fb5..3fec8b9 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1418,8 +1418,11 @@ __osm_mcmr_rcv_leave_mgrp(
 
   mcmember_rec = *p_recvd_mcmember_rec;
 
-  if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+  if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_mcmr_rcv_leave_mgrp: Dump of record\n" );
     osm_dump_mc_record( p_rcv->p_log, &mcmember_rec, OSM_LOG_DEBUG );
+  }
 
   CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock);
   status = __get_mgrp_by_mgid(p_rcv,p_recvd_mcmember_rec, &p_mgrp);
-- 
1.4.4.1.gbfd3

From sashak at voltaire.com Sun Dec 17 04:52:30 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 17 Dec 2006 14:52:30 +0200
Subject: [openib-general] [PATCH] opensm: sa mcmember_rec leave locking
Message-ID: <20061217125230.GB2521@sashak.voltaire.com>

Hold locked multicast group
leave request (MCMember Record) processing. This prevents kind of race with multicast group join request where those requests can be reordered during processing. Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_sa_mcmember_record.c | 2 +- osm/opensm/osm_sm.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index 3fec8b9..382dcff 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -1471,7 +1471,7 @@ __osm_mcmr_rcv_leave_mgrp( mcmember_rec.scope_state = p_mcm_port->scope_state; /* OK we can leave */ - CL_PLOCK_RELEASE( p_rcv->p_lock ); + /* note: osm_sm_mcgrp_leave() will release p_rcv->p_lock */ status = osm_sm_mcgrp_leave(p_rcv->p_sm, mlid, portguid); if(status != IB_SUCCESS) diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c index 70c3584..71fd847 100644 --- a/osm/opensm/osm_sm.c +++ b/osm/opensm/osm_sm.c @@ -776,7 +776,8 @@ osm_sm_mcgrp_leave( /* * Acquire the port object for the port leaving this group. */ - CL_PLOCK_EXCL_ACQUIRE( p_sm->p_lock ); + /* note: p_sm->p_lock is locked by caller, but will be released later + this function */ p_port = ( osm_port_t * ) cl_qmap_get( &p_sm->p_subn->port_guid_tbl, port_guid ); if( p_port == -- 1.4.4.1.gbfd3 From glebn at voltaire.com Sun Dec 17 07:42:41 2006 From: glebn at voltaire.com (glebn at voltaire.com) Date: Sun, 17 Dec 2006 17:42:41 +0200 Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race In-Reply-To: References: Message-ID: <20061217154241.GD11360@minantech.com> On Wed, Dec 13, 2006 at 11:41:41PM -0800, Roland Dreier wrote: > Are there other possible ordering problems involving user memory (not > in a CQ or QP)? 
Something like a CPU on node A writing to memory on > node B and then posting a work request that makes the HCA DMA from > that memory on node B, and having the work request doorbell reach the > HCA before the write to node B actually happens, so the HCA DMAs the > old contents of node B's memory? > > I guess the only feasible solution to the problem you're pointing out > is to have libmthca use some special mmap()-based allocator for queues > so that the kernel can give it memory that has the special > dma_map_consistent treatment. Do you think this should be part of mthca or some general framework like uio which allows writing driver in userspace? Also another solution could be to do something similar to ehca. It allocates QP and CQ in the kernel and maps them into process address space. -- Gleb. From kliteyn at dev.mellanox.co.il Sun Dec 17 23:30:13 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 18 Dec 2006 09:30:13 +0200 Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [2/2] In-Reply-To: <1166196836.28709.188922.camel@hal.voltaire.com> References: <4581DDFF.2000903@dev.mellanox.co.il> <1166196836.28709.188922.camel@hal.voltaire.com> Message-ID: <45864385.1040105@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Thu, 2006-12-14 at 18:27, Yevgeny Kliteynik wrote: >> Hi Hal >> >> This patch (2/2) adds Fat Tree routing engine to OpenSM. > > Thanks! Applied. > > I played with it a little and will look more at it going forward. > > A couple of questions: > > Is this algorithm currently considered experimental ? I wouldn't say that it's experimental. It's not perfect - there are still things to improve to make it more efficient, but the routing itself will remain intact. > Are there any simulator tests/regressions for this ? There is a bunch of simulation tests for this engine, but they're not integrated into the nightly simulation regression yet. It's on my to-do list. 
> Also, could you or Eitan update doc/current-routing.txt with a > description of the fat tree algorithm and send that patch to me ? Sure. -- Yevgeny > -- Hal > From devesh28 at gmail.com Mon Dec 18 00:17:23 2006 From: devesh28 at gmail.com (Devesh Sharma) Date: Mon, 18 Dec 2006 13:47:23 +0530 Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver In-Reply-To: <1166104604.28709.126501.camel@hal.voltaire.com> References: <2875.47466.qm@web8317.mail.in.yahoo.com> <1166104604.28709.126501.camel@hal.voltaire.com> Message-ID: <309a667c0612180017g44d9be7dn9cb00dffaa081dd3@mail.gmail.com> On similar lines I have a confusion about the mad agent creation:- there is a function in mad.c ib_agent_port_open() which creates _send_only_ SMAs for GSI and SMI per port. There is a function in mthca_mad.c mthca_create_agents() which is _again_ createing two send only mad agents for SMI and GSI. Why this driver specific agent creation is required? On 14 Dec 2006 08:57:11 -0500, Hal Rosenstock wrote: > On Wed, 2006-12-13 at 22:49, keshetti mahesh wrote: > > thanks for your reply, > > > > >The driver is needed to obtain the information for the IB node to > > fill > > >in the MADs for response to the SMA query. It may also issue some > > traps. > > >Similarly for PMA as well. > > > > Do u mean to say that HCA driver is needed to pass the HCA related > > information (like GID, GUID, port_info etc..) to the SMA so that it > > can reply to query(or GET ) MADs. > > Yes. > > > Isn't SMA capable of doing the same by using "query_(gid, pkey, > > port)" verbs. > > One reason I can think of is that not all the needed information is > available via verbs. I think there are some others as well. > > > And final questions if it is really required to implement > > 'process_mad' in HCA driver then why it is not specified in the IB > > specifications. > > IB spec is architecture not implementation. 
> > > Whose duty is this (replying to query MADs) according to the IB > > psec.s(its duty of SMA right?) > > Depends on the MAD but if you are referring to the SMA queries, then yes > it is the SMA's responsibility. > > > I have observed that process_mad is not implemented in the IBM's eHCA > > driver. what is the case with it? > > With eHCA, QP0 is not exposed to the host (at least currently) and the > SMA is totally implemented in firmware. > > > PS: I am considering only SMA in the host s/w here. > > This is a design choice. > > -- Hal > > > regards, > > K.Mahesh. > > > > > > > > > > Hal Rosenstock wrote: > > On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote: > > > Hello all, > > > > > > I want to know from u people that isi it necessary to > > implement the > > > process_mad for a HCA. > > > > > > After looking into the implementations of process_mad in > > ipath and > > > mthca drivers i have fount that they are used to reply the > > MADs with > > > port_info,gid_info,sm_info etc.. > > > > > > But isn't it handled by SMA in the host...... > > > > The SMA can either be in the host on in firmware (as is > > typical with the > > Mellanox silicon). > > > > > i am little bit confused now . > > > please just whether it is required to implement process_mad > > (suppose) > > > for new HCA driver.... > > > > It is. For an example of a host (software SMA), see > > drivers/infiniband/hw/ipath/ipath_mad.c > > > > > if it is required why? > > > > The driver is needed to obtain the information for the IB node > > to fill > > in the MADs for response to the SMA query. It may also issue > > some traps. > > Similarly for PMA as well. > > > > -- Hal > > > > > Please CC your replies to me. > > > > > > regards, > > > K.Mahesh. > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________________ > > > Find out what India is talking about on - Yahoo! Answers > > India > > > Send FREE SMS to your friend's mobile from Yahoo! 
Messenger > > Version 8. > > > Get it NOW > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > ______________________________________________________________________ > > Find out what India is talking about on - Yahoo! Answers India > > Send FREE SMS to your friend's mobile from Yahoo! Messenger Version 8. > > Get it NOW > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From HNGUYEN at de.ibm.com Mon Dec 18 01:22:29 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Mon, 18 Dec 2006 10:22:29 +0100 Subject: [openib-general] Different low level drivers returns different return values incase of an error In-Reply-To: <4585370A.5000707@dev.mellanox.co.il> Message-ID: Hi Dotan! > I noticed that low level drivers from different vendors don't act the > same when there is an error. > For example: > when ibv_post_send fails, libmthca returns -1 > when ibv_post_send fails, libehca returns -(errno value), such as: > -EINVAL, -ENOMEM > (i didn't check the code of ipath) > > I wrote the man pages to the libibverbs (that i hope, soon will be > committed), and tried to describe the > return values of the verbs. > > I don't think that the description(behavior) of the verb need to be > according to the HW which is being used .. 
>
> If we are going to change to change the return values to a common
> behavior i suggest to use a way
> which will give more information to the user that uses the verbs (create
> IB oriented errno values(?)), or
> another method which will give the user a hint of the problem. for
> example: when the user try to modify a QP
> with a bad value there is an EINVAL return value for all of the values
> that he tries to modify ...
>
> What do you think?

Good point. I can speak for ehca only. We prefer to reuse existing
errno values rather than define new ones, since it is also a question of
how much information we want to give the consumer in case of an error
and what it can do with it. To me the defined errno values give enough
information to the caller. Anyway, we should use the same error
codes for both kernel and user space verbs.

Regards
Nam

From mst at mellanox.co.il Mon Dec 18 01:41:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 11:41:22 +0200
Subject: [openib-general] Different low level drivers returns different return values incase of an error
In-Reply-To:
References: <4585370A.5000707@dev.mellanox.co.il>
Message-ID: <20061218094122.GA3169@mellanox.co.il>

> Anyway we should use same error
> codes for both kernel and user space verbs.

This actually does not sound like a good idea.
In particular, returning negative values, or encoding them in pointers
by means of PTR_ERR, is standard in the Linux kernel but seems quite
nonstandard for a userspace library.

-- 
MST

From HNGUYEN at de.ibm.com Mon Dec 18 01:49:41 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 18 Dec 2006 10:49:41 +0100
Subject: [openib-general] Different low level drivers returns different return values incase of an error
In-Reply-To: <20061218094122.GA3169@mellanox.co.il>
Message-ID:

Hi Michael!
> > Anyway we should use same error
> > codes for both kernel and user space verbs.
>
> This actually does not sound like a good idea.
> In particular returning - values, or incoding them in pointers
> by means of PTR_ERR is the standard in linux kernel but seems quite
> nonstandard for a userspace library.

Oops, you're right in that case. I overlooked that.
Thx
Nam

From philippe_bernadat at hp.com Mon Dec 18 03:18:44 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Mon, 18 Dec 2006 12:18:44 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <1166210069.28709.196688.camel@hal.voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>

I think I am going to need more help here.

I did use both tricks, opensm enable_quirks TRUE & rdma_cm tavor_quirk=1.
This seems to have no effect. But I may be doing something wrong.

So some questions I have:

1) Doc (sdp_release_notes.txt, see below) says we can use either of
   the two tricks, is that really the case?

2) I usually don't run opensm (not required for me till now) and
   I am not too familiar with it. But I did, so that I could
   try the enable_quirks TRUE quirk option.
   Does opensm run in the background? When I run it, it never returns;
   the last messages are:

>>>> -------------------------------------------------
>>>> OpenSM Rev:openib-2.0.5
>>>> Based on OpenIB svn Exported revision
>>>> Using Cached Option:guid = 0x0008f10403961e4d
>>>> Using Cached Option:log_flags = 3
>>>> Using Cached Option:enable_quirks = TRUE
>>>> Command Line Arguments:
>>>> Log File: /var/log/osm.log
>>>> -------------------------------------------------
>>>> OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision
>>>>
>>>> Entering STANDBY state

3) Is there a way to change the MTU from within the lustre LND kernel
   module? I saw that the IB perf programs did this with the
   modify_qp() APIs.

4) And by the way, I can confirm that the MTU is the issue.
   Forcing it to 2K with the ib_witre_perf test also degrades
   performance.

Extract from sdp_release_notes.txt

- By default, SDP utilizes a 2 Kbyte MTU size.
This may cause PCI-X cards using Mellanox Technologies "Infinihost" HCAs
to experience low bandwidth.

Workaround: reset the MTU size to 1K in this situation, using either of
the two methods below:

1. Activate the "tavor quirk" workaround in opensm:
   a. Create an opensm options cache file (/var/cache/osm/opensm.opts):
      > opensm --cache-options -o
   b. Add the following line to /var/cache/osm/opensm.opts:
      enable_quirks TRUE
   c. Rerun opensm using your usual command line options to activate
      the opensm quirk option.
2. Activate the "tavor quirk" workaround in cma: set the tavor_quirk
   module parameter of the rdma_cm module to value 1 (default: 0).

Philippe

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Friday, December 15, 2006 8:15 PM
> To: Eitan Zahavi
> Cc: Matt L. Leininger; Roland Dreier; Bernadat, Philippe;
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with
> OFED v. Voltaire(lustre)
>
> On Fri, 2006-12-15 at 12:20, Eitan Zahavi wrote:
> > Matt Leininger wrote:
> > > On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
> > >
> > >> I also looked at the HCA counters, and I indeed think
> > >> there is something wrong about the MTU:
> > >>
> > >> For the same test
> > >>
> > >> With VIB
> > >>
> > >> PortXmitData: 2684490382
> > >> PortRcvData: 1750145
> > >> PortXmitPkts: 10280007
> > >> PortRcvPkts: 49962
> > >>
> > >> With OFED
> > >>
> > >> XmtBytes:........................2653730483
> > >> RcvBytes:........................1710541
> > >> XmtPkts:.........................5160009
> > >> RcvPkts:.........................50012
> > >>
> > >> Which means we sent half less packets with OFED
> > >> and if you do the math it is 2K packets with OFED (counters are 32bit units)
> > >> and 1K packets with VIB.
> > >>
> > >> So fo some reason the tavor_quirk param is ignored/overwriten.
> > >> Is there an interface to control this ?
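
[Editor's note] The "do the math" step in the quoted counters can be written out explicitly. The sketch below takes the poster's statement that the data counters are in 32-bit units at face value; avg_packet_bytes() is a hypothetical helper name used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Average bytes per packet, given a data counter in 32-bit (4-byte)
 * units and the matching packet counter. */
static double avg_packet_bytes(uint64_t data_words, uint64_t packets)
{
    assert(packets > 0);
    return (double)(data_words * 4u) / (double)packets;
}
```

Applied to the transmit counters above, avg_packet_bytes(2684490382, 10280007) gives about 1045 bytes per packet for VIB and avg_packet_bytes(2653730483, 5160009) gives about 2057 bytes for OFED. That fits a 1K MTU in the first case and a 2K MTU in the second, with the small excess presumably coming from per-packet overhead included in the counters.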
> > >> > > > > > > Michael said you have to turn on this feature in > OpenSM. From the > > > release notes I'm not sure how you turn it on in OpenSM. > You did turn > > > on the tavor mtu work around in the rdma_cm, but did you > turn it on in > > > OpenSM? Also what version of OpenSM are you running? > > > > > To turn this option on in opensm you need to: > > 1. Run: opensm -c -o > > If you already have an opensm.opts file then you can skip this step. > > -- Hal > > > 2. Modify the file /var/cache/osm/opensm.opts by changing > the line below > > enable_quirks FALSE > > to > > enable_quirks TRUE > > > > 3. Run: opensm > > > Thanks, > > > > > > - Matt > > > > > > > > >> Philippe > > >> > > >> > > >>> -----Original Message----- > > >>> From: Bernadat, Philippe > > >>> Sent: Friday, December 15, 2006 8:59 AM > > >>> To: Michael S. Tsirkin; Roland Dreier > > >>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org > > >>> Subject: RE: Performance Degradation with OFED v. > Voltaire (lustre) > > >>> > > >>> I have set tavor_quirk to 1 with no effect. 
> > >>> Another thing I have tried is the same lustre > > >>> LNET echo test with a single thread (vs 8) > > >>> > > >>> VIB: 400 MB/s > > >>> OFED-1.1: 333 MB/s > > >>> > > >>> I am posting the live param values for all infiniband > > >>> modules in case someone could identify some wrong setting: > > >>> > > >>> infiniband/core/ib_cm > > >>> > > >>> mra_timeout_limit 30000 > > >>> > > >>> infiniband/core/rdma_cm > > >>> > > >>> max_cm_retries 15 > > >>> tavor_quirk 1 > > >>> > > >>> infiniband/hw/ipath/ib_ipath > > >>> > > >>> cfgports 0 > > >>> debug 1 > > >>> disable_sma 0 > > >>> kpiobufs 0 > > >>> lkey_table_size 12 > > >>> max_ahs 65535 > > >>> max_cqes 196607 > > >>> max_cqs 131071 > > >>> max_mcast_grps 16384 > > >>> max_mcast_qp_attached 16 > > >>> max_pds 65535 > > >>> max_qps 16384 > > >>> max_qp_wrs 16383 > > >>> max_sges 96 > > >>> max_srqs 1024 > > >>> max_srq_sges 128 > > >>> max_srq_wrs 131071 > > >>> qp_table_size 251 > > >>> > > >>> infiniband/hw/mthca/ib_mthca > > >>> > > >>> catas_reset_disable 0 > > >>> debug_level 0 > > >>> fmr_reserved_mtts 262144 > > >>> fw_cmd_doorbell 0 > > >>> msi 0 > > >>> msi_x 1 > > >>> num_cq 65536 > > >>> num_mcg 8192 > > >>> num_mpt 131072 > > >>> num_mtt 1048576 > > >>> num_qp 65536 > > >>> num_udav 32768 > > >>> rdb_per_qp 4 > > >>> tune_pci 1 > > >>> > > >>> infiniband/ulp/ipoib/ib_ipoib > > >>> > > >>> debug_level 0 > > >>> mcast_debug_level 0 > > >>> recv_queue_size 128 > > >>> send_queue_size 64 > > >>> > > >>> Philippe > > >>> > > >>> > > >>>> -----Original Message----- > > >>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > >>>> Sent: Thursday, December 14, 2006 6:32 PM > > >>>> To: Roland Dreier > > >>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; > > >>>> openib-general at openib.org > > >>>> Subject: Re: Performance Degradation with OFED v. 
Voltaire > > >>>> > > >>>> > > >>>>> > I think Eric described the major differences earlier on, > > >>>>> > > >>>> here it is, see > > >>>> > > >>>>> > second half: > > >>>>> > > >>>>> OK, I forgot about that. > > >>>>> > > >>>>> I guess one last thing to check would be the MTU being used > > >>>>> > > >>>> for the RC > > >>>> > > >>>>> connections. Since this is PCI-X HW then the MTU should > > >>>>> > > >>> be 1024 for > > >>> > > >>>>> best throughput (instead of the max MTU of 2048). > > >>>>> > > >>>> The MTU issue is described in the OFED release notes. > > >>>> You must turn the Tavor work-around for it on in opensm. > > >>>> This was introduced late in release cycle to it was > deemed safer > > >>>> to make it off by default. > > >>>> > > >>>> By the way, Eitan, Hal, can we turn this on by default now? > > >>>> This was we'll get more feedback from people, and > we'll still have > > >>>> time to turn it off before release if this unexpectedly > > >>>> creates issues. > > >>>> > > >>>> -- > > >>>> MST > > >>>> > > >>>> > > >> _______________________________________________ > > >> openib-general mailing list > > >> openib-general at openib.org > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > >> > > >> > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > From ogerlitz at voltaire.com Mon Dec 18 03:19:05 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: 
Mon, 18 Dec 2006 13:19:05 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <20061214173145.GC12781@mellanox.co.il> References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net> <20061214173145.GC12781@mellanox.co.il> Message-ID: <45867929.4080300@voltaire.com> Michael S. Tsirkin wrote: >> I guess one last thing to check would be the MTU being used for the RC >> connections. Since this is PCI-X HW then the MTU should be 1024 for >> best throughput (instead of the max MTU of 2048). > The MTU issue is described in the OFED release notes. > You must turn the Tavor work-around for it on in opensm. > This was introduced late in release cycle to it was deemed safer > to make it off by default. Michael, Let me see i follow you correct: a user must enable the tavor quirk in the **openSM** ? what about the cma_tavor_quirk? and what about users who want to use OFED with commercial/3rd party SMs ??? looking in the OFED 1.1 docs it is mentioned that either way should work. Looking on kernel_patches/fixes/cma_tavor_quirk.patch of OFED 1.1 (below) the thing seems to me uncompleted as the IB_SA_PATH_REC_MTU_SELECTOR and IB_SA_PATH_REC_MTU bits are not set in the component mask of the path record query done by the cma, am i missing something? Or. > Tavor systems get better performance with 1K MTU. Since there does > not seem to be any way to find out whether the remote system uses Tavor, > add an option to limit the MTU globally. > > Signed-off-by: Michael S. 
Tsirkin > > Index: linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c > =================================================================== > --- linux-2.6.18-rc2-devel.orig/drivers/infiniband/core/cma.c 2006-09-11 16:01:37.000000000 +0300 > +++ linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c 2006-09-13 18:51:45.000000000 +0300 > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty"); > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > MODULE_LICENSE("Dual BSD/GPL"); > > +static int tavor_quirk = 0; > +module_param_named(tavor_quirk, tavor_quirk, int, 0644); > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: limit MTU to 1K if > 0"); > + > #define CMA_CM_RESPONSE_TIMEOUT 20 > #define CMA_MAX_CM_RETRIES 3 > > @@ -1123,6 +1127,11 @@ static int cma_query_ib_route(struct rdm > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > path_rec.numb_path = 1; > > + if (tavor_quirk) { > + path_rec.mtu_selector = IB_SA_LT; > + path_rec.mtu = IB_MTU_2048; > + } > + > id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > id_priv->id.port_num, &path_rec, > IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | From eitan at sw053.yok.mtl.com Mon Dec 18 03:19:31 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Mon, 18 Dec 2006 13:19:31 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-18:normal completion Message-ID: <200612181119.kBIBJVLN029482@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 ibutils rev = Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1 Total=221 Pass=219 Fail=2 Pass: 31 LidMgr IS1-16.topo 30 Stability IS1-16.topo 30 Pkey IS1-16.topo 30 Multicast IS1-16.topo 29 OsmStress IS1-16.topo 10 Stability IS3-loop.topo 10 Stability IS3-128.topo 10 Pkey IS3-128.topo 10 Multicast IS3-loop.topo 10 Multicast IS3-128.topo 10 LidMgr IS3-128.topo 9 OsmStress IS3-128.topo Failures: 1 OsmStress IS3-128.topo 1 OsmStress IS1-16.topo From mst at mellanox.co.il Mon Dec 18 03:35:02 2006 From: mst at mellanox.co.il 
(Michael S. Tsirkin) Date: Mon, 18 Dec 2006 13:35:02 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire In-Reply-To: <45867929.4080300@voltaire.com> References: <45867929.4080300@voltaire.com> Message-ID: <20061218113502.GB3169@mellanox.co.il> > >> I guess one last thing to check would be the MTU being used for the RC > >> connections. Since this is PCI-X HW then the MTU should be 1024 for > >> best throughput (instead of the max MTU of 2048). > > > The MTU issue is described in the OFED release notes. > > You must turn the Tavor work-around for it on in opensm. > > This was introduced late in release cycle to it was deemed safer > > to make it off by default. > > Michael, > > Let me see i follow you correct: a user must enable the tavor quirk in > the **openSM** ? what about the cma_tavor_quirk? and what about users > who want to use OFED with commercial/3rd party SMs ??? looking in the > OFED 1.1 docs it is mentioned that either way should work. Right. But CMA quirk can only work if OFED CMA is initiating the connection from Tavor (i.e. it can't handle Arbel->Tavor case). Enabling this in Opensm solves the problem for all ULPs, and in all cases - whether Tavor is active or passive side in the connection. So fixing this in the SM is clearly the best solution. Further, as you point out the cma quirk patch in OFED looks broken :(. > Looking on kernel_patches/fixes/cma_tavor_quirk.patch of OFED 1.1 > (below) the thing seems to me uncompleted as the > IB_SA_PATH_REC_MTU_SELECTOR and IB_SA_PATH_REC_MTU bits are not set in > the component mask of the path record query done by the cma, am i > missing something? > > Or. Correct, looks like that bit is missing. > Tavor systems get better performance with 1K MTU. Since there does > not seem to be any way to find out whether the remote system uses Tavor, > add an option to limit the MTU globally. > > Signed-off-by: Michael S. 
Tsirkin > > Index: linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c > =================================================================== > --- linux-2.6.18-rc2-devel.orig/drivers/infiniband/core/cma.c 2006-09-11 16:01:37.000000000 +0300 > +++ linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c 2006-09-13 18:51:45.000000000 +0300 > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty"); > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > MODULE_LICENSE("Dual BSD/GPL"); > > +static int tavor_quirk = 0; > +module_param_named(tavor_quirk, tavor_quirk, int, 0644); > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: limit MTU to 1K if > 0"); > + > #define CMA_CM_RESPONSE_TIMEOUT 20 > #define CMA_MAX_CM_RETRIES 3 > > @@ -1123,6 +1127,11 @@ static int cma_query_ib_route(struct rdm > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > path_rec.numb_path = 1; > > + if (tavor_quirk) { > + path_rec.mtu_selector = IB_SA_LT; > + path_rec.mtu = IB_MTU_2048; > + } > + > id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > id_priv->id.port_num, &path_rec, > IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | -- MST From mst at mellanox.co.il Mon Dec 18 03:37:50 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 18 Dec 2006 13:37:50 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net> References: <1166210069.28709.196688.camel@hal.voltaire.com> <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net> Message-ID: <20061218113750.GC3169@mellanox.co.il> cma quirk seems not to work. Enabling the opensm quirk should work, and should be sufficient. However, you seem to be running another SM on your fabric (on your switch?) that's why it enters STANDBY. Disable that and try again. Quoting r. Bernadat, Philippe : Subject: Re: Performance Degradation with OFED v. Voltaire(lustre) I think I am going to need more help here. 
I did use both tricks, opensm enable_quirks TRUE & rdma_cm tavor_quirk=1. This seems to have no effect. But I may be doing something wrong. So some questions I have: 1) Doc (sdp_release_notes.txt, see below) says we can use either of the two tricks, is it really the case ? 2) I usually don't run opensm (not required for me till now) and I am not too familiar with it. But I did, so that I could try the enable_quirks TRUE quirk option. Does opensm run in background, when I run it never returns, last messages are: >>>> ------------------------------------------------- >>>> OpenSM Rev:openib-2.0.5 >>>> Based on OpenIB svn Exported revision >>>> Using Cached Option:guid = 0x0008f10403961e4d >>>> Using Cached Option:log_flags = 3 >>>> Using Cached Option:enable_quirks = TRUE >>>> Command Line Arguments: >>>> Log File: /var/log/osm.log >>>> ------------------------------------------------- >>>> OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision >>>> >>>> Entering STANDBY state 3) Is there a way to change the MTU from within the lustre LND kernel module. I saw that the IB perf programs did this with the modify_qp() APIs. 4) And by the way, I can confirm that the MTU is the issue. Forcing it to 2K with the ib_witre_perf test also degrades performance. Extract from sdp_release_notes.txt - By default, SDP utilizes a 2 Kbyte MTU size. This may cause PCI-X cards using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth. Workaround: reset the MTU size to 1K in this situation, using either of the two methods below: 1. Activate the "tavor quirk" workaround in opensm: a. Create an opensm options cache file (/var/cache/osm/opensm.opts): > opensm --cache-options -o b. Add the following line to /var/cache/osm/opensm.opts: enable_quirks TRUE c. Rerun opensm using your usual command line options to activate the opensm quirk option. 2. Activate the "tavor quirk" workaround in cma: set the tavor_quirk module parameter of the rdma_cm module to value 1 (default: 0). 
Philippe > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, December 15, 2006 8:15 PM > To: Eitan Zahavi > Cc: Matt L. Leininger; Roland Dreier; Bernadat, Philippe; > openib-general at openib.org > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire(lustre) > > On Fri, 2006-12-15 at 12:20, Eitan Zahavi wrote: > > Matt Leininger wrote: > > > On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote: > > > > > >> I also looked at the HCA counters, and I indeed think > > >> there is something wrong about the MTU: > > >> > > >> For the same test > > >> > > >> With VIB > > >> > > >> PortXmitData: 2684490382 > > >> PortRcvData: 1750145 > > >> PortXmitPkts: 10280007 > > >> PortRcvPkts: 49962 > > >> > > >> With OFED > > >> > > >> XmtBytes:........................2653730483 > > >> RcvBytes:........................1710541 > > >> XmtPkts:.........................5160009 > > >> RcvPkts:.........................50012 > > >> > > >> Which means we sent half less packets with OFED > > >> and if you do the math it is 2K packets with OFED > (counters are 32bit > > >> units) > > >> and 1K packets with VIB. > > >> > > >> So fo some reason the tavor_quirk param is ignored/overwriten. > > >> Is there an interface to control this ? > > >> > > > > > > Michael said you have to turn on this feature in > OpenSM. From the > > > release notes I'm not sure how you turn it on in OpenSM. > You did turn > > > on the tavor mtu work around in the rdma_cm, but did you > turn it on in > > > OpenSM? Also what version of OpenSM are you running? > > > > > To turn this option on in opensm you need to: > > 1. Run: opensm -c -o > > If you already have an opensm.opts file then you can skip this step. > > -- Hal > > > 2. Modify the file /var/cache/osm/opensm.opts by changing > the line below > > enable_quirks FALSE > > to > > enable_quirks TRUE > > > > 3. 
Run: opensm > > > Thanks, > > > > > > - Matt > > > > > > > > >> Philippe > > >> > > >> > > >>> -----Original Message----- > > >>> From: Bernadat, Philippe > > >>> Sent: Friday, December 15, 2006 8:59 AM > > >>> To: Michael S. Tsirkin; Roland Dreier > > >>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org > > >>> Subject: RE: Performance Degradation with OFED v. > Voltaire (lustre) > > >>> > > >>> I have set tavor_quirk to 1 with no effect. > > >>> Another thing I have tried is the same lustre > > >>> LNET echo test with a single thread (vs 8) > > >>> > > >>> VIB: 400 MB/s > > >>> OFED-1.1: 333 MB/s > > >>> > > >>> I am posting the live param values for all infiniband > > >>> modules in case someone could identify some wrong setting: > > >>> > > >>> infiniband/core/ib_cm > > >>> > > >>> mra_timeout_limit 30000 > > >>> > > >>> infiniband/core/rdma_cm > > >>> > > >>> max_cm_retries 15 > > >>> tavor_quirk 1 > > >>> > > >>> infiniband/hw/ipath/ib_ipath > > >>> > > >>> cfgports 0 > > >>> debug 1 > > >>> disable_sma 0 > > >>> kpiobufs 0 > > >>> lkey_table_size 12 > > >>> max_ahs 65535 > > >>> max_cqes 196607 > > >>> max_cqs 131071 > > >>> max_mcast_grps 16384 > > >>> max_mcast_qp_attached 16 > > >>> max_pds 65535 > > >>> max_qps 16384 > > >>> max_qp_wrs 16383 > > >>> max_sges 96 > > >>> max_srqs 1024 > > >>> max_srq_sges 128 > > >>> max_srq_wrs 131071 > > >>> qp_table_size 251 > > >>> > > >>> infiniband/hw/mthca/ib_mthca > > >>> > > >>> catas_reset_disable 0 > > >>> debug_level 0 > > >>> fmr_reserved_mtts 262144 > > >>> fw_cmd_doorbell 0 > > >>> msi 0 > > >>> msi_x 1 > > >>> num_cq 65536 > > >>> num_mcg 8192 > > >>> num_mpt 131072 > > >>> num_mtt 1048576 > > >>> num_qp 65536 > > >>> num_udav 32768 > > >>> rdb_per_qp 4 > > >>> tune_pci 1 > > >>> > > >>> infiniband/ulp/ipoib/ib_ipoib > > >>> > > >>> debug_level 0 > > >>> mcast_debug_level 0 > > >>> recv_queue_size 128 > > >>> send_queue_size 64 > > >>> > > >>> Philippe > > >>> > > >>> > > >>>> -----Original 
Message----- > > >>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > >>>> Sent: Thursday, December 14, 2006 6:32 PM > > >>>> To: Roland Dreier > > >>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; > > >>>> openib-general at openib.org > > >>>> Subject: Re: Performance Degradation with OFED v. Voltaire > > >>>> > > >>>> > > >>>>> > I think Eric described the major differences earlier on, > > >>>>> > > >>>> here it is, see > > >>>> > > >>>>> > second half: > > >>>>> > > >>>>> OK, I forgot about that. > > >>>>> > > >>>>> I guess one last thing to check would be the MTU being used > > >>>>> > > >>>> for the RC > > >>>> > > >>>>> connections. Since this is PCI-X HW then the MTU should > > >>>>> > > >>> be 1024 for > > >>> > > >>>>> best throughput (instead of the max MTU of 2048). > > >>>>> > > >>>> The MTU issue is described in the OFED release notes. > > >>>> You must turn the Tavor work-around for it on in opensm. > > >>>> This was introduced late in release cycle to it was > deemed safer > > >>>> to make it off by default. > > >>>> > > >>>> By the way, Eitan, Hal, can we turn this on by default now? > > >>>> This was we'll get more feedback from people, and > we'll still have > > >>>> time to turn it off before release if this unexpectedly > > >>>> creates issues. 
> > >>>> > > >>>> -- > > >>>> MST > > >>>> > > >>>> > > >> _______________________________________________ > > >> openib-general mailing list > > >> openib-general at openib.org > > >> http://openib.org/mailman/listinfo/openib-general > > >> > > >> To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > >> > > >> > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From dotanb at dev.mellanox.co.il Mon Dec 18 03:47:09 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Mon, 18 Dec 2006 13:47:09 +0200 Subject: [openib-general] Different low level drivers returns different return values incase of an error In-Reply-To: References: Message-ID: <45867FBD.9040300@dev.mellanox.co.il> Hi Hoang-Nam. Hoang-Nam Nguyen wrote: > Hi Dotan! > > Good point. I can speak for ehca only. We prefer to reuse existing > errno values and not to define new ones as it's also a question of > how much information we want to tell the consumer in case of error > and what it can handle for. To me the defined errno values give > enough information to caller. Anyway we should use same error > codes for both kernel and user space verbs. 
> Regards
> Nam
>

I think that there should be 2 modes to the drivers:
mode 1 (release mode): return "standard" errno values
mode 2 (debug mode)  : return "IB oriented" values

This can be done at compilation time, for example:

#ifdef IB_DEBUG
#define IB_EINVAL_MTU 1000
#define IB_EINVAL_LID 1001
#else
#define IB_EINVAL_MTU EINVAL
#define IB_EINVAL_LID EINVAL
#endif

This way, we will be able to help developers find out what the problem is
in case of an error when using a debug driver.

Anyway, we need to decide on a common behavior for all low level drivers.

thanks
Dotan

From ogerlitz at voltaire.com Mon Dec 18 04:03:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 14:03:22 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061218113502.GB3169@mellanox.co.il>
References: <45867929.4080300@voltaire.com> <20061218113502.GB3169@mellanox.co.il>
Message-ID: <4586838A.3040500@voltaire.com>

Philippe, can you try this patch? I have problems setting up a compilation
env right now, but it should work.

Unpack OFED 1.1, copy this to
OFED-1.1/openib-1.1/kernel_patches/fixes/xxx_cma_tavor_quirk.txt
and then pack OFED 1.1 and rebuild.

Or.

> Index: openib-1.1/drivers/infiniband/core/cma.c
> ===================================================================
> --- openib-1.1.orig/drivers/infiniband/core/cma.c	2006-12-18 13:27:45.213587734 +0200
> +++ openib-1.1/drivers/infiniband/core/cma.c	2006-12-18 13:34:24.921455159 +0200
> @@ -1117,6 +1117,9 @@ static void cma_query_handler(int status
>  	route = &work->id->id.route;
>
>  	if (!status) {
> +		/* XXX - if returned path MTU is 2K force it to be 1K */
> +		if(path_rec->mtu == IB_MTU_2048)
> +			path_rec->mtu = IB_MTU_1024;
>  		route->num_paths = 1;
>  		*route->path_rec = *path_rec;
>  	} else {

From halr at voltaire.com Mon Dec 18 04:09:02 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 07:09:02 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: better log message.
In-Reply-To: <20061217125052.GA2521@sashak.voltaire.com>
References: <20061217125052.GA2521@sashak.voltaire.com>
Message-ID: <1166443684.32666.178303.camel@hal.voltaire.com>

On Sun, 2006-12-17 at 07:50, Sasha Khapyorsky wrote:
> Better log message for mcrecord dumping in __osm_mcmr_rcv_leave_mgrp().
>
> Signed-off-by: Sasha Khapyorsky

From mst at mellanox.co.il Mon Dec 18 04:23:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 14:23:22 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <4586838A.3040500@voltaire.com>
References: <4586838A.3040500@voltaire.com>
Message-ID: <20061218122322.GD3169@mellanox.co.il>

Setting selectors for path query would be cleaner, no?

Quoting r. Or Gerlitz :
Subject: Re: [openib-general] Performance Degradation with OFED v. Voltaire

Philippe, can you try this patch, i have problems with setting a compilation
env now but it should work.

unpack OFED 1.1, copy this to
OFED-1.1/openib-1.1/kernel_patches/fixes/xxx_cma_tavor_quirk.txt
and then pack OFED 1.1 and rebuild

Or.

> Index: openib-1.1/drivers/infiniband/core/cma.c
> ===================================================================
> --- openib-1.1.orig/drivers/infiniband/core/cma.c	2006-12-18 13:27:45.213587734 +0200
> +++ openib-1.1/drivers/infiniband/core/cma.c	2006-12-18 13:34:24.921455159 +0200
> @@ -1117,6 +1117,9 @@ static void cma_query_handler(int status
>  	route = &work->id->id.route;
>
>  	if (!status) {
> +		/* XXX - if returned path MTU is 2K force it to be 1K */
> +		if(path_rec->mtu == IB_MTU_2048)
> +			path_rec->mtu = IB_MTU_1024;
>  		route->num_paths = 1;
>  		*route->path_rec = *path_rec;
>  	} else {

--
MST

From halr at voltaire.com Mon Dec 18 04:20:51 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 07:20:51 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: better log message.
In-Reply-To: <20061217125052.GA2521@sashak.voltaire.com>
References: <20061217125052.GA2521@sashak.voltaire.com>
Message-ID: <1166444405.32666.178789.camel@hal.voltaire.com>

On Sun, 2006-12-17 at 07:50, Sasha Khapyorsky wrote:
> Better log message for mcrecord dumping in __osm_mcmr_rcv_leave_mgrp().
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From halr at voltaire.com Mon Dec 18 04:43:50 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 07:43:50 -0500
Subject: [openib-general] [PATCH] opensm: sa mcmember_rec leave locking
In-Reply-To: <20061217125230.GB2521@sashak.voltaire.com>
References: <20061217125230.GB2521@sashak.voltaire.com>
Message-ID: <1166444463.32666.178845.camel@hal.voltaire.com>

On Sun, 2006-12-17 at 07:52, Sasha Khapyorsky wrote:
> Hold locked multicast group leave request (MCMember Record) processing.
> This prevents kind of race with multicast group join request where
> those requests can be reordered during processing.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From ogerlitz at voltaire.com Mon Dec 18 04:49:19 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 14:49:19 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061218122322.GD3169@mellanox.co.il>
References: <4586838A.3040500@voltaire.com> <20061218122322.GD3169@mellanox.co.il>
Message-ID: <45868E4F.9020708@voltaire.com>

Michael S. Tsirkin wrote:
> Setting selectors for path query would be cleaner, no?

Yes, but first I want to do it very hard-coded and see whether the
performance difference goes away, and only then productize it...

Or.

From ogerlitz at voltaire.com Mon Dec 18 05:03:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 15:03:22 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>
Message-ID: <4586919A.7060000@voltaire.com>

Bernadat, Philippe wrote:
> 3) Is there a way to change the MTU from within the lustre LND kernel
> module. I saw that the IB perf programs did this with the modify_qp()
> APIs.

Yes. Go to the place where the active side of the lustre LND gets the
RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id, and there set

	lustre_id->route.path_rec->mtu = IB_MTU_1024;

Or.
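
A minimal sketch of the suggestion above, written as user-space C so it can be compiled on its own. The struct and event names here (`struct cm_id`, `struct route`, `struct path_rec`, `enum cm_event`, `clamp_path_mtu`, `on_cm_event`) are hypothetical stand-ins, not the kernel's definitions; only the `IB_MTU_*` encodings and the fact that `route` is embedded in the cm id (the cma.c hunk in this thread takes `&work->id->id.route`) are taken as given. It assumes the path record has already been filled in when the route-resolved event arrives.

```c
/*
 * Mock types: only the fields this sketch touches are modelled.
 * The real ones live in the kernel's rdma_cm and ib_sa headers.
 */
enum ib_mtu {
    IB_MTU_256  = 1,
    IB_MTU_512  = 2,
    IB_MTU_1024 = 3,
    IB_MTU_2048 = 4,
    IB_MTU_4096 = 5
};

/* Hypothetical event names standing in for the RDMA CM event enum. */
enum cm_event { EV_ADDR_RESOLVED, EV_ROUTE_RESOLVED, EV_OTHER };

struct path_rec { enum ib_mtu mtu; };
struct route    { struct path_rec *path_rec; int num_paths; };
struct cm_id    { struct route route; };  /* embedded struct, not a pointer */

/* Cap the resolved path MTU at 1K. Returns 1 if the MTU was lowered. */
static int clamp_path_mtu(struct cm_id *id)
{
    if (id->route.num_paths < 1 || !id->route.path_rec)
        return 0;                       /* no path resolved yet */
    if (id->route.path_rec->mtu > IB_MTU_1024) {
        id->route.path_rec->mtu = IB_MTU_1024;
        return 1;
    }
    return 0;
}

/* Mock active-side event handler: clamp as soon as the route is known. */
static int on_cm_event(struct cm_id *id, enum cm_event ev)
{
    if (ev == EV_ROUTE_RESOLVED)
        return clamp_path_mtu(id);
    return 0;
}
```

Unlike the hard-coded test patch, this clamps anything above 1K rather than only 2K; that is a choice made for the sketch, not something proposed in the thread.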
From sashak at voltaire.com Mon Dec 18 05:30:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 18 Dec 2006 15:30:32 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-18:normal completion In-Reply-To: <200612181119.kBIBJVLN029482@sw053.yok.mtl.com> References: <200612181119.kBIBJVLN029482@sw053.yok.mtl.com> Message-ID: <20061218133032.GC4808@sashak.voltaire.com> Hi Eitan, On 13:19 Mon 18 Dec , Eitan Zahavi wrote: > OSM Simulation Regression Summary > OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 > ibutils rev = Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1 > Total=221 Pass=219 Fail=2 > > Pass: > 31 LidMgr IS1-16.topo > 30 Stability IS1-16.topo > 30 Pkey IS1-16.topo > 30 Multicast IS1-16.topo > 29 OsmStress IS1-16.topo > 10 Stability IS3-loop.topo > 10 Stability IS3-128.topo > 10 Pkey IS3-128.topo > 10 Multicast IS3-loop.topo > 10 Multicast IS3-128.topo > 10 LidMgr IS3-128.topo > 9 OsmStress IS3-128.topo > > Failures: > 1 OsmStress IS3-128.topo > 1 OsmStress IS1-16.topo Is it possible to have more details about failures (in case when it is real failures)? Probably to upload the logs to somewhere? Sasha From eitan at mellanox.co.il Mon Dec 18 05:33:50 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 18 Dec 2006 15:33:50 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-18:normal completion Message-ID: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com> Hi Sasha, The failure analysis takes time and is manual... The logs and related files are pretty big and will take space to upload. Today I simulated with OpenSM that was compiled on the side (my bad - should have incorporated my patches on the clone but I was not sure this is not going to "contaminate" that git tree forever) with the fixes for DONE/DONE_PENDING. The tests that failed today are actually false violations: 1. The IS1-16 failed due to lack of free sockets to connect to the server. Still not clear why. 
I will increase the number of sockets the client/server try to connect on. 2. The IS3-128 fail due to temporary replacement of the opensm with the one that have my fixes for DONE/DONE_PENDING. This was a mistake I did manually by compiling the "clone". As I was watching the log I have noticed that the same wrong signal was happening. BTW: The DONE/DONE_PENDING bug was discovered by a change in simulator dispatcher that I did. The change introduced a BUG that caused the machine to be overloaded with busy loop in the simulator dispatcher. Apparently this brought up some different timing and found these bugs. EZ > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Monday, December 18, 2006 3:31 PM > To: Eitan Zahavi > Cc: Eitan Zahavi; Yevgeny Kliteynik; halr at voltaire.com; openib- > general at openib.org > Subject: Re: nightly osm_sim report 2006-12-18:normal completion > > Hi Eitan, > > On 13:19 Mon 18 Dec , Eitan Zahavi wrote: > > OSM Simulation Regression Summary > > OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 ibutils rev = > > Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1 > > Total=221 Pass=219 Fail=2 > > > > Pass: > > 31 LidMgr IS1-16.topo > > 30 Stability IS1-16.topo > > 30 Pkey IS1-16.topo > > 30 Multicast IS1-16.topo > > 29 OsmStress IS1-16.topo > > 10 Stability IS3-loop.topo > > 10 Stability IS3-128.topo > > 10 Pkey IS3-128.topo > > 10 Multicast IS3-loop.topo > > 10 Multicast IS3-128.topo > > 10 LidMgr IS3-128.topo > > 9 OsmStress IS3-128.topo > > > > Failures: > > 1 OsmStress IS3-128.topo > > 1 OsmStress IS1-16.topo > > Is it possible to have more details about failures (in case when it is real > failures)? Probably to upload the logs to somewhere? > > Sasha From mst at mellanox.co.il Mon Dec 18 05:40:10 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 18 Dec 2006 15:40:10 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-18:normal completion In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com> Message-ID: <20061218134010.GE3169@mellanox.co.il> > should have incorporated my patches on the clone but I was not sure this > is not going to "contaminate" that git tree forever No, you can always git rebase to move your patches to the top of the pile, or just git reset to revert to upstream version. Just don't do this for a tree someone else might have cloned and based his development on. -- MST From philippe_bernadat at hp.com Mon Dec 18 05:42:21 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Mon, 18 Dec 2006 14:42:21 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <4586919A.7060000@voltaire.com> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E0557115F@idaexc03.emea.cpqcorp.net> I have tried both fiexs, none of these improve performance ... Let me check the number of packets again. Philippe > -----Original Message----- > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > Sent: Monday, December 18, 2006 2:03 PM > To: Bernadat, Philippe > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; > openib-general at openib.org > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire(lustre) > > Bernadat, Philippe wrote: > > 3) Is there a way to change the MTU from within the lustre > LND kernel > > module. I saw that the IB perf programs did this with the > modify_qp() > > APIs. > > yes, go to the place where the lustre NLD active side gets > RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id and then set > > lustre_id->route->path_rec->mtu = IB_MTU_1024; > > Or. 
> > > > From philippe_bernadat at hp.com Mon Dec 18 06:09:19 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Mon, 18 Dec 2006 15:09:19 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net> Or, I did manage to fix it my way, by inserting this same route->path_rec->mtu = IB_MTU_1024; Before/after qp creation Before/after accept Before/after connect So not sure which one really fixes it. Philippe > -----Original Message----- > From: Bernadat, Philippe > Sent: Monday, December 18, 2006 2:42 PM > To: Or Gerlitz > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; > openib-general at openib.org > Subject: RE: [openib-general] Performance Degradation with > OFED v. Voltaire(lustre) > > I have tried both fiexs, none of these improve performance ... > Let me check the number of packets again. > > Philippe > > > -----Original Message----- > > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > > Sent: Monday, December 18, 2006 2:03 PM > > To: Bernadat, Philippe > > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; > > openib-general at openib.org > > Subject: Re: [openib-general] Performance Degradation with > > OFED v. Voltaire(lustre) > > > > Bernadat, Philippe wrote: > > > 3) Is there a way to change the MTU from within the lustre > > LND kernel > > > module. I saw that the IB perf programs did this with the > > modify_qp() > > > APIs. > > > > yes, go to the place where the lustre NLD active side gets > > RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id and then set > > > > lustre_id->route->path_rec->mtu = IB_MTU_1024; > > > > Or. > > > > > > > > From wombat2 at us.ibm.com Mon Dec 18 06:09:56 2006 From: wombat2 at us.ibm.com (Bernard King-Smith) Date: Mon, 18 Dec 2006 09:09:56 -0500 Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support In-Reply-To: <20061216170328.GB24716@mellanox.co.il> Message-ID: "Michael S. 
Tsirkin" wrote on 12/16/2006 12:03:28 PM: > > > > > > > > > > Tried this patch, it didn't work on ehca. I couldn't change > the mode from > > > > > datagram to connected from /sys/class. > > > > > > > > It's wroking as designed in that respect. ehca does not implement > > > > srq - without > > > > srq, there is no way to prepost receive buffers for a > resonable number of > > > > connections without running out of memory. > > > > > > > > So it is falling back on datagram mode. > > > > Talk to ehca guys to implement srq and connected mode will be enabled. > > > Don't remember SRQ is a MUST for UC mode. Does this patch support > > > devices with SRQ in RC mode? > > > > I don't think the IB HCA Spec requires SRQ support for RC but is an optional > > feature. There are two adapters right now that don't support SRQ > which means to > > use IPoIB-CM on them you should make the use of SRQ an option setting. > > No, adding such "drink up all memory on real clusters but run well > on a back to back > benchmark platform" option does not seem like a good idea to me. > Rather, we should use UD mode to keep IPoIB scalable on all hardware. I agree that adapters that don't have SRQ can consume larger amounts of memory than those with SRQ ,however, that is not a good reason to prevent usage of RC or UC on those adapters. The memory consumption problem with any protocol not using SRQ and running over RC or UC is well documented. At the OpenFabrics meeting in Tampa one of several themes was that we need better IP performance to move into commercial customers and also help our current primarily HPC customers, some which are not large numbers of endpoints configurations. Even thought other ULP's are available, good IP is still the opportunity to getting more customers on IB. Not all IB customers we have a large number of endpoint deployments so having non SRQ adapters use IPoIB-CM is still important to expanding the customer base for IB. 
You have to let the customer decide how they want to tune their system based on the available functions/features. If not you don't have equality in potential performance across all HCA's. Some guidance on memory consumption would be good, to guide users whether they want to run IPoIB-CM without SRQ just like IPoIB-CM will be selectable. > > > I agree > > that if it is available it should be used for scaling issues probably if > > available automatically set. But I would like to see us at least support the > > current hardware that meets the current SPEC. > > SRQ support is clearly optional. But neither is IPoIB CM support a required > feature. Current code will fall back to datagram mode when SRQ is not > supported, and since UD support in not optional, all current hardware is still > supported with IPoIB - this patch does not break this. > > -- > MST Bernie King-Smith IBM Corporation Server Group Cluster System Performance wombat2 at us.ibm.com (845)433-8483 Tie. 293-8483 or wombat2 on NOTES "We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner -------------- next part -------------- An HTML attachment was scrubbed... URL: From philippe_bernadat at hp.com Mon Dec 18 06:19:36 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Mon, 18 Dec 2006 15:19:36 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> So after a bit more testing, setting the route path mtu to 1024 before the qp creation (rdma_create_qp()) seems sufficient. 
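
Whether a change like this took effect can be cross-checked with the counter arithmetic used earlier in this thread: the port data counters are in 32-bit units, so the average packet size is 4 * data / packets. The sketch below is only a rough heuristic, since the average includes headers and any short packets; the function names `avg_bytes_per_packet` and `likely_mtu` are made up for illustration.

```c
#include <stdint.h>

/* Average bytes per packet; the data counter is in 32-bit (4-byte) units. */
static double avg_bytes_per_packet(uint64_t data_words, uint64_t pkts)
{
    if (pkts == 0)
        return 0.0;
    /* 64-bit math: 4 * counter does not fit in 32 bits for these values */
    return (double)(data_words * 4u) / (double)pkts;
}

/* Largest standard IB MTU not exceeding the average (0 if below 256). */
static unsigned likely_mtu(double avg)
{
    static const unsigned mtus[] = { 4096, 2048, 1024, 512, 256 };
    for (unsigned i = 0; i < sizeof(mtus) / sizeof(mtus[0]); i++)
        if (avg >= mtus[i])
            return mtus[i];
    return 0;
}
```

With the transmit counters posted for the same test, VIB (2684490382 / 10280007) comes out at about 1044 bytes per packet and OFED (2653730483 / 5160009) at about 2057, which matches the 1K versus 2K observation.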
Philippe > -----Original Message----- > From: Bernadat, Philippe > Sent: Monday, December 18, 2006 3:09 PM > To: Bernadat, Philippe; Or Gerlitz > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; > openib-general at openib.org > Subject: RE: [openib-general] Performance Degradation with > OFED v. Voltaire(lustre) > > Or, > > I did manage to fix it my way, by inserting this same > > route->path_rec->mtu = IB_MTU_1024; > > Before/after qp creation > Before/after accept > Before/after connect > > So not sure which one really fixes it. > > Philippe > > > > > -----Original Message----- > > From: Bernadat, Philippe > > Sent: Monday, December 18, 2006 2:42 PM > > To: Or Gerlitz > > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; > > openib-general at openib.org > > Subject: RE: [openib-general] Performance Degradation with > > OFED v. Voltaire(lustre) > > > > I have tried both fiexs, none of these improve performance ... > > Let me check the number of packets again. > > > > Philippe > > > > > -----Original Message----- > > > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > > > Sent: Monday, December 18, 2006 2:03 PM > > > To: Bernadat, Philippe > > > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; > > > openib-general at openib.org > > > Subject: Re: [openib-general] Performance Degradation with > > > OFED v. Voltaire(lustre) > > > > > > Bernadat, Philippe wrote: > > > > 3) Is there a way to change the MTU from within the lustre > > > LND kernel > > > > module. I saw that the IB perf programs did this with the > > > modify_qp() > > > > APIs. > > > > > > yes, go to the place where the lustre NLD active side gets > > > RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id and then set > > > > > > lustre_id->route->path_rec->mtu = IB_MTU_1024; > > > > > > Or. 
From wombat2 at us.ibm.com Mon Dec 18 06:33:04 2006
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Mon, 18 Dec 2006 09:33:04 -0500
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To:
Message-ID:

> ----- Message from "Michael S. Tsirkin" on Sat,
> 16 Dec 2006 18:47:09 +0200 -----
>
> To:
>
> "Shirley Ma"
>
> cc:
>
> openib-general at openib.org
>
> Subject:
>
> Re: [openib-general] [PATCHv2] IPoIB CM Experimental support
>
> > > > Hi, Michael,
> > > >
> > > > Tried this patch, it didn't work on ehca. I couldn't change the mode from
> > > > datagram to connected from /sys/class.
> > >
> > > It's working as designed in that respect. ehca does not implement srq - without
> > > srq, there is no way to prepost receive buffers for a reasonable number of
> > > connections without running out of memory.
> > >
> > > So it is falling back on datagram mode.
> > > Talk to ehca guys to implement srq and connected mode will be enabled.
> >
> > Don't remember SRQ is a MUST for UC mode. Does this patch support devices with
> > SRQ in RC mode?
>
> Yes. Only RC mode is supported by this patch.
> From what you say I am guessing that SRQ is supported by ehca HW but support
> is currently lacking in the ehca driver?

The current EHCA hardware does NOT support SRQ.

>
> --
> MST
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general

Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
wombat2 at us.ibm.com (845)433-8483
Tie. 293-8483 or wombat2 on NOTES

"We are not responsible for the world we are born into, only for the world we leave when we die. So we have to accept what has gone before us and work to change the only thing we can, -- The Future." William Shatner

From mst at mellanox.co.il Mon Dec 18 06:46:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 16:46:23 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To:
References:
Message-ID: <20061218144623.GF3169@mellanox.co.il>

> I agree that adapters that don't have SRQ can consume larger amounts of memory than those with SRQ, however, that is not a good reason to prevent usage of RC or UC on those adapters. The memory consumption problem with any protocol not using SRQ and
> running over RC or UC is well documented.

But not solved.

> At the OpenFabrics meeting in Tampa one of several themes was that we need better IP performance to move into commercial customers and also help our current primarily HPC customers, some which are not large numbers
> of endpoints configurations. Even though other ULP's are available, good IP is still the opportunity to getting more customers on IB.

That's why you need zero configuration setup that works well on anything from back-to-back to 1000s of nodes. And this means code that's scalable by design.

> Not all IB customers we have a large number of endpoint deployments so having
> non SRQ adapters use IPoIB-CM is still important to expanding the customer base
> for IB. You have to let the customer decide how they want to tune their system
> based on the available functions/features.

This just sounds too ugly. I do not *want* to special-case small clusters precisely because this way big iron flows get no testing. And people should not "tune" their systems just to have them basically not run out of memory and crash.

> If not you don't have equality in
> potential performance across all HCA's.

??? It's not *practical* to require equivalent performance on all HCAs. I just try to do the best I can, and I don't think each trade-off needs to be turned into a configuration option.
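
The memory argument in this thread can be put as rough arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the connection count, queue depth and buffer size used in the example are assumptions chosen for round numbers, not figures taken from the IPoIB CM patch.

```c
#include <stdint.h>

/* Receive-buffer memory when every connection preposts its own buffers. */
static uint64_t rx_bytes_per_conn(uint64_t conns, uint64_t depth, uint64_t buf)
{
    return conns * depth * buf;
}

/* With a shared receive queue, one pool of buffers serves all connections. */
static uint64_t rx_bytes_srq(uint64_t srq_depth, uint64_t buf)
{
    return srq_depth * buf;
}
```

Under the assumed values of 1000 connections, 128 preposted receives per connection and 64 KiB buffers, rx_bytes_per_conn(1000, 128, 65536) is 8388608000 bytes (about 7.8 GiB), while rx_bytes_srq(1024, 65536) is 67108864 bytes (64 MiB) regardless of how many connections share the queue. The per-connection cost grows linearly with the number of peers; the SRQ cost does not.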
> Some guidance on memory consumption
> would be good, to guide users whether they want to run IPoIB-CM without SRQ just
> like IPoIB-CM will be selectable.

I still think falling back to UD mode is the right solution if HCA does not support SRQ. I just don't see an "ignore scalability issues" option in IPoIB as being anything but a support nightmare, and having any right to existence outside a lab.

But - let's see this code land upstream, then code up a patch that is not ugly, and post it. But IMO time might be better spent adding srq support in ehca.

--
MST

From sashak at voltaire.com Mon Dec 18 07:10:10 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 18 Dec 2006 17:10:10 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-18:normal completion
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>
Message-ID: <20061218151010.GG4808@sashak.voltaire.com>

On 15:33 Mon 18 Dec , Eitan Zahavi wrote:
> Hi Sasha,
>
> The failure analysis takes time and is manual...
> The logs and related files are pretty big and will take space to upload.
>
> Today I simulated with OpenSM that was compiled on the side (my bad -
> should have incorporated my patches on the clone but I was not sure this
> is not going to "contaminate" that git tree forever) with the fixes for
> DONE/DONE_PENDING.

You can commit your changes to the branch, and later rebase this branch on top of the new master, something like 'git-rebase master my-branch'.

> The tests that failed today are actually false violations:
> 1. The IS1-16 failed due to lack of free sockets to connect to the
> server. Still not clear why. I will increase the number of sockets the
> client/server try to connect on.
> 2. The IS3-128 fail due to temporary replacement of the opensm with the
> one that have my fixes for DONE/DONE_PENDING. This was a mistake I did
> manually by compiling the "clone". As I was watching the log I have
> noticed that the same wrong signal was happening.

Understood.

> BTW: The DONE/DONE_PENDING bug was discovered by a change in simulator
> dispatcher that I did. The change introduced a BUG that caused the
> machine to be overloaded with busy loop in the simulator dispatcher.
> Apparently this brought up some different timing and found these bugs.

So it was helpful simulator shakes. :) Thanks for catching this.

BTW,

>
> EZ
>
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > Sent: Monday, December 18, 2006 3:31 PM
> > To: Eitan Zahavi
> > Cc: Eitan Zahavi; Yevgeny Kliteynik; halr at voltaire.com;
> > openib-general at openib.org
> > Subject: Re: nightly osm_sim report 2006-12-18:normal completion
> >
> > Hi Eitan,
> >
> > On 13:19 Mon 18 Dec , Eitan Zahavi wrote:
> > > OSM Simulation Regression Summary
> > > OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 ibutils rev =
> > > Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1
> > > Total=221 Pass=219 Fail=2
> > >
> > > Pass:
> > > 31 LidMgr IS1-16.topo
> > > 30 Stability IS1-16.topo
> > > 30 Pkey IS1-16.topo
> > > 30 Multicast IS1-16.topo
> > > 29 OsmStress IS1-16.topo
> > > 10 Stability IS3-loop.topo
> > > 10 Stability IS3-128.topo
> > > 10 Pkey IS3-128.topo
> > > 10 Multicast IS3-loop.topo
> > > 10 Multicast IS3-128.topo
> > > 10 LidMgr IS3-128.topo
> > > 9 OsmStress IS3-128.topo
> > >
> > > Failures:
> > > 1 OsmStress IS3-128.topo
> > > 1 OsmStress IS1-16.topo
> >
> > Is it possible to have more details about failures (in case when it is real
> > failures)? Probably to upload the logs to somewhere?
> >
> > Sasha

From halr at voltaire.com Mon Dec 18 07:11:58 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 10:11:58 -0500
Subject: [openib-general] [PATCH] osm: fix a bug in ignroing pending transaction of Light Sweep
In-Reply-To: <45844167.9060302@mellanox.co.il>
References: <45844167.9060302@mellanox.co.il>
Message-ID: <1166454599.32666.185925.camel@hal.voltaire.com>

Hi Eitan,

On Sat, 2006-12-16 at 13:56, Eitan Zahavi wrote:
> Hi Hal
>
> This patch fixes an issue discovered by the nightly regression.
> OpenSM state machine got stuck due to pending SwitchInfo transaction
> being ignored since one of the queries for SwitchInfo
> failed (due to bad-link).
> The patch below simply avoids aborting the wait for all SwitchInfo
> requests to return.
>
> I think this issue might have hurt us in other situations too since it
> aborted the wait on "CHANGE DETECTED" too.
> CHANGE_DETECTED is fired on the first switch that reported "Change Bit".
>
> It is possible that the issue is showing up as we added incremental
> support (e.g. for routing)
> Since only if there are no other SMP's sent during the heavy sweep we
> will get the
> "NO_PENDING_TRANSACTIONS" signal caused by the SwitchInfo requests

So is the same issue applicable to OFED 1.1 ?

> Eitan
>
> Signed-off-by: Eitan Zahavi
>
> osm/opensm/osm_state_mgr.c | 5 ++---
> 1 files changed, 2 insertions(+), 3 deletions(-)

Thanks. Applied.

-- Hal

From bos at pathscale.com Mon Dec 18 07:22:17 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Mon, 18 Dec 2006 07:22:17 -0800
Subject: [openib-general] Different low level drivers returns different return values incase of an error
In-Reply-To:
References:
Message-ID: <4586B229.5050300@pathscale.com>

Hoang-Nam Nguyen wrote:
> Good point. I can speak for ehca only. We prefer to reuse existing
> errno values and not to define new ones as it's also a question of
> how much information we want to tell the consumer in case of error
> and what it can handle for.

This is independent of the question of whether to return -1 or -errno to indicate an error in userspace. The standard in userspace has long been to return -1, with the error code propagated through the errno pseudo-variable. It's a crummy convention, but it's at least consistent with the rest of userspace.

(By the way, libipathverbs just propagates error codes up from libibverbs. It doesn't generate any new numeric return values of its own.)

References: <45867FBD.9040300@dev.mellanox.co.il>
Message-ID: <4586B2B1.2060908@pathscale.com>

Dotan Barak wrote:
> I think that there should be 2 modes to the drivers:
> mode 1 (release mode): return "standard" errno values
> mode 2 (debug mode) : return "IB oriented" values

No way, that's a guaranteed route to broken code. If you want to propagate IB-specific error values, define an ib_errno variable, make it use the same TLS mechanism as errno, give it well-defined values, and make it part of the ABI. Some mechanism that you can't rely on unless you know you need to tweak it is worse than useless.

References: <45867FBD.9040300@dev.mellanox.co.il> <4586B2B1.2060908@pathscale.com>
Message-ID: <4586B929.9080306@dev.mellanox.co.il>

Bryan O'Sullivan wrote:
> Dotan Barak wrote:
>
>> I think that there should be 2 modes to the drivers:
>> mode 1 (release mode): return "standard" errno values
>> mode 2 (debug mode) : return "IB oriented" values
>
> No way, that's a guaranteed route to broken code. If you want to
> propagate IB-specific error values, define an ib_errno variable, make
> it use the same TLS mechanism as errno, give it well-defined values,
> and make it part of the ABI. Some mechanism that you can't rely on
> unless you know you need to tweak it is worse than useless.
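
A minimal sketch of what Bryan suggests above: a thread-local ib_errno that sits alongside the usual "return -1 and set errno" convention. Everything named here (`ib_errno`, the `IB_ERR_*` values, `demo_create_qp`) is hypothetical and does not exist in libibverbs; the sketch also assumes a C11 compiler for `_Thread_local`.

```c
#include <errno.h>

/* Hypothetical IB-specific error codes; a real ABI would have to fix these values. */
enum ib_err {
    IB_ERR_NONE     = 0,
    IB_ERR_QP_LIMIT = 1,   /* device is out of QPs */
    IB_ERR_BAD_CAP  = 2    /* requested capabilities not supported */
};

/* One copy per thread, like errno. */
_Thread_local enum ib_err ib_errno = IB_ERR_NONE;

/*
 * Hypothetical verb: returns 0 on success, -1 on failure.
 * On failure errno carries the standard code and ib_errno the IB-specific detail.
 */
int demo_create_qp(unsigned max_send_wr, unsigned device_max_wr)
{
    if (max_send_wr > device_max_wr) {
        errno = EINVAL;
        ib_errno = IB_ERR_BAD_CAP;
        return -1;
    }
    ib_errno = IB_ERR_NONE;
    return 0;
}
```

A caller that only understands errno keeps working unchanged, and a caller that wants the IB-specific reason reads ib_errno after a -1 return. Because the behaviour does not depend on a release/debug switch, it avoids the two-mode problem Bryan objects to.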
> > References: <45846F4C.4080501@mellanox.co.il> Message-ID: <1166458043.32666.188439.camel@hal.voltaire.com> Hi Eitan, On Sat, 2006-12-16 at 17:12, Eitan Zahavi wrote: > Hi Hal > > This set of patches fixes issues of not providing back to state manager > OSM_SIGNAL_DONE_PENDING > which breaks the state machine later in the sweep. > > Eitan > > Signed-off-by: Eitan Zahavi > > osm/opensm/osm_pkey_mgr.c | 112 > ++++++++++++++++++++++++++++++++------------ This patch (here and other places) appear to be line wrapped. > osm/opensm/osm_state_mgr.c | 11 +++-- > osm/opensm/osm_ucast_mgr.c | 96 ++++++++++++++++++++++++-------------- > 4 files changed, 179 insertions(+), 88 deletions(-) Is this patch 4 files or 3 ? (How was this patch generated ?) Is this one patch or should it be 2 or 3 ? It looks to me there is an incremental change to osm_state_mgr.c and perhaps 2 other ones which can be separate (pkey and ucast_mgr). Also, see below in osm_state_mgr.c for another minor comment. > diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c > index 48837bc..a33aec7 100644 > --- a/osm/opensm/osm_pkey_mgr.c > +++ b/osm/opensm/osm_pkey_mgr.c > @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry( > > /********************************************************************** > **********************************************************************/ > -static ib_api_status_t > +static boolean_t > pkey_mgr_enforce_partition( > + IN osm_log_t *p_log, > IN const osm_req_t *p_req, > IN const osm_physp_t *p_physp, > IN const boolean_t enforce) > @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition( > osm_madw_context_t context; > uint8_t payload[IB_SMP_DATA_SIZE]; > ib_port_info_t *p_pi; > + ib_api_status_t status; > > if (!(p_pi = osm_physp_get_port_info_ptr( p_physp ))) > - return IB_ERROR; > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_enforce_partition: ERR 0507: " > + "No port info for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + 
osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return FALSE; > + } > > - if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) > - return IB_SUCCESS; > + if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_enforce_partition: " > + "No need to update PortInfo for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return FALSE; > + } > > memset( payload, 0, IB_SMP_DATA_SIZE ); > memcpy( payload, p_pi, sizeof(ib_port_info_t) ); > @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition( > context.pi_context.light_sweep = FALSE; > context.pi_context.active_transition = FALSE; > > - return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), > - payload, sizeof(payload), > - IB_MAD_ATTR_PORT_INFO, > - cl_hton32( osm_physp_get_port_num( p_physp ) ), > - CL_DISP_MSGID_NONE, &context ); > + status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), > + payload, sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + cl_hton32( osm_physp_get_port_num( p_physp ) ), > + CL_DISP_MSGID_NONE, &context ); > + if (status != IB_SUCCESS) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_enforce_partition: ERR 0520: " > + "Failed to set PortInfo for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return FALSE; > + } > + else > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_enforce_partition: " > + "Set PortInfo for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return TRUE; > + } > } > > /********************************************************************** > @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port( > 
> status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, > block_index ); > if (status == IB_SUCCESS) > - ret_val = TRUE; > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_update_port: " > + "Updated " > + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > + block_index, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + ret_val = TRUE; > + } > else > - osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_port: ERR 0506: " > - "pkey_mgr_update_pkey_entry() failed to update " > - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > - block_index, > - cl_ntoh64( osm_node_get_node_guid( p_node ) ), > - osm_physp_get_port_num( p_physp ) ); > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_port: ERR 0506: " > + "pkey_mgr_update_pkey_entry() failed to update " > + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > + block_index, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + } > } > > return ret_val; > @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port( > uint16_t peer_max_blocks; > ib_api_status_t status = IB_SUCCESS; > boolean_t ret_val = FALSE; > + boolean_t port_info_set = FALSE; > ib_pkey_table_t empty_block; > - > + > memset(&empty_block, 0, sizeof(ib_pkey_table_t)); > > p_physp = osm_port_get_default_phys_ptr( p_port ); > @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port( > enforce = FALSE; > } > > - if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > - { > - osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: ERR 0507: " > - "pkey_mgr_enforce_partition() failed to update " > - "node 0x%016" PRIx64 " port %u\n", > - cl_ntoh64( osm_node_get_node_guid( p_node ) ), > - osm_physp_get_port_num( peer ) ); > - } > + if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce)) > + port_info_set = TRUE; > > if (enforce == FALSE) > - return FALSE; > + return port_info_set; > > p_peer_pkey_tbl->used_blocks 
= p_pkey_tbl->used_blocks; > for (block_index = 0; block_index < p_pkey_tbl->used_blocks; > block_index++) > @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port( > osm_physp_get_port_num( peer ) ); > } > > + if (port_info_set) return TRUE; > return ret_val; > } > > @@ -541,10 +593,10 @@ osm_pkey_mgr_process( > signal = OSM_SIGNAL_DONE_PENDING; > p_node = osm_port_get_parent_node( p_port ); > if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && > - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, > + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, > &p_osm->subn, p_port, > !p_osm->subn.opt.no_partition_enforcement ) ) > - signal = OSM_SIGNAL_DONE_PENDING; > + signal = OSM_SIGNAL_DONE_PENDING; > } > > _err: > diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c > index 9eac038..4e61259 100644 > --- a/osm/opensm/osm_state_mgr.c > +++ b/osm/opensm/osm_state_mgr.c > @@ -1853,6 +1853,7 @@ osm_state_mgr_process( > { > ib_api_status_t status; > osm_remote_sm_t *p_remote_sm; > + osm_signal_t tmp_signal; > > CL_ASSERT( p_mgr ); > > @@ -2075,11 +2076,10 @@ osm_state_mgr_process( > case OSM_SIGNAL_CHANGE_DETECTED: > /* > * Nothing to do here. One subnet change typcially > - * begets another.... > + * begets another.... But needs to wait for all transactions > */ > signal = OSM_SIGNAL_NONE; > - break; > - This is a repeat of your previous submitted patch to this file so isn't needed. -- Hal > + break; > case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: > /* > * A change was detected on the subnet. 
> @@ -2219,7 +2219,10 @@ osm_state_mgr_process( > signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); > > /* the returned signal is always DONE */ > - signal = osm_qos_setup(p_mgr->p_subn->p_osm); > + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); > + > + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) > + signal = OSM_SIGNAL_DONE_PENDING; > > /* try to restore SA DB (this should be before lid_mgr > because we may want to disable clients reregistration > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > index e977253..39973de 100644 > --- a/osm/opensm/osm_ucast_mgr.c > +++ b/osm/opensm/osm_ucast_mgr.c > @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table( > ib_switch_info_t si; > uint32_t block_id_ho = 0; > uint8_t block[IB_SMP_DATA_SIZE]; > + boolean_t set_swinfo_require = FALSE; > + uint16_t lin_top; > + uint8_t life_state; > > CL_ASSERT( p_mgr ); > > @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table( > Set the top of the unicast forwarding table. > */ > si = *osm_switch_get_si_ptr( p_sw ); > - si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); > + lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); > + if (si.lin_top != lin_top) > + { > + set_swinfo_require = TRUE; > + si.lin_top = lin_top; > + } > > /* check to see if the change state bit is on. If it is - then we > need to clear it. 
*/ > - if( ib_switch_info_get_state_change( &si ) ) > - si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) > - | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; > + if ( ib_switch_info_get_state_change( &si ) ) > + life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) > + | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; > else > - si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; > + life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; > > - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > + if (life_state != si.life_state) > { > - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > - "osm_ucast_mgr_set_fwd_table: " > - "Setting switch FT top to LID 0x%X\n", > - osm_switch_get_max_lid_ho( p_sw ) ); > + set_swinfo_require = TRUE; > + si.life_state = life_state; > } > - > - context.si_context.light_sweep = FALSE; > - context.si_context.node_guid = osm_node_get_node_guid( p_node ); > - context.si_context.set_method = TRUE; > - > - status = osm_req_set( p_mgr->p_req, > - p_path, > - (uint8_t*)&si, > - sizeof(si), > - IB_MAD_ATTR_SWITCH_INFO, > - 0, > - CL_DISP_MSGID_NONE, > - &context ); > - > - if( status != IB_SUCCESS ) > + > + if ( set_swinfo_require ) > { > - osm_log( p_mgr->p_log, OSM_LOG_ERROR, > - "osm_ucast_mgr_set_fwd_table: ERR 3A06: " > - "Sending SwitchInfo attribute failed (%s)\n", > - ib_get_err_str( status ) ); > + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) > + { > + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, > + "osm_ucast_mgr_set_fwd_table: " > + "Setting switch FT top to LID 0x%X\n", > + osm_switch_get_max_lid_ho( p_sw ) ); > + } > + > + context.si_context.light_sweep = FALSE; > + context.si_context.node_guid = osm_node_get_node_guid( p_node ); > + context.si_context.set_method = TRUE; > + > + status = osm_req_set( p_mgr->p_req, > + p_path, > + (uint8_t*)&si, > + sizeof(si), > + IB_MAD_ATTR_SWITCH_INFO, > + 0, > + CL_DISP_MSGID_NONE, > + &context ); > + > + if( status != IB_SUCCESS ) > + { > + osm_log( 
p_mgr->p_log, OSM_LOG_ERROR, > + "osm_ucast_mgr_set_fwd_table: ERR 3A06: " > + "Sending SwitchInfo attribute failed (%s)\n", > + ib_get_err_str( status ) ); > + } > + else > + p_mgr->any_change = TRUE; > } > > /* > @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process( > > CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); > > + p_mgr->any_change = FALSE; > + > /* > If there are no switches in the subnet, we are done. > */ > if (cl_qmap_count( p_sw_guid_tbl ) == 0) > goto Exit; > > - p_mgr->any_change = FALSE; > cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL); > > if (!p_routing_eng->build_lid_matrices || > @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process( > if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) > __osm_ucast_mgr_dump_tables( p_mgr ); > > - if (p_mgr->any_change) > + if (p_mgr->any_change) > + { > signal = OSM_SIGNAL_DONE_PENDING; > + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, > + "osm_ucast_mgr_process: " > + "LFT Tables configured on all switches\n"); > + } > else > + { > + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, > + "osm_ucast_mgr_process: " > + "No need to set any LFT Tables on all switches\n"); > signal = OSM_SIGNAL_DONE; > - > - osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, > - "osm_ucast_mgr_process: " > - "LFT Tables configured on all switches\n"); > + } > > Exit: > CL_PLOCK_RELEASE( p_mgr->p_lock ); > > From yosefe at voltaire.com Mon Dec 18 08:21:21 2006 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 18 Dec 2006 18:21:21 +0200 Subject: [openib-general] [PATCH] ib_sa: Fix kernel Oops caused by ib_sa unload Message-ID: <1166458881.9289.17.camel@muscida> This is a fix to Sean's multicast patches for ofed 1.2. The issuse is described in: http://www.mail-archive.com/openib-general at openib.org/msg27097.html The Oops happened because the multicast work handler was called after the multicast device structure was released. 
It happened because the multicast cleanup function 'mcast_remove_one' waited for work queue completion on only N-1 of the N ports before releasing the device.

The patch applies after Sean's multicast patch series.

---

 multicast.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index a8ff6fa..4e15fd3 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -822,7 +822,7 @@ static void mcast_remove_one(struct ib_d
 	ib_unregister_event_handler(&event_handler);
 	flush_workqueue(mcast_wq);
 
-	for (i = 0; i < dev->end_port - dev->start_port; i++) {
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
 		port = &dev->port[i];
 		deref_port(port);
 		wait_for_completion(&port->comp);

--
Yosef Etigin
yosefe at voltaire.com

From mst at mellanox.co.il Mon Dec 18 08:41:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 18:41:00 +0200
Subject: [openib-general] open-iscsi update for OFED 1.2
In-Reply-To: <457D2569.2000805@voltaire.com>
References: <457D2569.2000805@voltaire.com>
Message-ID: <20061218164100.GB24076@mellanox.co.il>

> In order to create backport patches to a
> specific distro, I need to know where I start from (i.e which kernel
> version).

I have just pulled 2.6.20-rc1 into OFED kernel tree. Most likely, there won't be major API changes before 2.6.21.

So if that's what you were waiting for, go ahead, clone ~vlad/ofed_1_2/.git and start working on the backports. Try to use the kernel_addons infrastructure as much as possible (it's much easier to maintain); where that is not possible you can still use kernel_patches/backports as in OFED 1.1.

At your request, Vlad added checking out iscsi to ~vlad/ofabuild.git, and I expect it does not build on any kernel older than 2.6.20-rc1. Vlad here shall be able to help with any questions on OFED build scripts.
--
MST

From jriotto at cisco.com Mon Dec 18 08:53:10 2006
From: jriotto at cisco.com (Jamie Riotto (jriotto))
Date: Mon, 18 Dec 2006 08:53:10 -0800
Subject: [openib-general] EWG Call Info for Dec 18, 2006
Message-ID: <944AD9DA9232E346ADF590C41BFFEC410325BF9E@xmb-sjc-232.amer.cisco.com>

Date/Time: DEC 4, 2006 at 12:00PM America/New_York
Length: 60
Frequency: 10
Meeting ID: 2106670
Meeting Password:

Global Access Numbers: http://cisco.com/en/US/about/doing_business/conferencing/index.html
US/Canada: +1.866.432.9903
United Kingdom: +44.20.8824.0117
India: +91.80.4103.3979
Germany: +49.619.6773.9002
Japan: +81.3.5763.9394
China: +86.10.8515.5666

From mst at mellanox.co.il Mon Dec 18 08:59:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 18:59:48 +0200
Subject: [openib-general] OFED 1.2 updated to 2.6.20-rc1, UCMA in
Message-ID: <20061218165948.GC24076@mellanox.co.il>

OK, I pulled v2.6.20-rc1 from linus (actually, the immediately following d1998ef38 which fixes a compilation issue in ib_verbs.h) into ofed 1.2 tree. I'll continue to pull with each RC but I expect no more major API changes.

Main things that had to be backported:
- ilog API f0d1b0b30d250a07627ad8b9fbbb5c7cc08422e8
- kmemdup API (actually introduced in 2.6.19: bed8bdfddd851657cf9e5fd16bb44abb02ae7f42)
- work_struct related changes c4028958b6ecad064b1a6303a6a5906d4fe48d73

All backports have been updated to work with that kernel version, so if you have based your development on that, please fetch and rebase as appropriate.

This brings in the UCMA module (sans the multicast patches), so it should be now possible to run userspace that relies on UCMA on OFED 1.2 kernel code.

Wrt multicast: Sean, could you please prepare multicast tree based on 2.6.20-rc1 (+hopefully recent fixes) so I can test that with OFED?
Wrt iser: please note that due to recent decision to include the iscsi module in OFED 1.2, iser can't be built on older kernels until someone (presumably from voltaire) clones ofed_1.2 and looks into backporting iscsi too.

I'll be off Wed/Thursday, so please Cc Vlad on any questions/issues.

Thanks,

--
MST

From eitan at mellanox.co.il Mon Dec 18 11:35:13 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 21:35:13 +0200
Subject: [openib-general] [PATCH] osm: fix bugs related to not passing OSM_SIGNAL_DONE_PENDING
In-Reply-To: <1166458043.32666.188439.camel@hal.voltaire.com>
References: <45846F4C.4080501@mellanox.co.il> <1166458043.32666.188439.camel@hal.voltaire.com>
Message-ID: <4586ED71.6000801@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-16 at 17:12, Eitan Zahavi wrote:
>
>> Hi Hal
>>
>> This set of patches fixes issues of not providing back to state manager
>> OSM_SIGNAL_DONE_PENDING
>> which breaks the state machine later in the sweep.
>>
>> Eitan
>>
>> Signed-off-by: Eitan Zahavi
>>
>> osm/opensm/osm_pkey_mgr.c | 112
>> ++++++++++++++++++++++++++++++++------------
>>
>
> This patch (here and other places) appear to be line wrapped.
>
Sorry about that - I did cut and paste. I will never do that again.
>
>> osm/opensm/osm_state_mgr.c | 11 +++--
>> osm/opensm/osm_ucast_mgr.c | 96 ++++++++++++++++++++++++--------------
>> 4 files changed, 179 insertions(+), 88 deletions(-)
>>
>
> Is this patch 4 files or 3 ? (How was this patch generated ?)
>
I did remove part of the patch since I already sent it separately.
> Is this one patch or should it be 2 or 3 ? It looks to me there is an
> incremental change to osm_state_mgr.c and perhaps 2 other ones which can
> be separate (pkey and ucast_mgr).
>
I probably messed up the patch and can not tell. I will resend this patch after pulling from trunk again.
> Also, see below in osm_state_mgr.c for another minor comment.
> > >> diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c >> index 48837bc..a33aec7 100644 >> --- a/osm/opensm/osm_pkey_mgr.c >> +++ b/osm/opensm/osm_pkey_mgr.c >> @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry( >> >> /********************************************************************** >> **********************************************************************/ >> -static ib_api_status_t >> +static boolean_t >> pkey_mgr_enforce_partition( >> + IN osm_log_t *p_log, >> IN const osm_req_t *p_req, >> IN const osm_physp_t *p_physp, >> IN const boolean_t enforce) >> @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition( >> osm_madw_context_t context; >> uint8_t payload[IB_SMP_DATA_SIZE]; >> ib_port_info_t *p_pi; >> + ib_api_status_t status; >> >> if (!(p_pi = osm_physp_get_port_info_ptr( p_physp ))) >> - return IB_ERROR; >> + { >> + osm_log( p_log, OSM_LOG_ERROR, >> + "pkey_mgr_enforce_partition: ERR 0507: " >> + "No port info for " >> + "node 0x%016" PRIx64 " port %u\n", >> + cl_ntoh64( >> + osm_node_get_node_guid( >> + osm_physp_get_node_ptr( p_physp ))), >> + osm_physp_get_port_num( p_physp ) ); >> + return FALSE; >> + } >> >> - if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) >> - return IB_SUCCESS; >> + if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "pkey_mgr_enforce_partition: " >> + "No need to update PortInfo for " >> + "node 0x%016" PRIx64 " port %u\n", >> + cl_ntoh64( >> + osm_node_get_node_guid( >> + osm_physp_get_node_ptr( p_physp ))), >> + osm_physp_get_port_num( p_physp ) ); >> + return FALSE; >> + } >> >> memset( payload, 0, IB_SMP_DATA_SIZE ); >> memcpy( payload, p_pi, sizeof(ib_port_info_t) ); >> @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition( >> context.pi_context.light_sweep = FALSE; >> context.pi_context.active_transition = FALSE; >> >> - return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), >> - payload, sizeof(payload), >> - IB_MAD_ATTR_PORT_INFO, >> - 
cl_hton32( osm_physp_get_port_num( p_physp ) ), >> - CL_DISP_MSGID_NONE, &context ); >> + status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), >> + payload, sizeof(payload), >> + IB_MAD_ATTR_PORT_INFO, >> + cl_hton32( osm_physp_get_port_num( p_physp ) ), >> + CL_DISP_MSGID_NONE, &context ); >> + if (status != IB_SUCCESS) >> + { >> + osm_log( p_log, OSM_LOG_ERROR, >> + "pkey_mgr_enforce_partition: ERR 0520: " >> + "Failed to set PortInfo for " >> + "node 0x%016" PRIx64 " port %u\n", >> + cl_ntoh64( >> + osm_node_get_node_guid( >> + osm_physp_get_node_ptr( p_physp ))), >> + osm_physp_get_port_num( p_physp ) ); >> + return FALSE; >> + } >> + else >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "pkey_mgr_enforce_partition: " >> + "Set PortInfo for " >> + "node 0x%016" PRIx64 " port %u\n", >> + cl_ntoh64( >> + osm_node_get_node_guid( >> + osm_physp_get_node_ptr( p_physp ))), >> + osm_physp_get_port_num( p_physp ) ); >> + return TRUE; >> + } >> } >> >> /********************************************************************** >> @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port( >> >> status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, >> block_index ); >> if (status == IB_SUCCESS) >> - ret_val = TRUE; >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "pkey_mgr_update_port: " >> + "Updated " >> + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", >> + block_index, >> + cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> + osm_physp_get_port_num( p_physp ) ); >> + ret_val = TRUE; >> + } >> else >> - osm_log( p_log, OSM_LOG_ERROR, >> - "pkey_mgr_update_port: ERR 0506: " >> - "pkey_mgr_update_pkey_entry() failed to update " >> - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", >> - block_index, >> - cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> - osm_physp_get_port_num( p_physp ) ); >> + { >> + osm_log( p_log, OSM_LOG_ERROR, >> + "pkey_mgr_update_port: ERR 0506: " >> + "pkey_mgr_update_pkey_entry() failed to update " >> + 
"pkey table block %d for node 0x%016" PRIx64 " port %u\n", >> + block_index, >> + cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> + osm_physp_get_port_num( p_physp ) ); >> + } >> } >> >> return ret_val; >> @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port( >> uint16_t peer_max_blocks; >> ib_api_status_t status = IB_SUCCESS; >> boolean_t ret_val = FALSE; >> + boolean_t port_info_set = FALSE; >> ib_pkey_table_t empty_block; >> - >> + >> memset(&empty_block, 0, sizeof(ib_pkey_table_t)); >> >> p_physp = osm_port_get_default_phys_ptr( p_port ); >> @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port( >> enforce = FALSE; >> } >> >> - if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) >> - { >> - osm_log( p_log, OSM_LOG_ERROR, >> - "pkey_mgr_update_peer_port: ERR 0507: " >> - "pkey_mgr_enforce_partition() failed to update " >> - "node 0x%016" PRIx64 " port %u\n", >> - cl_ntoh64( osm_node_get_node_guid( p_node ) ), >> - osm_physp_get_port_num( peer ) ); >> - } >> + if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce)) >> + port_info_set = TRUE; >> >> if (enforce == FALSE) >> - return FALSE; >> + return port_info_set; >> >> p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; >> for (block_index = 0; block_index < p_pkey_tbl->used_blocks; >> block_index++) >> @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port( >> osm_physp_get_port_num( peer ) ); >> } >> >> + if (port_info_set) return TRUE; >> return ret_val; >> } >> >> @@ -541,10 +593,10 @@ osm_pkey_mgr_process( >> signal = OSM_SIGNAL_DONE_PENDING; >> p_node = osm_port_get_parent_node( p_port ); >> if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && >> - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, >> + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, >> &p_osm->subn, p_port, >> !p_osm->subn.opt.no_partition_enforcement ) ) >> - signal = OSM_SIGNAL_DONE_PENDING; >> + signal = OSM_SIGNAL_DONE_PENDING; >> } >> >> _err: >> diff --git a/osm/opensm/osm_state_mgr.c 
b/osm/opensm/osm_state_mgr.c >> index 9eac038..4e61259 100644 >> --- a/osm/opensm/osm_state_mgr.c >> +++ b/osm/opensm/osm_state_mgr.c >> @@ -1853,6 +1853,7 @@ osm_state_mgr_process( >> { >> ib_api_status_t status; >> osm_remote_sm_t *p_remote_sm; >> + osm_signal_t tmp_signal; >> >> CL_ASSERT( p_mgr ); >> >> @@ -2075,11 +2076,10 @@ osm_state_mgr_process( >> case OSM_SIGNAL_CHANGE_DETECTED: >> /* >> * Nothing to do here. One subnet change typcially >> - * begets another.... >> + * begets another.... But needs to wait for all transactions >> */ >> signal = OSM_SIGNAL_NONE; >> - break; >> - >> > > This is a repeat of your previous submitted patch to this file so isn't > needed. > > Yes I will resend. > -- Hal > > >> + break; >> case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: >> /* >> * A change was detected on the subnet. >> @@ -2219,7 +2219,10 @@ osm_state_mgr_process( >> signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); >> >> /* the returned signal is always DONE */ >> - signal = osm_qos_setup(p_mgr->p_subn->p_osm); >> + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); >> + >> + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) >> + signal = OSM_SIGNAL_DONE_PENDING; >> >> /* try to restore SA DB (this should be before lid_mgr >> because we may want to disable clients reregistration >> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c >> index e977253..39973de 100644 >> --- a/osm/opensm/osm_ucast_mgr.c >> +++ b/osm/opensm/osm_ucast_mgr.c >> @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table( >> ib_switch_info_t si; >> uint32_t block_id_ho = 0; >> uint8_t block[IB_SMP_DATA_SIZE]; >> + boolean_t set_swinfo_require = FALSE; >> + uint16_t lin_top; >> + uint8_t life_state; >> >> CL_ASSERT( p_mgr ); >> >> @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table( >> Set the top of the unicast forwarding table. 
>> */ >> si = *osm_switch_get_si_ptr( p_sw ); >> - si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); >> + lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); >> + if (si.lin_top != lin_top) >> + { >> + set_swinfo_require = TRUE; >> + si.lin_top = lin_top; >> + } >> >> /* check to see if the change state bit is on. If it is - then we >> need to clear it. */ >> - if( ib_switch_info_get_state_change( &si ) ) >> - si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) >> - | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; >> + if ( ib_switch_info_get_state_change( &si ) ) >> + life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) >> + | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; >> else >> - si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; >> + life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; >> >> - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) >> + if (life_state != si.life_state) >> { >> - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, >> - "osm_ucast_mgr_set_fwd_table: " >> - "Setting switch FT top to LID 0x%X\n", >> - osm_switch_get_max_lid_ho( p_sw ) ); >> + set_swinfo_require = TRUE; >> + si.life_state = life_state; >> } >> - >> - context.si_context.light_sweep = FALSE; >> - context.si_context.node_guid = osm_node_get_node_guid( p_node ); >> - context.si_context.set_method = TRUE; >> - >> - status = osm_req_set( p_mgr->p_req, >> - p_path, >> - (uint8_t*)&si, >> - sizeof(si), >> - IB_MAD_ATTR_SWITCH_INFO, >> - 0, >> - CL_DISP_MSGID_NONE, >> - &context ); >> - >> - if( status != IB_SUCCESS ) >> + >> + if ( set_swinfo_require ) >> { >> - osm_log( p_mgr->p_log, OSM_LOG_ERROR, >> - "osm_ucast_mgr_set_fwd_table: ERR 3A06: " >> - "Sending SwitchInfo attribute failed (%s)\n", >> - ib_get_err_str( status ) ); >> + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) >> + { >> + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, >> + "osm_ucast_mgr_set_fwd_table: " >> + "Setting switch FT top to LID 0x%X\n", >> + 
osm_switch_get_max_lid_ho( p_sw ) ); >> + } >> + >> + context.si_context.light_sweep = FALSE; >> + context.si_context.node_guid = osm_node_get_node_guid( p_node ); >> + context.si_context.set_method = TRUE; >> + >> + status = osm_req_set( p_mgr->p_req, >> + p_path, >> + (uint8_t*)&si, >> + sizeof(si), >> + IB_MAD_ATTR_SWITCH_INFO, >> + 0, >> + CL_DISP_MSGID_NONE, >> + &context ); >> + >> + if( status != IB_SUCCESS ) >> + { >> + osm_log( p_mgr->p_log, OSM_LOG_ERROR, >> + "osm_ucast_mgr_set_fwd_table: ERR 3A06: " >> + "Sending SwitchInfo attribute failed (%s)\n", >> + ib_get_err_str( status ) ); >> + } >> + else >> + p_mgr->any_change = TRUE; >> } >> >> /* >> @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process( >> >> CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); >> >> + p_mgr->any_change = FALSE; >> + >> /* >> If there are no switches in the subnet, we are done. >> */ >> if (cl_qmap_count( p_sw_guid_tbl ) == 0) >> goto Exit; >> >> - p_mgr->any_change = FALSE; >> cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL); >> >> if (!p_routing_eng->build_lid_matrices || >> @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process( >> if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) >> __osm_ucast_mgr_dump_tables( p_mgr ); >> >> - if (p_mgr->any_change) >> + if (p_mgr->any_change) >> + { >> signal = OSM_SIGNAL_DONE_PENDING; >> + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, >> + "osm_ucast_mgr_process: " >> + "LFT Tables configured on all switches\n"); >> + } >> else >> + { >> + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, >> + "osm_ucast_mgr_process: " >> + "No need to set any LFT Tables on all switches\n"); >> signal = OSM_SIGNAL_DONE; >> - >> - osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, >> - "osm_ucast_mgr_process: " >> - "LFT Tables configured on all switches\n"); >> + } >> >> Exit: >> CL_PLOCK_RELEASE( p_mgr->p_lock ); >> >> >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > 
http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Mon Dec 18 11:35:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 18 Dec 2006 21:35:53 +0200 Subject: [openib-general] [PATCH] osm: fix a bug in ignoring pending transaction of Light Sweep In-Reply-To: <1166454599.32666.185925.camel@hal.voltaire.com> References: <45844167.9060302@mellanox.co.il> <1166454599.32666.185925.camel@hal.voltaire.com> Message-ID: <4586ED99.4070908@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Sat, 2006-12-16 at 13:56, Eitan Zahavi wrote: > >> Hi Hal >> >> This patch fixes an issue discovered by the nightly regression. >> OpenSM state machine got stuck due to pending SwitchInfo transaction >> being ignored since one of the queries for SwitchInfo >> failed (due to bad-link). >> The patch below simply avoids aborting the wait for all SwitchInfo >> requests to return. >> >> I think this issue might have hurt us in other situations too since it >> aborted the wait on "CHANGE DETECTED" too. >> CHANGE_DETECTED is fired on the first switch that reported "Change Bit". >> >> It is possible that the issue is showing up as we added incremental >> support (e.g. for routing) >> Since only if there are no other SMP's sent during the heavy sweep we >> will get the >> "NO_PENDING_TRANSACTIONS" signal caused by the SwitchInfo requests >> > > So is the same issue applicable to OFED 1.1 ? > Yes it is. > >> Eitan >> >> Signed-off-by: Eitan Zahavi >> >> osm/opensm/osm_state_mgr.c | 5 ++--- >> 1 files changed, 2 insertions(+), 3 deletions(-) >> > > Thanks. Applied.
> > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Mon Dec 18 12:07:06 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 18 Dec 2006 22:07:06 +0200 Subject: [openib-general] [PATCH TRIVIAL] opensm/autogen.sh: error message fix Message-ID: <20061218200706.GA12834@sashak.voltaire.com> Trivial error message fixes in osm/autogen.sh Signed-off-by: Sasha Khapyorsky --- osm/autogen.sh | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/osm/autogen.sh b/osm/autogen.sh index 6570426..e463c0e 100755 --- a/osm/autogen.sh +++ b/osm/autogen.sh @@ -40,10 +40,10 @@ if [[ $lt_maj -lt 1 ]]; then echo Min libtool version is 1.4.2 exit 1 elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" exit 1 elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" exit 1 fi -- 1.4.4.2.gfc82d From eitan at mellanox.co.il Mon Dec 18 12:15:19 2006 From: eitan at mellanox.co.il (eitan at mellanox.co.il) Date: Mon, 18 Dec 2006 22:15:19 +0200 Subject: [openib-general] [PATCH] osm: state manager return wrong signal Message-ID: <1166472919660-git-send-email-eitan@mellanox.co.il> From: Eitan Zahavi diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 9eac038..94cc095 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1853,6 +1853,7 @@ osm_state_mgr_process( { ib_api_status_t status; osm_remote_sm_t *p_remote_sm; + osm_signal_t tmp_signal; CL_ASSERT( p_mgr ); @@ -2075,11 +2076,10 @@ 
osm_state_mgr_process( case OSM_SIGNAL_CHANGE_DETECTED: /* * Nothing to do here. One subnet change typcially - * begets another.... + * begets another.... But needs to wait for all transactions */ signal = OSM_SIGNAL_NONE; break; - case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: /* * A change was detected on the subnet. @@ -2219,7 +2219,10 @@ osm_state_mgr_process( signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); /* the returned signal is always DONE */ - signal = osm_qos_setup(p_mgr->p_subn->p_osm); + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); + + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) + signal = OSM_SIGNAL_DONE_PENDING; /* try to restore SA DB (this should be before lid_mgr because we may want to disable clients reregistration -- 1.4.4.1.GIT From eitan at mellanox.co.il Mon Dec 18 12:17:54 2006 From: eitan at mellanox.co.il (eitan at mellanox.co.il) Date: Mon, 18 Dec 2006 22:17:54 +0200 Subject: [openib-general] [PATCH] osm: pkey manager returns wrong signal Message-ID: <11664730741410-git-send-email-eitan@mellanox.co.il> Fix cases where the pkey manager returned OSM_SIGNAL_DONE and not OSM_SIGNAL_DONE_PENDING by missing some sent packets --- osm/opensm/osm_pkey_mgr.c | 112 +++++++++++++++++++++++++++++++++------------ 1 files changed, 82 insertions(+), 30 deletions(-) diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 48837bc..a33aec7 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry( /********************************************************************** **********************************************************************/ -static ib_api_status_t +static boolean_t pkey_mgr_enforce_partition( + IN osm_log_t *p_log, IN const osm_req_t *p_req, IN const osm_physp_t *p_physp, IN const boolean_t enforce) @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition( osm_madw_context_t context; uint8_t payload[IB_SMP_DATA_SIZE]; ib_port_info_t *p_pi; + ib_api_status_t status; if 
(!(p_pi = osm_physp_get_port_info_ptr( p_physp ))) - return IB_ERROR; + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_enforce_partition: ERR 0507: " + "No port info for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } - if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) - return IB_SUCCESS; + if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_enforce_partition: " + "No need to update PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } memset( payload, 0, IB_SMP_DATA_SIZE ); memcpy( payload, p_pi, sizeof(ib_port_info_t) ); @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition( context.pi_context.light_sweep = FALSE; context.pi_context.active_transition = FALSE; - return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), - payload, sizeof(payload), - IB_MAD_ATTR_PORT_INFO, - cl_hton32( osm_physp_get_port_num( p_physp ) ), - CL_DISP_MSGID_NONE, &context ); + status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), + payload, sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32( osm_physp_get_port_num( p_physp ) ), + CL_DISP_MSGID_NONE, &context ); + if (status != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_enforce_partition: ERR 0520: " + "Failed to set PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } + else + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_enforce_partition: " + "Set PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return TRUE; + } } 
/********************************************************************** @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port( status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index ); if (status == IB_SUCCESS) - ret_val = TRUE; + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_update_port: " + "Updated " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + ret_val = TRUE; + } else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0506: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p_physp ) ); + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } } return ret_val; @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port( uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; + boolean_t port_info_set = FALSE; ib_pkey_table_t empty_block; - + memset(&empty_block, 0, sizeof(ib_pkey_table_t)); p_physp = osm_port_get_default_phys_ptr( p_port ); @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port( enforce = FALSE; } - if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) - { - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0507: " - "pkey_mgr_enforce_partition() failed to update " - "node 0x%016" PRIx64 " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( peer ) ); - } + if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce)) + port_info_set = TRUE; if (enforce == FALSE) - return FALSE; + return port_info_set; 
p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port( osm_physp_get_port_num( peer ) ); } + if (port_info_set) return TRUE; return ret_val; } @@ -541,10 +593,10 @@ osm_pkey_mgr_process( signal = OSM_SIGNAL_DONE_PENDING; p_node = osm_port_get_parent_node( p_port ); if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) - signal = OSM_SIGNAL_DONE_PENDING; + signal = OSM_SIGNAL_DONE_PENDING; } _err: -- 1.4.4.1.GIT From eitan at mellanox.co.il Mon Dec 18 12:19:34 2006 From: eitan at mellanox.co.il (eitan at mellanox.co.il) Date: Mon, 18 Dec 2006 22:19:34 +0200 Subject: [openib-general] [PATCH] osm: ucast manager return wrong signal Message-ID: <1166473174486-git-send-email-eitan@mellanox.co.il> Fix an issue with not providing SIGNAL_DONE_PENDING in case when SwitchInfo was sent --- osm/opensm/osm_ucast_mgr.c | 96 ++++++++++++++++++++++++++++---------------- 1 files changed, 61 insertions(+), 35 deletions(-) diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index e977253..8cfe09e 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table( ib_switch_info_t si; uint32_t block_id_ho = 0; uint8_t block[IB_SMP_DATA_SIZE]; + boolean_t set_swinfo_require = FALSE; + uint16_t lin_top; + uint8_t life_state; CL_ASSERT( p_mgr ); @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table( Set the top of the unicast forwarding table. 
*/ si = *osm_switch_get_si_ptr( p_sw ); - si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); + lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); + if (si.lin_top != lin_top) + { + set_swinfo_require = TRUE; + si.lin_top = lin_top; + } /* check to see if the change state bit is on. If it is - then we need to clear it. */ - if( ib_switch_info_get_state_change( &si ) ) - si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) - | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; + if ( ib_switch_info_get_state_change( &si ) ) + life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) + | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; else - si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; + life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (life_state != si.life_state) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_set_fwd_table: " - "Setting switch FT top to LID 0x%X\n", - osm_switch_get_max_lid_ho( p_sw ) ); + set_swinfo_require = TRUE; + si.life_state = life_state; } - - context.si_context.light_sweep = FALSE; - context.si_context.node_guid = osm_node_get_node_guid( p_node ); - context.si_context.set_method = TRUE; - - status = osm_req_set( p_mgr->p_req, - p_path, - (uint8_t*)&si, - sizeof(si), - IB_MAD_ATTR_SWITCH_INFO, - 0, - CL_DISP_MSGID_NONE, - &context ); - - if( status != IB_SUCCESS ) + + if ( set_swinfo_require ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_ucast_mgr_set_fwd_table: ERR 3A06: " - "Sending SwitchInfo attribute failed (%s)\n", - ib_get_err_str( status ) ); + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_set_fwd_table: " + "Setting switch FT top to LID 0x%X\n", + osm_switch_get_max_lid_ho( p_sw ) ); + } + + context.si_context.light_sweep = FALSE; + context.si_context.node_guid = osm_node_get_node_guid( p_node ); + context.si_context.set_method = TRUE; 
+ + status = osm_req_set( p_mgr->p_req, + p_path, + (uint8_t*)&si, + sizeof(si), + IB_MAD_ATTR_SWITCH_INFO, + 0, + CL_DISP_MSGID_NONE, + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_ucast_mgr_set_fwd_table: ERR 3A06: " + "Sending SwitchInfo attribute failed (%s)\n", + ib_get_err_str( status ) ); + } + else + p_mgr->any_change = TRUE; } /* @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process( CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + p_mgr->any_change = FALSE; + /* If there are no switches in the subnet, we are done. */ if (cl_qmap_count( p_sw_guid_tbl ) == 0) goto Exit; - p_mgr->any_change = FALSE; cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL); if (!p_routing_eng->build_lid_matrices || @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process( if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) __osm_ucast_mgr_dump_tables( p_mgr ); - if (p_mgr->any_change) + if (p_mgr->any_change) + { signal = OSM_SIGNAL_DONE_PENDING; + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_ucast_mgr_process: " + "LFT Tables configured on all switches\n"); + } else + { + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_ucast_mgr_process: " + "No need to set any LFT Tables on all switches\n"); signal = OSM_SIGNAL_DONE; - - osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_ucast_mgr_process: " - "LFT Tables configured on all switches\n"); + } Exit: CL_PLOCK_RELEASE( p_mgr->p_lock ); -- 1.4.4.1.GIT From sashak at voltaire.com Mon Dec 18 13:18:14 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 18 Dec 2006 23:18:14 +0200 Subject: [openib-general] [PATCH] ibutils: autogen.sh(s) fixes Message-ID: <20061218211814.GC12834@sashak.voltaire.com> Couple of fixes around of tools version detections and verifications (similar to r9976): - regular expression fix - proper version string separation - numeric comparison for extracted version elements - non-zero exit status when old tools are detected - slightly improved condition statements 
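A minimal sketch of two of the problems these fixes address, assuming bash and a sed that handles the bracket expressions the same way as the one used for the patch. The function names (old_minor, new_minor, older_string, older_numeric) are hypothetical helpers for illustration only and do not appear in any autogen.sh. Inside [[ ]] the < operator compares strings, so a minor version of 10 sorts before 9, while -lt compares integers. The old sed expression for the minor version also expects two dots, so it leaves a two-component version such as 1.10 unsplit.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; they are not part of autogen.sh.

# Old extraction: only matches versions with three components (x.y.z).
old_minor() { echo "$1" | sed 's/.*\.\([^\.]*\)\..*/\1/'; }

# New extraction, as in the patch: also handles two components (x.y).
new_minor() { echo "$1" | sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'; }

# '<' inside [[ ]] is a string (lexicographic) comparison.
older_string() { [[ $1 < $2 ]]; }

# '-lt' inside [[ ]] is an integer comparison.
older_numeric() { [[ $1 -lt $2 ]]; }

# Show what each sed expression extracts as the minor version.
for ver in 1.9.2 1.10; do
    echo "$ver: old minor='$(old_minor "$ver")' new minor='$(new_minor "$ver")'"
done

# Compare minor version 10 against the required minor version 9.
if older_string 10 9; then
    echo "string compare: 10 is treated as older than 9"
fi
if ! older_numeric 10 9; then
    echo "numeric compare: 10 is not older than 9"
fi
```

With the numeric operators and the new expression, a 1.10 release is split into major 1 and minor 10 and is no longer rejected by the minor-version check.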
Originally autogen.sh was claiming that automake-1.10 is older than automake-1.9.2 Signed-off-by: Sasha Khapyorsky --- autogen.sh | 57 +++++++++++++++++++++-------------------------- ibdiag/autogen.sh | 61 +++++++++++++++++++++++--------------------------- ibdm/autogen.sh | 61 +++++++++++++++++++++++--------------------------- ibis/autogen.sh | 55 +++++++++++++++++++++------------------------- ibmgtsim/autogen.sh | 55 +++++++++++++++++++++------------------------- 5 files changed, 132 insertions(+), 157 deletions(-) diff --git a/autogen.sh b/autogen.sh index 30727a8..3a560b5 100755 --- a/autogen.sh +++ b/autogen.sh @@ -1,53 +1,48 @@ -#!/bin/bash +#!/bin/bash cd ${0%*/*} # make sure autoconf is up-to-date -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` ac_maj=`echo $ac_ver|sed 's/\..*//'` ac_min=`echo $ac_ver|sed 's/.*\.//'` -if [[ $ac_maj < 2 ]]; then +if [[ $ac_maj -lt 2 ]]; then echo Min autoconf version is 2.59 - exit -fi -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then + exit 1 +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then echo Min autoconf version is 2.59 - exit + exit 1 fi # make sure automake is up-to-date -am_ver=`automake --version | head -1 | awk '{print $NF}'` +am_ver=`automake --version | head -n 1 | awk '{print $NF}'` am_maj=`echo $am_ver|sed 's/\..*//'` -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` -am_sub=`echo $am_ver|sed 's/.*\.//'` -if [[ $am_maj < 1 ]]; then +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'` +if [[ $am_maj -lt 1 ]]; then echo Min automake version is 1.9.2 - exit -fi -if [[ $am_maj = 1 && $am_min < 9 ]]; then + exit 1 +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" - exit -fi -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then + exit 1 +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then echo
"automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" - exit + exit 1 fi # make sure libtool is up-to-date -lt_ver=`libtool --version | head -1 | awk '{print $4}'` +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'` lt_maj=`echo $lt_ver|sed 's/\..*//'` -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` -lt_sub=`echo $lt_ver|sed 's/.*\.//'` -if [[ $lt_maj < 1 ]]; then +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'` +if [[ $lt_maj -lt 1 ]]; then echo Min libtool version is 1.4.2 - exit -fi -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" - exit -fi -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" - exit + exit 1 +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit 1 +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" + exit 1 fi # cleanup diff --git a/ibdiag/autogen.sh b/ibdiag/autogen.sh index 60732a8..0ce2866 100755 --- a/ibdiag/autogen.sh +++ b/ibdiag/autogen.sh @@ -1,57 +1,52 @@ -#!/bin/bash +#!/bin/bash # We change dir since the later utilities assume to work in the project dir cd ${0%*/*} # remove previous -\rm -rf autom4te.cache +\rm -rf autom4te.cache \rm -rf aclocal.m4 # make sure autoconf is up-to-date -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` ac_maj=`echo $ac_ver|sed 's/\..*//'` ac_min=`echo $ac_ver|sed 's/.*\.//'` -if [[ $ac_maj < 2 ]]; then +if [[ $ac_maj -lt 2 ]]; then echo Min autoconf version is 2.59 - exit -fi -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then + exit 1 +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then echo Min autoconf version is 2.59 - exit + exit 1 fi 
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
 fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-
+
 aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
 libtoolize --automake
 automake --add-missing --gnu
diff --git a/ibdm/autogen.sh b/ibdm/autogen.sh
index d8f08d8..51163c9 100755
--- a/ibdm/autogen.sh
+++ b/ibdm/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/bash
+#!/bin/bash

 # We change dir since the later utilities assume to work in the project dir
 cd ${0%*/*}
 # remove previous
-\rm -rf autom4te.cache
+\rm -rf autom4te.cache
 \rm -rf aclocal.m4
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then
+if [[ $ac_maj -lt 2 ]]; then
     echo Min autoconf version is 2.59
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo Min autoconf version is 2.59
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
 fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-
+
 aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
 libtoolize --automake --copy
 automake --add-missing --gnu --copy
diff --git a/ibis/autogen.sh b/ibis/autogen.sh
index f3ed611..ae545b5 100755
--- a/ibis/autogen.sh
+++ b/ibis/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/sh
+#!/bin/sh

 cd ${0%*/*}
 \rm -rf autom4te.cache
 \rm -rf aclocal.m4
 \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then
+if [[ $ac_maj -lt 2 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
+    exit 1
 fi

 aclocal -I config 2>&1 | grep -v "arning: underquoted definition of"
-libtoolize --automake --copy
+libtoolize --automake --copy
 automake --add-missing --gnu --copy --force
 autoconf
diff --git a/ibmgtsim/autogen.sh b/ibmgtsim/autogen.sh
index 456c203..e48b0ac 100755
--- a/ibmgtsim/autogen.sh
+++ b/ibmgtsim/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/sh
+#!/bin/sh

 cd ${0%*/*}
 \rm -rf autom4te.cache
 \rm -rf aclocal.m4
 \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then
+if [[ $ac_maj -lt 2 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
+    exit 1
 fi

 aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
-libtoolize --automake --copy --force
+libtoolize --automake --copy --force
 automake --add-missing --copy --gnu --force
 autoconf
-- 
1.4.4.2.gfc82d

From eitan at mellanox.co.il Mon Dec 18 13:24:20 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 23:24:20 +0200
Subject: [openib-general] [PATCH] ibutils: autogen.sh(s) fixes
In-Reply-To: <20061218211814.GC12834@sashak.voltaire.com>
References: <20061218211814.GC12834@sashak.voltaire.com>
Message-ID: <45870704.2000003@mellanox.co.il>

Thanks
Applied.

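For readers following the diff above, here is a minimal sketch (not part of the patch) of why the switch from `<` to `-lt` matters: inside bash's `[[ ]]`, `<` compares strings, so a minor version of `10` sorts before `9`, while `-lt` compares integers. The helper names `ver_major`, `ver_minor`, `ver_sub`, `is_older_string` and `is_older_numeric` are hypothetical wrappers introduced only for this illustration; the sed expressions and the two comparison operators are the ones that appear in the diff.

```shell
#!/bin/bash
# Illustrative sketch only; the helper names below are hypothetical.
# The sed expressions are copied from the "+" lines of the diff.
ver_major() { echo "$1" | sed 's/\..*//'; }
ver_minor() { echo "$1" | sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'; }
ver_sub()   { echo "$1" | sed 's/[^\.]*\.[^\.]*\.*//'; }

# Pre-patch style: "<" inside [[ ]] is a string comparison.
is_older_string()  { [[ $1 < $2 ]]; }
# Post-patch style: -lt is an integer comparison.
is_older_numeric() { [[ $1 -lt $2 ]]; }

# Compare the minor component of two sample version strings against 9.
for ver in 1.9.2 1.10; do
    min=$(ver_minor "$ver")
    if is_older_string "$min" 9; then s=yes; else s=no; fi
    if is_older_numeric "$min" 9; then n=yes; else n=no; fi
    echo "automake $ver: minor=$min older-than-9? string=$s numeric=$n"
done
```

Only the string comparison reports `1.10` as older than minor version 9. For a two-component version such as `1.10`, the `ver_sub` expression leaves an empty sub-version, since there is no third component to extract.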
Sasha Khapyorsky wrote: > Couple of fixes around of tools version detections and verifications > (similar to r9976): > - regular expression fix - proper version string separation > - numeric comparison for extracted version elements > - non-zero exit status when old tools are detected > - slightly improved condition statements > > Originally autogen.sh was claiming that automake-1.10 is older that > automake-1.9.2 > > Signed-off-by: Sasha Khapyorsky > --- > autogen.sh | 57 +++++++++++++++++++++-------------------------- > ibdiag/autogen.sh | 61 +++++++++++++++++++++++--------------------------- > ibdm/autogen.sh | 61 +++++++++++++++++++++++--------------------------- > ibis/autogen.sh | 55 +++++++++++++++++++++------------------------- > ibmgtsim/autogen.sh | 55 +++++++++++++++++++++------------------------- > 5 files changed, 132 insertions(+), 157 deletions(-) > > diff --git a/autogen.sh b/autogen.sh > index 30727a8..3a560b5 100755 > --- a/autogen.sh > +++ b/autogen.sh > @@ -1,53 +1,48 @@ > -#!/bin/bash > +#!/bin/bash > cd ${0%*/*} > > # make sure autoconf is up-to-date > -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` > +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` > ac_maj=`echo $ac_ver|sed 's/\..*//'` > ac_min=`echo $ac_ver|sed 's/.*\.//'` > -if [[ $ac_maj < 2 ]]; then > +if [[ $ac_maj -lt 2 ]]; then > echo Min autoconf version is 2.59 > - exit > -fi > -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then > + exit 1 > +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then > echo Min autoconf version is 2.59 > - exit > + exit 1 > fi > > # make sure automake is up-to-date > -am_ver=`automake --version | head -1 | awk '{print $NF}'` > +am_ver=`automake --version | head -n 1 | awk '{print $NF}'` > am_maj=`echo $am_ver|sed 's/\..*//'` > -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -am_sub=`echo $am_ver|sed 's/.*\.//'` > -if [[ $am_maj < 1 ]]; then > +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +am_sub=`echo $am_ver|sed 
's/[^\.]*\.[^\.]*\.*//'` > +if [[ $am_maj -lt 1 ]]; then > echo Min automake version is 1.9.2 > - exit > -fi > -if [[ $am_maj = 1 && $am_min < 9 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > -fi > -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > + exit 1 > fi > > # make sure libtool is up-to-date > -lt_ver=`libtool --version | head -1 | awk '{print $4}'` > +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'` > lt_maj=`echo $lt_ver|sed 's/\..*//'` > -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -lt_sub=`echo $lt_ver|sed 's/.*\.//'` > -if [[ $lt_maj < 1 ]]; then > +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $lt_maj -lt 1 ]]; then > echo Min libtool version is 1.4.2 > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then > - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then > - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then > + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then > + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > + exit 1 > fi > > # cleanup > diff --git a/ibdiag/autogen.sh b/ibdiag/autogen.sh > index 60732a8..0ce2866 100755 > --- a/ibdiag/autogen.sh > +++ b/ibdiag/autogen.sh > @@ -1,57 +1,52 @@ > -#!/bin/bash > +#!/bin/bash > > # We change dir since the later utilities assume to work in the project dir > cd ${0%*/*} > # remove previous > -\rm 
-rf autom4te.cache > +\rm -rf autom4te.cache > \rm -rf aclocal.m4 > # make sure autoconf is up-to-date > -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` > +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` > ac_maj=`echo $ac_ver|sed 's/\..*//'` > ac_min=`echo $ac_ver|sed 's/.*\.//'` > -if [[ $ac_maj < 2 ]]; then > +if [[ $ac_maj -lt 2 ]]; then > echo Min autoconf version is 2.59 > - exit > -fi > -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then > + exit 1 > +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then > echo Min autoconf version is 2.59 > - exit > + exit 1 > fi > # make sure automake is up-to-date > -am_ver=`automake --version | head -1 | awk '{print $NF}'` > +am_ver=`automake --version | head -n 1 | awk '{print $NF}'` > am_maj=`echo $am_ver|sed 's/\..*//'` > -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -am_sub=`echo $am_ver|sed 's/.*\.//'` > -if [[ $am_maj < 1 ]]; then > +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $am_maj -lt 1 ]]; then > echo Min automake version is 1.9.2 > - exit > -fi > -if [[ $am_maj = 1 && $am_min < 9 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > -fi > -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > + exit 1 > fi > # make sure libtool is up-to-date > -lt_ver=`libtool --version | head -1 | awk '{print $4}'` > +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'` > lt_maj=`echo $lt_ver|sed 's/\..*//'` > -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -lt_sub=`echo $lt_ver|sed 's/.*\.//'` > -if [[ $lt_maj < 1 ]]; then > +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ 
$lt_maj -lt 1 ]]; then > echo Min libtool version is 1.4.2 > - exit > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then > + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then > + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > + exit 1 > fi > -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then > - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then > - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > - > + > aclocal -I config 2>&1 | grep -v "warning: underquoted definition " > libtoolize --automake > automake --add-missing --gnu > diff --git a/ibdm/autogen.sh b/ibdm/autogen.sh > index d8f08d8..51163c9 100755 > --- a/ibdm/autogen.sh > +++ b/ibdm/autogen.sh > @@ -1,57 +1,52 @@ > -#!/bin/bash > +#!/bin/bash > > # We change dir since the later utilities assume to work in the project dir > cd ${0%*/*} > # remove previous > -\rm -rf autom4te.cache > +\rm -rf autom4te.cache > \rm -rf aclocal.m4 > # make sure autoconf is up-to-date > -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` > +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` > ac_maj=`echo $ac_ver|sed 's/\..*//'` > ac_min=`echo $ac_ver|sed 's/.*\.//'` > -if [[ $ac_maj < 2 ]]; then > +if [[ $ac_maj -lt 2 ]]; then > echo Min autoconf version is 2.59 > - exit > -fi > -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then > + exit 1 > +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then > echo Min autoconf version is 2.59 > - exit > + exit 1 > fi > # make sure automake is up-to-date > -am_ver=`automake --version | head -1 | awk '{print $NF}'` > +am_ver=`automake --version | head -n 1 | awk '{print $NF}'` > am_maj=`echo $am_ver|sed 's/\..*//'` > -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -am_sub=`echo $am_ver|sed 's/.*\.//'` > 
-if [[ $am_maj < 1 ]]; then > +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $am_maj -lt 1 ]]; then > echo Min automake version is 1.9.2 > - exit > -fi > -if [[ $am_maj = 1 && $am_min < 9 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > -fi > -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > + exit 1 > fi > # make sure libtool is up-to-date > -lt_ver=`libtool --version | head -1 | awk '{print $4}'` > +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'` > lt_maj=`echo $lt_ver|sed 's/\..*//'` > -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -lt_sub=`echo $lt_ver|sed 's/.*\.//'` > -if [[ $lt_maj < 1 ]]; then > +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $lt_maj -lt 1 ]]; then > echo Min libtool version is 1.4.2 > - exit > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then > + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then > + echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > + exit 1 > fi > -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then > - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then > - echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > - > + > aclocal -I config 2>&1 | grep -v "warning: underquoted definition " > libtoolize --automake --copy > automake --add-missing --gnu --copy > diff --git a/ibis/autogen.sh b/ibis/autogen.sh > index 
f3ed611..ae545b5 100755 > --- a/ibis/autogen.sh > +++ b/ibis/autogen.sh > @@ -1,57 +1,52 @@ > -#!/bin/sh > +#!/bin/sh > > cd ${0%*/*} > \rm -rf autom4te.cache > \rm -rf aclocal.m4 > \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess > # make sure autoconf is up-to-date > -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` > +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` > ac_maj=`echo $ac_ver|sed 's/\..*//'` > ac_min=`echo $ac_ver|sed 's/.*\.//'` > -if [[ $ac_maj < 2 ]]; then > +if [[ $ac_maj -lt 2 ]]; then > echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59" > - exit > -fi > -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then > + exit 1 > +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then > echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59" > - exit > + exit 1 > fi > # make sure automake is up-to-date > -am_ver=`automake --version | head -1 | awk '{print $NF}'` > +am_ver=`automake --version | head -n 1 | awk '{print $NF}'` > am_maj=`echo $am_ver|sed 's/\..*//'` > -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -am_sub=`echo $am_ver|sed 's/.*\.//'` > -if [[ $am_maj < 1 ]]; then > +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $am_maj -lt 1 ]]; then > echo Min automake version is 1.9.2 > - exit > -fi > -if [[ $am_maj = 1 && $am_min < 9 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > -fi > -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > + exit 1 > fi > # make sure libtool is up-to-date > -lt_ver=`libtool --version | head -1 | awk '{print $4}'` > +lt_ver=`libtool --version | head 
-n 1 | awk '{print $4}'` > lt_maj=`echo $lt_ver|sed 's/\..*//'` > -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -lt_sub=`echo $lt_ver|sed 's/.*\.//'` > -if [[ $lt_maj < 1 ]]; then > +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $lt_maj -lt 1 ]]; then > echo Min libtool version is 1.4.2 > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then > echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then > echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > + exit 1 > fi > > aclocal -I config 2>&1 | grep -v "arning: underquoted definition of" > -libtoolize --automake --copy > +libtoolize --automake --copy > automake --add-missing --gnu --copy --force > autoconf > diff --git a/ibmgtsim/autogen.sh b/ibmgtsim/autogen.sh > index 456c203..e48b0ac 100755 > --- a/ibmgtsim/autogen.sh > +++ b/ibmgtsim/autogen.sh > @@ -1,57 +1,52 @@ > -#!/bin/sh > +#!/bin/sh > > cd ${0%*/*} > \rm -rf autom4te.cache > \rm -rf aclocal.m4 > \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess > # make sure autoconf is up-to-date > -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'` > +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'` > ac_maj=`echo $ac_ver|sed 's/\..*//'` > ac_min=`echo $ac_ver|sed 's/.*\.//'` > -if [[ $ac_maj < 2 ]]; then > +if [[ $ac_maj -lt 2 ]]; then > echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59" > - exit > -fi > -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then > + exit 1 > +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then > echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59" > - exit > + exit 1 > fi > 
# make sure automake is up-to-date > -am_ver=`automake --version | head -1 | awk '{print $NF}'` > +am_ver=`automake --version | head -n 1 | awk '{print $NF}'` > am_maj=`echo $am_ver|sed 's/\..*//'` > -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -am_sub=`echo $am_ver|sed 's/.*\.//'` > -if [[ $am_maj < 1 ]]; then > +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $am_maj -lt 1 ]]; then > echo Min automake version is 1.9.2 > - exit > -fi > -if [[ $am_maj = 1 && $am_min < 9 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > -fi > -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then > + exit 1 > +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then > echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2" > - exit > + exit 1 > fi > # make sure libtool is up-to-date > -lt_ver=`libtool --version | head -1 | awk '{print $4}'` > +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'` > lt_maj=`echo $lt_ver|sed 's/\..*//'` > -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'` > -lt_sub=`echo $lt_ver|sed 's/.*\.//'` > -if [[ $lt_maj < 1 ]]; then > +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'` > +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'` > +if [[ $lt_maj -lt 1 ]]; then > echo Min libtool version is 1.4.2 > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then > echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > -fi > -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then > + exit 1 > +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then > echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2" > - exit > + exit 1 > fi > > aclocal -I config 2>&1 | grep -v "warning: underquoted definition " > 
-libtoolize --automake --copy --force > +libtoolize --automake --copy --force > automake --add-missing --copy --gnu --force > autoconf > From eitan at mellanox.co.il Mon Dec 18 13:38:17 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 18 Dec 2006 23:38:17 +0200 Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail to report back correct signal Message-ID: <45870A49.1070205@mellanox.co.il> Hi Hal, This is a resend as I did not see a bounce of the list of the previous posting I did using git-send-email (probably due to a miss use). The following patch fixes bugs in the ucast manager and pkey manager such that they do not report correct signal back. In both cases some some outstanding SubnSet were ignored. Signed-off-by: Eitan Zahavi -------------------------------------------------------------------------------------------- diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index 48837bc..a33aec7 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry( /********************************************************************** **********************************************************************/ -static ib_api_status_t +static boolean_t pkey_mgr_enforce_partition( + IN osm_log_t *p_log, IN const osm_req_t *p_req, IN const osm_physp_t *p_physp, IN const boolean_t enforce) @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition( osm_madw_context_t context; uint8_t payload[IB_SMP_DATA_SIZE]; ib_port_info_t *p_pi; + ib_api_status_t status; if (!(p_pi = osm_physp_get_port_info_ptr( p_physp ))) - return IB_ERROR; + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_enforce_partition: ERR 0507: " + "No port info for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } - if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) - return IB_SUCCESS; + if 
((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_enforce_partition: " + "No need to update PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } memset( payload, 0, IB_SMP_DATA_SIZE ); memcpy( payload, p_pi, sizeof(ib_port_info_t) ); @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition( context.pi_context.light_sweep = FALSE; context.pi_context.active_transition = FALSE; - return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), - payload, sizeof(payload), - IB_MAD_ATTR_PORT_INFO, - cl_hton32( osm_physp_get_port_num( p_physp ) ), - CL_DISP_MSGID_NONE, &context ); + status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), + payload, sizeof(payload), + IB_MAD_ATTR_PORT_INFO, + cl_hton32( osm_physp_get_port_num( p_physp ) ), + CL_DISP_MSGID_NONE, &context ); + if (status != IB_SUCCESS) + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_enforce_partition: ERR 0520: " + "Failed to set PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return FALSE; + } + else + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_enforce_partition: " + "Set PortInfo for " + "node 0x%016" PRIx64 " port %u\n", + cl_ntoh64( + osm_node_get_node_guid( + osm_physp_get_node_ptr( p_physp ))), + osm_physp_get_port_num( p_physp ) ); + return TRUE; + } } /********************************************************************** @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port( status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index ); if (status == IB_SUCCESS) - ret_val = TRUE; + { + osm_log( p_log, OSM_LOG_DEBUG, + "pkey_mgr_update_port: " + "Updated " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( 
osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + ret_val = TRUE; + } else - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_port: ERR 0506: " - "pkey_mgr_update_pkey_entry() failed to update " - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", - block_index, - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( p_physp ) ); + { + osm_log( p_log, OSM_LOG_ERROR, + "pkey_mgr_update_port: ERR 0506: " + "pkey_mgr_update_pkey_entry() failed to update " + "pkey table block %d for node 0x%016" PRIx64 " port %u\n", + block_index, + cl_ntoh64( osm_node_get_node_guid( p_node ) ), + osm_physp_get_port_num( p_physp ) ); + } } return ret_val; @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port( uint16_t peer_max_blocks; ib_api_status_t status = IB_SUCCESS; boolean_t ret_val = FALSE; + boolean_t port_info_set = FALSE; ib_pkey_table_t empty_block; - + memset(&empty_block, 0, sizeof(ib_pkey_table_t)); p_physp = osm_port_get_default_phys_ptr( p_port ); @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port( enforce = FALSE; } - if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) - { - osm_log( p_log, OSM_LOG_ERROR, - "pkey_mgr_update_peer_port: ERR 0507: " - "pkey_mgr_enforce_partition() failed to update " - "node 0x%016" PRIx64 " port %u\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ), - osm_physp_get_port_num( peer ) ); - } + if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce)) + port_info_set = TRUE; if (enforce == FALSE) - return FALSE; + return port_info_set; p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++) @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port( osm_physp_get_port_num( peer ) ); } + if (port_info_set) return TRUE; return ret_val; } @@ -541,10 +593,10 @@ osm_pkey_mgr_process( signal = OSM_SIGNAL_DONE_PENDING; p_node = osm_port_get_parent_node( p_port ); if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) 
&& - pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, + pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, &p_osm->subn, p_port, !p_osm->subn.opt.no_partition_enforcement ) ) - signal = OSM_SIGNAL_DONE_PENDING; + signal = OSM_SIGNAL_DONE_PENDING; } _err: diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index e977253..8cfe09e 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table( ib_switch_info_t si; uint32_t block_id_ho = 0; uint8_t block[IB_SMP_DATA_SIZE]; + boolean_t set_swinfo_require = FALSE; + uint16_t lin_top; + uint8_t life_state; CL_ASSERT( p_mgr ); @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table( Set the top of the unicast forwarding table. */ si = *osm_switch_get_si_ptr( p_sw ); - si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); + lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) ); + if (si.lin_top != lin_top) + { + set_swinfo_require = TRUE; + si.lin_top = lin_top; + } /* check to see if the change state bit is on. If it is - then we need to clear it. 
*/ - if( ib_switch_info_get_state_change( &si ) ) - si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) - | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; + if ( ib_switch_info_get_state_change( &si ) ) + life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 ) + | ( si.life_state & IB_SWITCH_PSC ) ) & 0xfc; else - si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; + life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; - if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + if (life_state != si.life_state) { - osm_log( p_mgr->p_log, OSM_LOG_DEBUG, - "osm_ucast_mgr_set_fwd_table: " - "Setting switch FT top to LID 0x%X\n", - osm_switch_get_max_lid_ho( p_sw ) ); + set_swinfo_require = TRUE; + si.life_state = life_state; } - - context.si_context.light_sweep = FALSE; - context.si_context.node_guid = osm_node_get_node_guid( p_node ); - context.si_context.set_method = TRUE; - - status = osm_req_set( p_mgr->p_req, - p_path, - (uint8_t*)&si, - sizeof(si), - IB_MAD_ATTR_SWITCH_INFO, - 0, - CL_DISP_MSGID_NONE, - &context ); - - if( status != IB_SUCCESS ) + + if ( set_swinfo_require ) { - osm_log( p_mgr->p_log, OSM_LOG_ERROR, - "osm_ucast_mgr_set_fwd_table: ERR 3A06: " - "Sending SwitchInfo attribute failed (%s)\n", - ib_get_err_str( status ) ); + if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_mgr->p_log, OSM_LOG_DEBUG, + "osm_ucast_mgr_set_fwd_table: " + "Setting switch FT top to LID 0x%X\n", + osm_switch_get_max_lid_ho( p_sw ) ); + } + + context.si_context.light_sweep = FALSE; + context.si_context.node_guid = osm_node_get_node_guid( p_node ); + context.si_context.set_method = TRUE; + + status = osm_req_set( p_mgr->p_req, + p_path, + (uint8_t*)&si, + sizeof(si), + IB_MAD_ATTR_SWITCH_INFO, + 0, + CL_DISP_MSGID_NONE, + &context ); + + if( status != IB_SUCCESS ) + { + osm_log( p_mgr->p_log, OSM_LOG_ERROR, + "osm_ucast_mgr_set_fwd_table: ERR 3A06: " + "Sending SwitchInfo attribute failed (%s)\n", + ib_get_err_str( 
status ) ); + } + else + p_mgr->any_change = TRUE; } /* @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process( CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock ); + p_mgr->any_change = FALSE; + /* If there are no switches in the subnet, we are done. */ if (cl_qmap_count( p_sw_guid_tbl ) == 0) goto Exit; - p_mgr->any_change = FALSE; cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL); if (!p_routing_eng->build_lid_matrices || @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process( if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) __osm_ucast_mgr_dump_tables( p_mgr ); - if (p_mgr->any_change) + if (p_mgr->any_change) + { signal = OSM_SIGNAL_DONE_PENDING; + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_ucast_mgr_process: " + "LFT Tables configured on all switches\n"); + } else + { + osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, + "osm_ucast_mgr_process: " + "No need to set any LFT Tables on all switches\n"); signal = OSM_SIGNAL_DONE; - - osm_log(p_mgr->p_log, OSM_LOG_VERBOSE, - "osm_ucast_mgr_process: " - "LFT Tables configured on all switches\n"); + } Exit: CL_PLOCK_RELEASE( p_mgr->p_lock ); -- 1.4.4.1.GIT From eitan at mellanox.co.il Mon Dec 18 13:43:39 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 18 Dec 2006 23:43:39 +0200 Subject: [openib-general] [PATCH] osm: state manager ignores some outstanding transaction Message-ID: <45870B8B.9010002@mellanox.co.il> Hi Hal, This is a resend as I did not see a bounce of the list of the previous posting I did using git-send-email (probably due to a miss use). The following patch fixes bugs in the state manager: Both in light sweep and pkey assignment states the state manager could ignore outstanding SMPs (reported back by the managers) and continue to next stage. When these SMPs do complete it causes failures of further steps which receives the NO_PENDING_TRANSACTIONS signal when it is not expected. 
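As a rough illustration of the rule described above (not the OpenSM code itself), the sketch below merges per-stage results so that a pending result from an earlier stage is never overwritten by a later stage that reports plain DONE. The enum and function names are hypothetical stand-ins for osm_signal_t, OSM_SIGNAL_DONE and OSM_SIGNAL_DONE_PENDING; only the merging rule is taken from the description.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two OpenSM signal values named in the patch. */
typedef enum {
    SKETCH_SIGNAL_DONE,         /* stage finished, nothing outstanding */
    SKETCH_SIGNAL_DONE_PENDING  /* stage finished, SMPs still outstanding */
} sketch_signal_t;

/*
 * Merge the result of a later stage into the signal accumulated so far.
 * Once any stage has reported DONE_PENDING, the combined result stays
 * pending, so the caller keeps waiting for the outstanding transactions.
 */
static sketch_signal_t sketch_merge_signal(sketch_signal_t so_far,
                                           sketch_signal_t next)
{
    assert(so_far == SKETCH_SIGNAL_DONE ||
           so_far == SKETCH_SIGNAL_DONE_PENDING);
    assert(next == SKETCH_SIGNAL_DONE ||
           next == SKETCH_SIGNAL_DONE_PENDING);

    if (next == SKETCH_SIGNAL_DONE_PENDING)
        return SKETCH_SIGNAL_DONE_PENDING;
    return so_far;
}
```

With this rule, a pkey stage that returned DONE_PENDING followed by a QoS stage that returned DONE still yields DONE_PENDING.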
Signed-off-by: Eitan Zahavi --- osm/opensm/osm_state_mgr.c | 9 ++++++--- 1 files changed, 6 insertions(+), 3 deletions(-) diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 9eac038..94cc095 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1853,6 +1853,7 @@ osm_state_mgr_process( { ib_api_status_t status; osm_remote_sm_t *p_remote_sm; + osm_signal_t tmp_signal; CL_ASSERT( p_mgr ); @@ -2075,11 +2076,10 @@ osm_state_mgr_process( case OSM_SIGNAL_CHANGE_DETECTED: /* * Nothing to do here. One subnet change typcially - * begets another.... + * begets another.... But needs to wait for all transactions */ signal = OSM_SIGNAL_NONE; break; - case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: /* * A change was detected on the subnet. @@ -2219,7 +2219,10 @@ osm_state_mgr_process( signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); /* the returned signal is always DONE */ - signal = osm_qos_setup(p_mgr->p_subn->p_osm); + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); + + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) + signal = OSM_SIGNAL_DONE_PENDING; /* try to restore SA DB (this should be before lid_mgr because we may want to disable clients reregistration -- 1.4.4.1.GIT From halr at voltaire.com Mon Dec 18 13:43:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2006 16:43:41 -0500 Subject: [openib-general] [PATCH TRIVIAL] opensm/autogen.sh: error message fix In-Reply-To: <20061218200706.GA12834@sashak.voltaire.com> References: <20061218200706.GA12834@sashak.voltaire.com> Message-ID: <1166478195.32666.203147.camel@hal.voltaire.com> On Mon, 2006-12-18 at 15:07, Sasha Khapyorsky wrote: > Trivial error message fixes in osm/autogen.sh > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From halr at voltaire.com Mon Dec 18 13:47:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2006 16:47:47 -0500 Subject: [openib-general] [PATCH] osm: state manager return wrong signal In-Reply-To: <1166472919660-git-send-email-eitan@mellanox.co.il> References: <1166472919660-git-send-email-eitan@mellanox.co.il> Message-ID: <1166478410.32666.203255.camel@hal.voltaire.com> On Mon, 2006-12-18 at 15:15, eitan at mellanox.co.il wrote: > From: Eitan Zahavi See below comments. > diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c > index 9eac038..94cc095 100644 > --- a/osm/opensm/osm_state_mgr.c > +++ b/osm/opensm/osm_state_mgr.c > @@ -1853,6 +1853,7 @@ osm_state_mgr_process( > { > ib_api_status_t status; > osm_remote_sm_t *p_remote_sm; > + osm_signal_t tmp_signal; > > CL_ASSERT( p_mgr ); > > @@ -2075,11 +2076,10 @@ osm_state_mgr_process( > case OSM_SIGNAL_CHANGE_DETECTED: > /* > * Nothing to do here. One subnet change typcially > - * begets another.... > + * begets another.... But needs to wait for all transactions This was already done as part of your original osm_state_mgr.c patch. > */ > signal = OSM_SIGNAL_NONE; This was eliminated as part of your original osm_state_mgr.c patch. Should it be there ? If so, this isn't indicated as a +. -- Hal > break; > - > case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: > /* > * A change was detected on the subnet. 
> @@ -2219,7 +2219,10 @@ osm_state_mgr_process( > signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); > > /* the returned signal is always DONE */ > - signal = osm_qos_setup(p_mgr->p_subn->p_osm); > + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); > + > + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) > + signal = OSM_SIGNAL_DONE_PENDING; > > /* try to restore SA DB (this should be before lid_mgr > because we may want to disable clients reregistration From eitan at mellanox.co.il Mon Dec 18 13:55:58 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Mon, 18 Dec 2006 23:55:58 +0200 Subject: [openib-general] [PATCH] osm: state manager return wrong signal Message-ID: <6C2C79E72C305246B504CBA17B5500C980BFED@mtlexch01.mtl.com> Hi Hal, The discrepancies are due to my lack of git practice. I do not know why these lines got back in. The following line should not be there: > > */ > > signal = OSM_SIGNAL_NONE; > > This was eliminated as part of your original osm_state_mgr.c patch. > Should it be there ? If so, this isn't indicated as a +. > > -- Hal > > > break; > > - > > case OSM_SIGNAL_NO_PENDING_TRANSACTIONS: > > /* > > * A change was detected on the subnet. 
> > @@ -2219,7 +2219,10 @@ osm_state_mgr_process( > > signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm ); > > > > /* the returned signal is always DONE */ > > - signal = osm_qos_setup(p_mgr->p_subn->p_osm); > > + tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm); > > + > > + if (tmp_signal == OSM_SIGNAL_DONE_PENDING) > > + signal = OSM_SIGNAL_DONE_PENDING; > > > > /* try to restore SA DB (this should be before lid_mgr > > because we may want to disable clients reregistration From halr at voltaire.com Mon Dec 18 14:49:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Dec 2006 17:49:47 -0500 Subject: [openib-general] [PATCH] osm: state manager ignores some outstanding transaction In-Reply-To: <45870B8B.9010002@mellanox.co.il> References: <45870B8B.9010002@mellanox.co.il> Message-ID: <1166482096.32666.205256.camel@hal.voltaire.com> Hi Eitan, On Mon, 2006-12-18 at 16:43, Eitan Zahavi wrote: > Hi Hal, > > This is a resend as I did not see a bounce of the list of the previous > posting I did using git-send-email (probably due to a miss use). > > The following patch fixes bugs in the state manager: > Both in light sweep and pkey assignment states the state manager could ignore > outstanding SMPs (reported back by the managers) and continue to next stage. > When these SMPs do complete it causes failures of further steps which receives > the NO_PENDING_TRANSACTIONS signal when it is not expected. > > Signed-off-by: Eitan Zahavi Thanks. Applied. Due to the confusion, please double check the result. -- Hal From kliteyn at mellanox.co.il Mon Dec 18 15:33:05 2006 From: kliteyn at mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 01:33:05 +0200 Subject: [openib-general] OSM: Using lid matrices in ucast manager Message-ID: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> Hi Hal. 
I have a question about a patch that I want to send regarding lid matrix usage in the osm ucast manager:

The FatTree routing doesn't use the min hop tables, so we can skip the lid matrices building in OSM. However, the ucast manager also uses these lid matrices to get the max lid that is accessible from each switch, which defines the LFT table size. This max lid is obtained by calling the osm_switch_get_max_lid_ho() function, which in turn calls osm_lid_matrix_get_max_lid_ho() for the switch's lid matrix. If the lid matrices weren't built, then osm_switch_get_max_lid_ho() will return 0xFFFF, and eventually osm will crash.

Of course, I don't want to build all the lid matrices just to know the max lid, so here's what I've done:

* I added a field to the osm_switch_t object: max_lid_ho (with default value 0xFFFF, should it be 0x0 instead?).
* Added three osm_switch_t methods for this new field: getter, setter, and is_set, which returns true if this field has been set.
* The original osm_switch_get_max_lid_ho() has been updated to return this field value if it's set.
* Then in FatTree routing I set this field for each switch (I get the max lid 'for free' as a byproduct of the algorithm).
* Now everything in the ucast manager works fine, except for the following two dump functions:
    __osm_ucast_mgr_dump_ucast_routes (it uses hops)
    ucast_mgr_dump_lid_matrix (obviously...)
  These two functions check at the beginning whether max_lid_ho was set (using the 'is_set' method), and return without printing anything if the answer is yes.

This way any other routing engine that uses the lid matrix is not affected by this change, and any routing engine that doesn't use the lid matrix has a way to set the max lid per switch explicitly.

This approach works great, but I have a feeling that this is kind of a hack...

What do you think about this solution? Any other suggestions?

Anyway, just wanted to hear your opinion before sending the patch.
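The scheme described above could look roughly like the following sketch. It is not the actual patch: the struct is a hypothetical, trimmed-down stand-in for osm_switch_t, the matrix_max_lid_ho member merely stands in for whatever the lid matrix would report, and 0 is assumed as the "not set" marker (one of the two defaults the message asks about).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down switch object for illustration only. */
typedef struct sketch_switch {
    uint16_t max_lid_ho;        /* 0 is assumed to mean "not set" */
    uint16_t matrix_max_lid_ho; /* stand-in for the lid matrix result */
} sketch_switch_t;

/* Setter: a routing engine that skips the lid matrices records the max lid. */
static void sketch_switch_set_max_lid_ho(sketch_switch_t *sw, uint16_t lid_ho)
{
    assert(sw != NULL);
    sw->max_lid_ho = lid_ho;
}

/* is_set: non-zero when the routing engine supplied the value explicitly. */
static int sketch_switch_max_lid_is_set(const sketch_switch_t *sw)
{
    assert(sw != NULL);
    return sw->max_lid_ho != 0;
}

/* Getter: prefer the explicit value, otherwise fall back to the lid matrix. */
static uint16_t sketch_switch_get_max_lid_ho(const sketch_switch_t *sw)
{
    assert(sw != NULL);
    if (sketch_switch_max_lid_is_set(sw))
        return sw->max_lid_ho;
    return sw->matrix_max_lid_ho;
}
```

Routing engines that still build lid matrices never call the setter, so for them the getter behaves as before.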
Regards, Yevgeny Kliteynik Mellanox Technologies LTD Tel: +972-4-909-7200 ext: 394 Fax: +972-4-959-3245 P.O. Box 586 Yokneam 20692 ISRAEL -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Mon Dec 18 17:30:56 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 19 Dec 2006 03:30:56 +0200 Subject: [openib-general] OSM: Using lid matrices in ucast manager In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> Message-ID: <1166491856.29306.15.camel@localhost> Hi Yevgeny, On Tue, 2006-12-19 at 01:33 +0200, Yevgeny Kliteynik wrote: > Hi Hal. > > > > I have a question about some patch that I want to send regarding lid > matrices usage in osm ucast > > manager: > > > > The FatTree routing doesn’t use the min hop tables, so we can skip the > lid matrices building in OSM. The lid matrices are used in mcast_mgr for multicast routes generation. > However, ucast manager uses these lid matrices also to get the max lid > that is accessible from each > > switch, which defines the LTF table size. > > This max lid is obtained by calling osm_switch_get_max_lid_ho() > function, which in turn, calls > > osm_lid_matrix_get_max_lid_ho() for the switch’s lid matrix. > > If the lid matrices weren’t built, then the > osm_switch_get_max_lid_ho() function will return 0xFFFF, > > and eventually osm will crash. > > > > Of course, I don’t want to build all the lid matrices just to know the > max lid, so here’s what I’ve done: > > > > * I added a field to the osm_switch_t object: max_lid_ho (with > default value 0xFFFF, should it > be 0x0 instead?). Good thing. 0 is fine as default value IMHO. > * Added and three osm_switch_t methods for this new field: > getter, setter, and is_set that returns > true if this field has been set. Why those methods? Everything you need is to access structure field and 'if (sw->max_lid_ho)' for "is_set" checks. 
> * The original osm_switch_get_max_lid_ho() has been updated to > return this field value if it’s set. > * Then in FatTree routing I set this field for each switch (I > get the max lid ‘for free’ as a byproduct > of the algorithm). > * Now everything in the ucast manager works fine, except for the > following two dump functions: > __osm_ucast_mgr_dump_ucast_routes (it uses hops) > ucast_mgr_dump_lid_matrix (obviously…) > These two functions check at the beginning whether the > max_lid_ho was set (using the ‘is_set’ > method), and return w/o printing anything if the answer is > yes. > > > > This way any other routing engine that uses lid matrix is not affected > by this change, and any routing > > engine that doesn’t use the lid matrix has a way to set the max lid > per switch explicitly. Hope you are adding this for existing code. > This approach works great, but I have a feeling that this is kinda > hack… Moving max_lid(_ho) to switch structure looks like a good idea for me regardless to lid matrix build elimination. The only problem I can see with lid matrices is mcast_mgr which uses this. Sasha > > > > What do you think about this solution? > > Any other suggestions? > > > > Anyway, just wanted to hear your opinion before sending the patch. > > > > Regards, > > > > Yevgeny Kliteynik > > > > Mellanox Technologies LTD > > Tel: +972-4-909-7200 ext: 394 > > Fax: +972-4-959-3245 > > P.O. Box 586 Yokneam 20692 ISRAEL > > > > From Ashish.Batwara at lsi.com Mon Dec 18 18:55:43 2006 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Mon, 18 Dec 2006 19:55:43 -0700 Subject: [openib-general] opensm Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com> Hi, I am trying to run opensm on Linux server. It has two HCAs (4-ports) and connected to IB Switch. ibnodes command displays the information about the Switch ports and HCA ports. 
When I start opensm, I see in /var/log/messages "Starting srp_daemon" for all the 4 ports and immediately after I see "failed srp_daemon" for all the ports and the displays "SM Port is down".

I tried several times and even rebooted the server few times but no luck.

Does anybody know what this problem is?

Thanks
Ashish

From eitan at sw053.yok.mtl.com Mon Dec 18 21:23:57 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Tue, 19 Dec 2006 07:23:57 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-19:normal completion
Message-ID: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3
ibutils rev = Mon_Dec_18_16:00:49_2006 11d857
Total=308 Pass=307 Fail=1

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
41 OsmStress IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo

From vlad at dev.mellanox.co.il Mon Dec 18 23:42:38 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 19 Dec 2006 09:42:38 +0200
Subject: [openib-general] ofed backports update
In-Reply-To: <1166091556.926.17.camel@muscida>
References: <20061211144813.GA15870@mellanox.co.il> <1166091556.926.17.camel@muscida>
Message-ID: <458797EE.9050000@dev.mellanox.co.il>

Yosef Etigin wrote:
> On Mon, 2006-12-11 at 16:48 +0200, Michael S. Tsirkin wrote:
>
>> Here's a small update on OFED 1.2 backports. This describes a change
>> I did a couple of weeks ago but never got to documenting.
>> NOTE: This info is relevant only for people developing OFED kernel code,
>> everything is transparent for others.
>>
>> NOTE: This is by *no means* a comprehensive writeup of OFED build process -
>> just a small update for people familiar with development in OFED 1.1.
>> >> Background: >> OFED 1.1 did all backports by applying patches under >> kernel_patches/backports// directory. >> To back-port a package, you just stuck a patch there >> and one OFED detected an appropriate kernel, it was applied before build. >> In many cases - where the kernel we are back-porting to was simply >> missing some macro - what patch actually did was just add a file >> under the include directory, and OFED build scripts knew to pick >> these up before standard linux includes. >> Managing these became somewhat of a pain as it is often hard to >> see the history of a patch: try git diff on a patch that sits in git tree >> and see what I mean. >> >> Update: >> So for OFED 1.2 I've created a new directory kernel_addons, and converted >> all patches that created new files to plain files under the relevant >> kernel directory. OFED scripts now look there for files before standard >> Linux headers. >> For an example, look at how backport to 2.6.18 looks: >> http://staging.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=tree;f=kernel_addons/backport/2.6.18/include/linux;h=5eabed1f98596f92ce149dae65c4ab1ceb1d6a67;hb=HEAD >> Unfortunately, not all patches are of this form - some really tweak source >> inside the infiniband subtree - but we can strive to reduce the number of this >> and in this way make maintaining backports more of a seamless process. >> >> Bottom line >> There are now 2 mechanisms for back-porting in OFED: >> - if you want to add a kernel-specific file, stick it under >> kernel_addons/backport//. >> - if you must change an existing file depending on kernel version, stick >> a patch in kernel_patches/backports//. >> >> > > I was running the ‘configure’ script under ofed root. > > In ofed 1.1, it is possible to run configure without flags to patch the > sources, and then run it again –without-patches and with the desired > flags. 
> > In ofed 1.2 (Vlad’s tree) this scenario causes compilation error while > running ‘make’ afterwards (2.6.9-34ELsmp and on 2.6.16.21-0.8, but NOT > 2.6.19) causes compilation errors later on. > > However, when I just ran configure on a fresh source, with all the > desired flags, it worked just fine. > > It seems to happen because the configure only patches Makefiles with the > selected components with the kernel-addons include path. > > Maybe it should patch all Makefiles, or copy the files to ./include? > > > _______________________________________________________________ > Yosef Etigin, ib-host-stack > Voltaire – The Grid Backbone > www.voltaire.com > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > I fixed configure script. Please try again. Regards, Vladimir From eitan at mellanox.co.il Mon Dec 18 23:37:55 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 19 Dec 2006 09:37:55 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-19:normal completion In-Reply-To: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com> References: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com> Message-ID: <458796D3.50709@mellanox.co.il> Clarifications: 1. The OpenSM code run includes the last patches I have sent. 2. The single failure is due to a race in ibmgtsim. ibdiagnet waits forever for a response for a "bind" message. I suspect a deadlock between the "server" and the "node" but I am not sure. 3. The regression still does not run the osmtest tests due to the fact they are all failing. 
EZ Eitan Zahavi wrote: > OSM Simulation Regression Summary > OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3 > ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 > Total=308 Pass=307 Fail=1 > > Pass: > 42 Stability IS1-16.topo > 42 Pkey IS1-16.topo > 42 Multicast IS1-16.topo > 42 LidMgr IS1-16.topo > 41 OsmStress IS1-16.topo > 14 Stability IS3-loop.topo > 14 Stability IS3-128.topo > 14 Pkey IS3-128.topo > 14 OsmStress IS3-128.topo > 14 Multicast IS3-loop.topo > 14 Multicast IS3-128.topo > 14 LidMgr IS3-128.topo > > Failures: > 1 OsmStress IS1-16.topo > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Mon Dec 18 23:42:28 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 19 Dec 2006 09:42:28 +0200 Subject: [openib-general] opensm In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com> References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com> Message-ID: <458797E4.8010600@mellanox.co.il> This is not an OpenSM issue. Forwarded to the SRP people. EZ Batwara, Ashish wrote: > Hi, > I am trying to run opensm on Linux server. It has two HCAs (4-ports) and > connected to IB Switch. ibnodes command displays the information about > the Switch ports and HCA ports. > When I start opensm, I see in /var/log/messages "Starting srp_daemon" > for all the 4 ports and immediately after I see "failed srp_daemon" for > all the ports and the displays "SM Port is down". > > I tried several times and even rebooted the server few times but no > luck. > > Does anybody know what this problem is? 
> > Thanks > Ashish > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Dec 19 00:20:35 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 10:20:35 +0200 Subject: [openib-general] [PATCH obvious] IB/verbs: fix 32-bit big endian platforms Message-ID: <20061219082035.GA24028@mellanox.co.il> ib_dma_alloc_coherent, introduced by commit 9b513090a3c5e4964f9ac09016c1586988abb3d5 is storing dma_handle through a pointer to u64. This is broken on big-endian 32 bit platforms since the handle will land in high-order bits of the qword. And the compiler actually warns about passing argument 3 of dma_alloc_coherent from incompatible pointer type. Signed-off-by: Michael S. Tsirkin diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 3c2e105..4214908 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1637,9 +1637,14 @@ static inline void *ib_dma_alloc_coherent(struct ib_device *dev, u64 *dma_handle, gfp_t flag) { + dma_addr_t a; + void *ptr; if (dev->dma_ops) return dev->dma_ops->alloc_coherent(dev, size, dma_handle, flag); - return dma_alloc_coherent(dev->dma_device, size, dma_handle, flag); + ptr = dma_alloc_coherent(dev->dma_device, size, &a, flag); + if (ptr) + *dma_handle = a; + return ptr; } /** -- MST From mst at mellanox.co.il Tue Dec 19 00:33:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 10:33:25 +0200 Subject: [openib-general] [PATCH] IB/verbs: include linux/kref.h explicitly Message-ID: <20061219083325.GA24952@mellanox.co.il> ib_verbs.h uses struct kref so it should include linux/kref.h directly. Signed-off-by: Michael S. 
Tsirkin --- diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 3c2e105..69a0a11 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -45,6 +45,7 @@ #include #include #include +#include #include #include -- MST From mst at mellanox.co.il Tue Dec 19 00:35:58 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 10:35:58 +0200 Subject: [openib-general] [PATCH] IB/mthca: fix FMR breakage introduced by kmemdup cleanup Message-ID: <20061219083558.GA25036@mellanox.co.il> This reverts mthca breakage intruduced by commit bed8bdfddd851657cf9e5fd16bb44abb02ae7f42 : kmemdup can not be used here since lengths passed to kmalloc/memcpy are not the same. Signed-off-by: Michael S. Tsirkin --- diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 7ec7c4b..7b96751 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1100,10 +1100,11 @@ static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags, struct mthca_fmr *fmr; int err; - fmr = kmemdup(fmr_attr, sizeof *fmr, GFP_KERNEL); + fmr = kmalloc(sizeof *fmr, GFP_KERNEL); if (!fmr) return ERR_PTR(-ENOMEM); + memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr); err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num, convert_access(mr_access_flags), fmr); -- MST From mst at mellanox.co.il Tue Dec 19 00:52:36 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 10:52:36 +0200 Subject: [openib-general] [PATCH] IB/mthca: fix FMR breakage introduced by kmemdup cleanup In-Reply-To: <20061219083558.GA25036@mellanox.co.il> References: <20061219083558.GA25036@mellanox.co.il> Message-ID: <20061219085236.GD25243@mellanox.co.il> > This reverts mthca breakage intruduced by commit > bed8bdfddd851657cf9e5fd16bb44abb02ae7f42 : > kmemdup can not be used here since lengths passed to kmalloc/memcpy > are not the same. 
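To make the size mismatch concrete, here is a minimal userspace sketch under stated assumptions: the struct names and members are hypothetical, and malloc/memcpy stand in for kmalloc/memcpy. It shows only the general shape of the fix, where the allocation covers the whole object while the copy covers just the embedded attribute member, which a single "duplicate N bytes" helper cannot express.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: a small attribute struct embedded in a larger object. */
struct sketch_attr {
    int max_pages;
    int max_maps;
    unsigned char page_shift;
};

struct sketch_fmr {
    long other_state[4];     /* fields that precede the attributes */
    struct sketch_attr attr; /* private copy of the caller's attributes */
};

/*
 * Allocate the full object (sizeof *fmr bytes), then copy only the
 * attribute bytes (sizeof *attr) into the embedded member. The two
 * lengths differ, and the destination is not the start of the object.
 */
static struct sketch_fmr *sketch_alloc_fmr(const struct sketch_attr *attr)
{
    struct sketch_fmr *fmr;

    assert(attr != NULL);
    fmr = malloc(sizeof *fmr);
    if (!fmr)
        return NULL;

    memcpy(&fmr->attr, attr, sizeof *attr);
    return fmr;
}
```

Duplicating sizeof(struct sketch_fmr) bytes starting at the caller's attribute pointer would instead read past the end of the source and place the attributes at offset 0 of the new object.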
> > Signed-off-by: Michael S. Tsirkin This was reported by Dotan Barak -- MST From ogerlitz at voltaire.com Tue Dec 19 01:14:41 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 19 Dec 2006 11:14:41 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> Message-ID: <4587AD81.2010703@voltaire.com> Bernadat, Philippe wrote: > So after a bit more testing, setting the route path mtu to 1024 before > the qp creation (rdma_create_qp()) seems sufficient. sure, rdma_create_qp is called on the create_conn flow which is executed after getting RDMA_CM_EVENT_ROUTE_RESOLVED as i suggested... OK, so where we are now, what is the current bw matrix (voltaire/ofed fmr/no-fmr)? Or. From philippe_bernadat at hp.com Tue Dec 19 01:58:38 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Tue, 19 Dec 2006 10:58:38 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <4587AD81.2010703@voltaire.com> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net> Hi Or, I didn't have time to re-run then non FMR cases. For FMR VIB and OFED are comparable. But I will e-run tests for all cases Right now I am fighting with ib_query_device() that crashes the kernel ! Trying to use this to test the HCA type. Philippe > -----Original Message----- > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > Sent: Tuesday, December 19, 2006 10:15 AM > To: Bernadat, Philippe > Cc: Roland Dreier; openib-general at openib.org > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire(lustre) > > Bernadat, Philippe wrote: > > So after a bit more testing, setting the route path mtu to > 1024 before > > the qp creation (rdma_create_qp()) seems sufficient. 
> > sure, rdma_create_qp is called on the create_conn flow which > is executed > after getting RDMA_CM_EVENT_ROUTE_RESOLVED as i suggested... > > OK, so where we are now, what is the current bw matrix (voltaire/ofed > fmr/no-fmr)? > > Or. > > From tziporet at dev.mellanox.co.il Tue Dec 19 02:04:53 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 19 Dec 2006 12:04:53 +0200 Subject: [openib-general] SRP problem: srp_daemon failure (was: opensm) In-Reply-To: <458797E4.8010600@mellanox.co.il> References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com> <458797E4.8010600@mellanox.co.il> Message-ID: <4587B945.6060700@dev.mellanox.co.il> Eitan Zahavi wrote: > This is not an OpenSM issue. > Forwarded to the SRP people. > > EZ > Batwara, Ashish wrote: > >> Hi, >> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and >> connected to IB Switch. ibnodes command displays the information about >> the Switch ports and HCA ports. >> When I start opensm, I see in /var/log/messages "Starting srp_daemon" >> for all the 4 ports and immediately after I see "failed srp_daemon" for >> all the ports and the displays "SM Port is down". >> >> I tried several times and even rebooted the server few times but no >> luck. >> >> Does anybody know what this problem is? >> >> Thanks >> Ashish >> > Changed the subject for SRP people to be aware of the problem. Tziporet From tziporet at dev.mellanox.co.il Tue Dec 19 02:19:44 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 19 Dec 2006 12:19:44 +0200 Subject: [openib-general] [PATCH] ib_sa: Fix kernel Oops caused by ib_sa unload In-Reply-To: <1166458881.9289.17.camel@muscida> References: <1166458881.9289.17.camel@muscida> Message-ID: <4587BCC0.1020104@dev.mellanox.co.il> Yosef Etigin wrote: > This is a fix to Sean's multicast patches for ofed 1.2. 
> > The issuse is described in: > http://www.mail-archive.com/openib-general at openib.org/msg27097.html > > The Oops happened because the multicast work handler was called > after the multicast device structure was released. It happened because > the multicast cleanup function 'mcast_remove_one' didn't wait for > work queue completion on all ports before releasing the device, but > only N-1 ports. > > The patch applies after Sean's multicast patch series. > > Hi Yosef, Very good that you found this bug. Since Sean on vacation can you create a patch of the multicast module against the new code base of OFED (kernel 2.6.20-rc1) Thanks, Tziporet From ogerlitz at voltaire.com Tue Dec 19 02:44:13 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 19 Dec 2006 12:44:13 +0200 Subject: [openib-general] [PATCH] ib_sa: Fix kernel Oops caused by ib_sa unload In-Reply-To: <4587BCC0.1020104@dev.mellanox.co.il> References: <1166458881.9289.17.camel@muscida> <4587BCC0.1020104@dev.mellanox.co.il> Message-ID: <4587C27D.50902@voltaire.com> Tziporet Koren wrote: > Yosef Etigin wrote: >> This is a fix to Sean's multicast patches for ofed 1.2. >> >> The issuse is described in: >> http://www.mail-archive.com/openib-general at openib.org/msg27097.html >> >> The Oops happened because the multicast work handler was called >> after the multicast device structure was released. It happened because >> the multicast cleanup function 'mcast_remove_one' didn't wait for >> work queue completion on all ports before releasing the device, but >> only N-1 ports. >> >> The patch applies after Sean's multicast patch series. >> >> > Hi Yosef, > Very good that you found this bug. 
> Since Sean on vacation can you create a patch of the multicast module
> against the new code base of OFED (kernel 2.6.20-rc1)

I don't think this is possible since, as Michael has said, Sean has to rebase the multicast patches on top of 2.6.20-rc1 (the v2 patch series was based on 2.6.19, and v3, the one that was merged, is without them).

Or.

From ogerlitz at voltaire.com Tue Dec 19 03:03:36 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 13:03:36 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net>
Message-ID: <4587C708.2010700@voltaire.com>

Bernadat, Philippe wrote:
> I didn't have time to re-run then non FMR cases.
> For FMR VIB and OFED are comparable.
> But I will e-run tests for all cases

My main concern is FMR/no-FMR for OFED: FMR should be at least as good as no-FMR, and if this is not the case, let's look into that.

> Right now I am fighting with ib_query_device() that crashes the kernel !
> Trying to use this to test the HCA type.

So if your approach is:

if (the-active-side-is-mlx-tavor) then set-path-mtu-to-1024

then this is a bug, since only the **active** side sets the path mtu, whereas the passive side (SFS ...) might be mlx-tavor and the active side can be something else, and the tavor mtu bug will hit you. I am thinking about what the correct way to approach the problem at the cma level is; there will probably be some discussion here.

Having said all the above, ib_query_device must not crash the kernel! If there is some issue, please report it here.

Or.
From eitan at mellanox.co.il Tue Dec 19 03:17:42 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 19 Dec 2006 13:17:42 +0200 Subject: [openib-general] opensm In-Reply-To: <458797E4.8010600@mellanox.co.il> References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com> <458797E4.8010600@mellanox.co.il> Message-ID: <4587CA56.9080906@mellanox.co.il> Hi Ashish, SRP people say they have no such error message. OpenSM does. So I take it back. Ashish, Please provide more into: 1. ibv_devinfo 2. Version of code you are using 3. Command line you use for starting opensm 4. /var/log/osm.log Thanks and sorry for the confusion. EZ Eitan Zahavi wrote: > This is not an OpenSM issue. > Forwarded to the SRP people. > > EZ > Batwara, Ashish wrote: > >> Hi, >> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and >> connected to IB Switch. ibnodes command displays the information about >> the Switch ports and HCA ports. >> When I start opensm, I see in /var/log/messages "Starting srp_daemon" >> for all the 4 ports and immediately after I see "failed srp_daemon" for >> all the ports and the displays "SM Port is down". >> >> I tried several times and even rebooted the server few times but no >> luck. >> >> Does anybody know what this problem is? 
>> >> Thanks >> Ashish >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Tue Dec 19 03:59:48 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2006 06:59:48 -0500 Subject: [openib-general] OSM: Using lid matrices in ucast manager In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> Message-ID: <1166529491.32666.241847.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2006-12-18 at 18:33, Yevgeny Kliteynik wrote: > Hi Hal. > > > > I have a question about some patch that I want to send regarding lid > matrices usage in osm ucast > > manager: > > > > The FatTree routing doesn’t use the min hop tables, so we can skip the > lid matrices building in OSM. > > However, ucast manager uses these lid matrices also to get the max lid > that is accessible from each > > switch, which defines the LTF table size. > > This max lid is obtained by calling osm_switch_get_max_lid_ho() > function, which in turn, calls > > osm_lid_matrix_get_max_lid_ho() for the switch’s lid matrix. > > If the lid matrices weren’t built, then the > osm_switch_get_max_lid_ho() function will return 0xFFFF, > > and eventually osm will crash. > > > > Of course, I don’t want to build all the lid matrices just to know the > max lid, so here’s what I’ve done: > > > > * I added a field to the osm_switch_t object: max_lid_ho (with > default value 0xFFFF, should it > be 0x0 instead?). 
0 seems better to me but I'm not sure what else this impacts. Note also there are other 0xffff initializations similar to this which IMO are also candidates for change :-( > * Added and three osm_switch_t methods for this new field: > getter, setter, and is_set that returns > true if this field has been set. Is is_set really needed ? > * The original osm_switch_get_max_lid_ho() has been updated to > return this field value if it’s set. > * Then in FatTree routing I set this field for each switch (I > get the max lid ‘for free’ as a byproduct > of the algorithm). > * Now everything in the ucast manager works fine, except for the > following two dump functions: > __osm_ucast_mgr_dump_ucast_routes (it uses hops) > ucast_mgr_dump_lid_matrix (obviously…) > These two functions check at the beginning whether the > max_lid_ho was set (using the ‘is_set’ > method), and return w/o printing anything if the answer is > yes. Perhaps a dump routine is a routine which each routing protocol should supply ? -- Hal > This way any other routing engine that uses lid matrix is not affected > by this change, and any routing > > engine that doesn’t use the lid matrix has a way to set the max lid > per switch explicitly. > > > > This approach works great, but I have a feeling that this is kinda > hack… > > > > What do you think about this solution? > > Any other suggestions? > > > > Anyway, just wanted to hear your opinion before sending the patch. > > > > Regards, > > > > Yevgeny Kliteynik > > > > Mellanox Technologies LTD > > Tel: +972-4-909-7200 ext: 394 > > Fax: +972-4-959-3245 > > P.O. 
Box 586 Yokneam 20692 ISRAEL
> >
> >

From halr at voltaire.com Tue Dec 19 04:05:56 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 07:05:56 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
Message-ID: <1166529940.32666.242119.camel@hal.voltaire.com>

Hi Ashish,

On Mon, 2006-12-18 at 21:55, Batwara, Ashish wrote:
> Hi,
> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
> connected to IB Switch. ibnodes command displays the information about
> the Switch ports and HCA ports.
> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
> for all the 4 ports and immediately after I see "failed srp_daemon" for
> all the ports and the displays "SM Port is down".

"SM Port down" means there is no physical link between the SM port and
its peer. Can you investigate and fix this ?

-- Hal

> I tried several times and even rebooted the server few times but no
> luck.
>
> Does anybody know what this problem is?
>
> Thanks
> Ashish

From mst at mellanox.co.il Tue Dec 19 04:10:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 14:10:42 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <4587C708.2010700@voltaire.com>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net> <4587C708.2010700@voltaire.com>
Message-ID: <20061219121042.GB30743@mellanox.co.il>

> I am thinking what is the correct way to approach the problem, at the
> cma level, there will be probably some discussion here.
I guess the right thing to do for now would be to fix the cma tavor quirk patch. But the real solution is in the SA - tricks in cma are just a partial work-around. -- MST From halr at voltaire.com Tue Dec 19 04:09:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2006 07:09:17 -0500 Subject: [openib-general] OSM: Using lid matrices in ucast manager In-Reply-To: <1166491856.29306.15.camel@localhost> References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> <1166491856.29306.15.camel@localhost> Message-ID: <1166530077.32666.242175.camel@hal.voltaire.com> Hi Yevgeny & Sasha, On Mon, 2006-12-18 at 20:30, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On Tue, 2006-12-19 at 01:33 +0200, Yevgeny Kliteynik wrote: > > Hi Hal. > > > > > > > > I have a question about some patch that I want to send regarding lid > > matrices usage in osm ucast > > > > manager: > > > > > > > > The FatTree routing doesn’t use the min hop tables, so we can skip the > > lid matrices building in OSM. > > The lid matrices are used in mcast_mgr for multicast routes generation. Good point but fat tree seems to work for multicast (at least in my subnet). How could that be ? -- Hal > > However, uca-st manager uses these lid matrices also to get the max lid > > that is accessible from each > > > > switch, which defines the LTF table size. > > > > This max lid is obtained by calling osm_switch_get_max_lid_ho() > > function, which in turn, calls > > > > osm_lid_matrix_get_max_lid_ho() for the switch’s lid matrix. > > > > If the lid matrices weren’t built, then the > > osm_switch_get_max_lid_ho() function will return 0xFFFF, > > > > and eventually osm will crash. > > > > > > > > Of course, I don’t want to build all the lid matrices just to know the > > max lid, so here’s what I’ve done: > > > > > > > > * I added a field to the osm_switch_t object: max_lid_ho (with > > default value 0xFFFF, should it > > be 0x0 instead?). > > Good thing. 0 is fine as default value IMHO. 
> > > * Added and three osm_switch_t methods for this new field: > > getter, setter, and is_set that returns > > true if this field has been set. > > Why those methods? Everything you need is to access structure field and > 'if (sw->max_lid_ho)' for "is_set" checks. > > > * The original osm_switch_get_max_lid_ho() has been updated to > > return this field value if it’s set. > > * Then in FatTree routing I set this field for each switch (I > > get the max lid ‘for free’ as a byproduct > > of the algorithm). > > * Now everything in the ucast manager works fine, except for the > > following two dump functions: > > __osm_ucast_mgr_dump_ucast_routes (it uses hops) > > ucast_mgr_dump_lid_matrix (obviously…) > > These two functions check at the beginning whether the > > max_lid_ho was set (using the ‘is_set’ > > method), and return w/o printing anything if the answer is > > yes. > > > > > > > > This way any other routing engine that uses lid matrix is not affected > > by this change, and any routing > > > > engine that doesn’t use the lid matrix has a way to set the max lid > > per switch explicitly. > > Hope you are adding this for existing code. > > > This approach works great, but I have a feeling that this is kinda > > hack… > > Moving max_lid(_ho) to switch structure looks like a good idea for me > regardless to lid matrix build elimination. > > The only problem I can see with lid matrices is mcast_mgr which uses > this. > > Sasha > > > > > > > > > What do you think about this solution? > > > > Any other suggestions? > > > > > > > > Anyway, just wanted to hear your opinion before sending the patch. > > > > > > > > Regards, > > > > > > > > Yevgeny Kliteynik > > > > > > > > Mellanox Technologies LTD > > > > Tel: +972-4-909-7200 ext: 394 > > > > Fax: +972-4-959-3245 > > > > P.O. 
Box 586 Yokneam 20692 ISRAEL > > > > > > > > From halr at voltaire.com Tue Dec 19 04:12:24 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2006 07:12:24 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-19:normal completion In-Reply-To: <458796D3.50709@mellanox.co.il> References: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com> <458796D3.50709@mellanox.co.il> Message-ID: <1166530303.32666.242284.camel@hal.voltaire.com> On Tue, 2006-12-19 at 02:37, Eitan Zahavi wrote: > Clarifications: > > 1. The OpenSM code run includes the last patches I have sent. > 2. The single failure is due to a race in ibmgtsim. ibdiagnet waits > forever for a response for a "bind" message. > I suspect a deadlock between the "server" and the "node" but I am > not sure. > 3. The regression still does not run the osmtest tests due to the fact > they are all failing. Is this due to the one issue with InformInfo ? -- Hal > EZ > > Eitan Zahavi wrote: > > OSM Simulation Regression Summary > > OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3 > > ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 > > Total=308 Pass=307 Fail=1 > > > > Pass: > > 42 Stability IS1-16.topo > > 42 Pkey IS1-16.topo > > 42 Multicast IS1-16.topo > > 42 LidMgr IS1-16.topo > > 41 OsmStress IS1-16.topo > > 14 Stability IS3-loop.topo > > 14 Stability IS3-128.topo > > 14 Pkey IS3-128.topo > > 14 OsmStress IS3-128.topo > > 14 Multicast IS3-loop.topo > > 14 Multicast IS3-128.topo > > 14 LidMgr IS3-128.topo > > > > Failures: > > 1 OsmStress IS1-16.topo > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From halr at voltaire.com Tue Dec 19 04:21:21 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2006 07:21:21 -0500 Subject: [openib-general] SRP problem: srp_daemon failure (was: 
opensm) In-Reply-To: <4587B945.6060700@dev.mellanox.co.il> References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com> <458797E4.8010600@mellanox.co.il> <4587B945.6060700@dev.mellanox.co.il> Message-ID: <1166530801.32666.242659.camel@hal.voltaire.com> On Tue, 2006-12-19 at 05:04, Tziporet Koren wrote: > Eitan Zahavi wrote: > > This is not an OpenSM issue. > > Forwarded to the SRP people. > > > > EZ > > Batwara, Ashish wrote: > > > >> Hi, > >> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and > >> connected to IB Switch. ibnodes command displays the information about > >> the Switch ports and HCA ports. > >> When I start opensm, I see in /var/log/messages "Starting srp_daemon" > >> for all the 4 ports and immediately after I see "failed srp_daemon" for > >> all the ports and the displays "SM Port is down". > >> > >> I tried several times and even rebooted the server few times but no > >> luck. > >> > >> Does anybody know what this problem is? > >> > >> Thanks > >> Ashish > >> > > > Changed the subject for SRP people to be aware of the problem. Not the first level issue but shouldn't the srp_daemon be able to come up without the SM or without the SM port up ? -- Hal > Tziporet > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Dec 19 04:24:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 14:24:53 +0200 Subject: [openib-general] Performance Degradation with OFED v. 
Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net> <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> Message-ID: <20061219122453.GC30743@mellanox.co.il> > So after a bit more testing, setting the route path mtu to 1024 before > the qp creation (rdma_create_qp()) seems sufficient. OK, so the following fixes the tavor_quirk flag in cma to actually do something. Could you please replace the patch cma_tavor_quirk.patch with this one, set tavor_quirk option for cma module, and see if this works as expected? Unpack OFED 1.1, copy the following to OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch removing the patch by the same name that is in OFED (also remove xxx_cma_tavor_quirk.txt or other patches if you put them there) and then pack OFED 1.1 and rebuild. Thanks, ----------------- Tavor systems get better performance with 1K MTU. Since there does not seem to be any way to find out whether the remote system uses Tavor, add an option to limit the MTU globally. Signed-off-by: Michael S. 
Tsirkin --- diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 50150c8..261bf45 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); MODULE_LICENSE("Dual BSD/GPL"); +static int tavor_quirk = 0; +module_param_named(tavor_quirk, tavor_quirk, int, 0644); +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: limit MTU to 1K if > 0"); + #define CMA_CM_RESPONSE_TIMEOUT 20 #define CMA_MAX_CM_RETRIES 3 @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, { struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; struct ib_sa_path_rec path_rec; + ib_sa_comp_mask mask; memset(&path_rec, 0, sizeof path_rec); ib_addr_get_sgid(addr, &path_rec.sgid); @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; + if (tavor_quirk) { + path_rec.mtu_selector = IB_SA_LT; + path_rec.mtu = IB_MTU_2048; + mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU; + } else + mask = 0; + id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, - id_priv->id.port_num, &path_rec, + id_priv->id.port_num, &path_rec, mask | IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, timeout_ms, GFP_KERNEL, -- MST From ogerlitz at voltaire.com Tue Dec 19 04:37:31 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 19 Dec 2006 14:37:31 +0200 Subject: [openib-general] tavor quirks etc Message-ID: <4587DD0B.1030403@voltaire.com> Basically, i think we should be going to the simple approach of having **one** quirk in the rdma cm kernel code saying: if (tavor_quirk) then route->path_rec->mtu = IB_MTU_1024 so users would have to set the quirk to true in the presence of tavor HCA either in the active or passive side. 
This patch should also go upstream.

The problems I see with the current approach are:

1) there are three patches

2) of them, the cma-tavor-quirk is broken in its design (see *** below),
since it assumes the opensm-tavor-quirk: it would work neither with an
opensm that does not have it nor with 3rd party/commercial SMs which do
not have a similar quirk

3) the ipoib-selector patch (below) in a way assumes the opensm quirk
and hence it was not pushed upstream, and vice versa: upstream ipoib
code is broken with an opensm running with the quirk!

(***) per 15.2.5.16 PATHRECORD, you should get from the SM "less than MTU
specified" in case it has such a path. Now, what does it mean that "it
has such a path"? Looking in the opensm code @
opensm/osm_sa_path_record.c :: __osm_pr_rcv_get_path_parms
you can see that when the tavor quirk patch is ***not*** applied the SM
scans the path and for each port compares the port MTU to the requested
MTU, such that at the end of the scan the path MTU is the minimal MTU
reported along the path.
and then apply this code: > if ( ( comp_mask & IB_PR_COMPMASK_MTUSELEC ) && > ( comp_mask & IB_PR_COMPMASK_MTU ) ) > { > required_mtu = ib_path_rec_mtu( p_pr ); > switch( ib_path_rec_mtu_sel( p_pr ) ) > { > case 0: /* must be greater than */ > if( mtu <= required_mtu ) > status = IB_NOT_FOUND; > break; > > case 1: /* must be less than */ > if( mtu >= required_mtu ) > status = IB_NOT_FOUND; > break; XXX - the cma_tavor_quirk is broken without the opensm-tavor-quirk > > case 2: /* exact match */ > if( mtu != required_mtu ) > status = IB_NOT_FOUND; > break; > > case 3: /* largest available */ > /* can't be disqualified by this one */ > break; this is the ipoib-selector patch > Index: ofed_1_1/drivers/infiniband/ulp/ipoib/ipoib_main.c > =================================================================== > --- ofed_1_1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ ofed_1_1/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -182,6 +182,8 @@ static int ipoib_change_mtu(struct net_d > > dev->mtu = min(priv->mcast_mtu, priv->admin_mtu); > > + queue_work(ipoib_workqueue, &priv->flush_task); > + > return 0; > } > > @@ -452,15 +454,39 @@ static int path_rec_start(struct net_dev > struct ipoib_path *path) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > + ib_sa_comp_mask comp_mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU; > + > + path->pathrec.mtu_selector = IB_SA_GT; > > - ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n", > - IPOIB_GID_ARG(path->pathrec.dgid)); > + switch (roundup_pow_of_two(dev->mtu + IPOIB_ENCAP_LEN)) { > + case 512: > + path->pathrec.mtu = IB_MTU_256; > + break; > + case 1024: > + path->pathrec.mtu = IB_MTU_512; > + break; > + case 2048: > + path->pathrec.mtu = IB_MTU_1024; > + break; > + case 4096: > + path->pathrec.mtu = IB_MTU_2048; > + break; > + default: > + /* Wildcard everything */ > + comp_mask = 0; > + path->pathrec.mtu = 0; > + path->pathrec.mtu_selector = 0; > + } > + ipoib_dbg(priv, "Start path record lookup 
for " IPOIB_GID_FMT " MTU > %d\n", > + IPOIB_GID_ARG(path->pathrec.dgid), > + comp_mask ? ib_mtu_enum_to_int(path->pathrec.mtu) : 0); > > init_completion(&path->done); > > path->query_id = > ib_sa_path_rec_get(priv->ca, priv->port, > - &path->pathrec, > + &path->pathrec, comp_mask | > IB_SA_PATH_REC_DGID | > IB_SA_PATH_REC_SGID | > IB_SA_PATH_REC_NUMB_PATH | From kliteyn at dev.mellanox.co.il Tue Dec 19 04:43:54 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 14:43:54 +0200 Subject: [openib-general] OSM: Using lid matrices in ucast manager In-Reply-To: <1166529491.32666.241847.camel@hal.voltaire.com> References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> <1166529491.32666.241847.camel@hal.voltaire.com> Message-ID: <4587DE8A.1090207@dev.mellanox.co.il> Hi Hal & Sasha. Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2006-12-18 at 18:33, Yevgeny Kliteynik wrote: >> Hi Hal. >> >> >> >> I have a question about some patch that I want to send regarding lid >> matrices usage in osm ucast >> >> manager: >> >> >> >> The FatTree routing doesn’t use the min hop tables, so we can skip the >> lid matrices building in OSM. >> >> However, ucast manager uses these lid matrices also to get the max lid >> that is accessible from each >> >> switch, which defines the LTF table size. >> >> This max lid is obtained by calling osm_switch_get_max_lid_ho() >> function, which in turn, calls >> >> osm_lid_matrix_get_max_lid_ho() for the switch’s lid matrix. >> >> If the lid matrices weren’t built, then the >> osm_switch_get_max_lid_ho() function will return 0xFFFF, >> >> and eventually osm will crash. >> >> >> >> Of course, I don’t want to build all the lid matrices just to know the >> max lid, so here’s what I’ve done: >> >> >> >> * I added a field to the osm_switch_t object: max_lid_ho (with >> default value 0xFFFF, should it >> be 0x0 instead?). > > 0 seems better to me but I'm not sure what else this impacts. Agree. 
> Note also there are other 0xffff initializations similar to this which > IMO are also candidates for change :-( > >> * Added and three osm_switch_t methods for this new field: >> getter, setter, and is_set that returns >> true if this field has been set. > > Is is_set really needed ? No, it's not - I added it just to 'encapsulate' the default value of the new field, so that this initialization value will remain osm_switch_t internal. But we can access the field directly instead. We can also replace it by something like osm_switch_lmx_exists() to make it look more general. >> * The original osm_switch_get_max_lid_ho() has been updated to >> return this field value if it’s set. >> * Then in FatTree routing I set this field for each switch (I >> get the max lid ‘for free’ as a byproduct >> of the algorithm). >> * Now everything in the ucast manager works fine, except for the >> following two dump functions: >> __osm_ucast_mgr_dump_ucast_routes (it uses hops) >> ucast_mgr_dump_lid_matrix (obviously…) >> These two functions check at the beginning whether the >> max_lid_ho was set (using the ‘is_set’ >> method), and return w/o printing anything if the answer is >> yes. > > Perhaps a dump routine is a routine which each routing protocol should > supply ? Good idea. This way the dump function will dump whatever is relevant to a certain routing engine. -- Yevgeny > -- Hal > >> This way any other routing engine that uses lid matrix is not affected >> by this change, and any routing >> >> engine that doesn’t use the lid matrix has a way to set the max lid >> per switch explicitly. >> >> >> >> This approach works great, but I have a feeling that this is kinda >> hack… >> >> >> >> What do you think about this solution? >> >> Any other suggestions? >> >> >> >> Anyway, just wanted to hear your opinion before sending the patch. 
>> >> >> >> Regards, >> >> >> >> Yevgeny Kliteynik >> >> >> >> Mellanox Technologies LTD >> >> Tel: +972-4-909-7200 ext: 394 >> >> Fax: +972-4-959-3245 >> >> P.O. Box 586 Yokneam 20692 ISRAEL >> >> >> >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Tue Dec 19 05:00:35 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 19 Dec 2006 15:00:35 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-19:normal completion In-Reply-To: <1166530303.32666.242284.camel@hal.voltaire.com> References: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com> <458796D3.50709@mellanox.co.il> <1166530303.32666.242284.camel@hal.voltaire.com> Message-ID: <4587E273.4020408@mellanox.co.il> Hal Rosenstock wrote: > On Tue, 2006-12-19 at 02:37, Eitan Zahavi wrote: > >> Clarifications: >> >> 1. The OpenSM code run includes the last patches I have sent. >> 2. The single failure is due to a race in ibmgtsim. ibdiagnet waits >> forever for a response for a "bind" message. >> I suspect a deadlock between the "server" and the "node" but I am >> not sure. >> 3. The regression still does not run the osmtest tests due to the fact >> they are all failing. >> > > Is this due to the one issue with InformInfo ? > Yup. 
> -- Hal > > >> EZ >> >> Eitan Zahavi wrote: >> >>> OSM Simulation Regression Summary >>> OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3 >>> ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 >>> Total=308 Pass=307 Fail=1 >>> >>> Pass: >>> 42 Stability IS1-16.topo >>> 42 Pkey IS1-16.topo >>> 42 Multicast IS1-16.topo >>> 42 LidMgr IS1-16.topo >>> 41 OsmStress IS1-16.topo >>> 14 Stability IS3-loop.topo >>> 14 Stability IS3-128.topo >>> 14 Pkey IS3-128.topo >>> 14 OsmStress IS3-128.topo >>> 14 Multicast IS3-loop.topo >>> 14 Multicast IS3-128.topo >>> 14 LidMgr IS3-128.topo >>> >>> Failures: >>> 1 OsmStress IS1-16.topo >>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jsquyres at cisco.com Tue Dec 19 05:12:35 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 19 Dec 2006 08:12:35 -0500 Subject: [openib-general] Status of old and new servers Message-ID: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com> Some important decisions regarding the old server were made on the EWG call yesterday. If you're still committing to SVN, do not ignore this e-mail. 1. The only guy with the password to the openfabrics.org domain is out of reach for the next several months. So "openfabrics.org" and "www.openfabrics.org" will continue to point to the old server for the foreseeable future. We have one name [that was intended to be temporary] that points to the new server (staging.openfabrics.org). Other than shutting down SVN, I'm not sure how we want to proceed with the rest of the server migration. 2. 
Committing to SVN on the old server will be disabled as of COB this
*THURSDAY* (21 Dec 2006). Anonymous, read-only access will still be
supported for a short time longer.

3. The SVN database will be resynchronized with the new server on
Friday, 22 Dec. **If you have changes in SVN on the new server, THEY
WILL BE LOST.**

3a. Reflecting that most activity is occurring in git, commits will be
disabled in all SVN trees by default. If you want your tree left
enabled in SVN for commits, please reply to this e-mail indicating
exactly which tree you want enabled for commits and a specific list of
usernames that are allowed to commit to the tree.

3b. The rest of SVN will be available for read-only access /
hysterical raisins for a few more months. Proposed OFA SVN death
date: March 31, 2007. Per exceptions in 3a, much of the data at the
SVN HEAD will be "svn rm"'ed to reflect that the most current stuff is
now in git -- you'll have to use SVN history commands to get at the
older stuff. Appropriate README files will be left describing how to
get to the history and to the various git repositories.

4. Everyone who had a commit account on the old server should already
be set up with an account on the new server.

5. Work is progressing to figure out what content management system
will be used to maintain the OFA web site on the new server. In the
meantime, the old pages will simply be copied over. The OFA marketing
group can figure out the rest.
--> Don't know what to do about the DNS issues yet.

6. For the time being, it is likely that we'll use the same wiki on
the new server (tiki) and simply copy the content over.
--> Don't know what to do about the DNS issues yet.

7. You are among the elite group who managed to read this entire
e-mail. Congratulations. Call your local representative to claim your
fabulous prize.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From mst at mellanox.co.il Tue Dec 19 05:16:25 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 19 Dec 2006 15:16:25 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587DD0B.1030403@voltaire.com>
References: <4587DD0B.1030403@voltaire.com>
Message-ID: <20061219131625.GE30743@mellanox.co.il>

> The problems i see with the current approach are:
>
> 1) there are three patches

Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
It is not a 100% solution, but it is the only workaround for proprietary SMs.

Fixing the SA is a full solution. We (Mellanox) will work with SA vendors
to get this addressed. But of course this takes time.

> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
> since it assumes the opensm-tavor-quirk and it would not work with
> opensm that does not have it nor with 3rd party/commercial SMs which do
> not have similar quirk

cma-tavor-quirk in OFED 1.1 is broken but not by design - the patch I posted
recently fixes the bug and should work with any compliant SM.

I did not look at the opensm code specifically, but the
"15.2.5.16 PATHRECORD" is quite explicit in its requirements:

MtuSelector   2   432   In a query request:
                        3-largest MTU available
                        If MTU is specified (i.e., the ComponentMask
                        bit for MTU is 1):
                        0-greater than MTU specified
                        1-less than MTU specified
                        2-exactly the MTU specified

So if e.g. opensm does not comply (e.g. it is not returning a path where
one exists) we should simply fix it. If there are other broken SMs, we can
look at how they are broken and how to solve this.

> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
> and hence it was not pushed upstream, and vise-versa an upstream ipoib
> code is broken with the open-sm running with the quirk!

All this is incorrect. ipoib-selector is completely irrelevant to the MTU
issue - it's a strict compliance fix for IPoIB. IPoIB also works fine
without this patch (with or without tavor quirk activated).
It does not depend on any specific SM.
It is not upstream only because of style issues and my lack of time to
fix them.

-- MST

From ogerlitz at voltaire.com Tue Dec 19 05:29:24 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 15:29:24 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219131625.GE30743@mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com> <20061219131625.GE30743@mellanox.co.il>
Message-ID: <4587E934.6030601@voltaire.com>

I am still digesting your response where you have addressed my
claims/concerns.

Anyway, what is your response to my suggestion of applying just one
trivial patch at the rdma cm?

Or.

From mst at mellanox.co.il Tue Dec 19 05:37:08 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 15:37:08 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587E934.6030601@voltaire.com>
References: <4587E934.6030601@voltaire.com>
Message-ID: <20061219133708.GG30743@mellanox.co.il>

> I am still digesting your response where you have addressed my
> claims/concerns.

Thanks for raising this issue, I'll continue to think about this.
In particular, the opensm issue that you raise needs to be addressed by
the opensm guys.

> Anyway what is your response to my suggestion of applying just one
> trivial patch at the rdma cm?

I think this would work too, but I somewhat dislike using an MTU that the
SM did not give us - this looks like a spec violation to me. No?

For example, it seems this assumes that any path supports 1/2 MTU, but is
that required by the spec? Further, might the SM make an intelligent
decision in selecting a path if we tell it what MTU we actually want to use?
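The MtuSelector semantics and the opensm check quoted earlier in this thread can be sketched as follows. This is a minimal stand-alone illustration, not the opensm or kernel code: the type and function names are made up for the sketch, and the MTU codes 1..5 for 256..4096 bytes are assumed to match the encoding used by the kernel's `enum ib_mtu`. The selector values 0 to 3 are the ones quoted from 15.2.5.16.

```c
#include <stdbool.h>
#include <stddef.h>

/* MTU codes for this sketch (assumed to mirror the IB encoding). */
enum mtu_code {
    MTU_256 = 1, MTU_512 = 2, MTU_1024 = 3, MTU_2048 = 4, MTU_4096 = 5
};

/* MtuSelector values as quoted from 15.2.5.16 PATHRECORD. */
enum mtu_selector {
    MTU_SEL_GT = 0,      /* greater than MTU specified */
    MTU_SEL_LT = 1,      /* less than MTU specified */
    MTU_SEL_EQ = 2,      /* exactly the MTU specified */
    MTU_SEL_LARGEST = 3  /* largest MTU available */
};

/*
 * Path MTU as described in the thread: the smallest MTU reported by
 * any port along the path. The caller must pass n >= 1.
 */
static enum mtu_code path_mtu(const enum mtu_code *port_mtu, size_t n)
{
    enum mtu_code min = port_mtu[0];
    for (size_t i = 1; i < n; i++)
        if (port_mtu[i] < min)
            min = port_mtu[i];
    return min;
}

/*
 * Selector check in the style of the quoted opensm switch statement:
 * returns false where that code would set IB_NOT_FOUND.
 */
static bool path_matches(enum mtu_code mtu, enum mtu_selector sel,
                         enum mtu_code required)
{
    switch (sel) {
    case MTU_SEL_GT:      return mtu > required;
    case MTU_SEL_LT:      return mtu < required;
    case MTU_SEL_EQ:      return mtu == required;
    case MTU_SEL_LARGEST: return true; /* never disqualifies a path */
    }
    return false;
}
```

Under these assumptions, a path whose ports all report 2048 or more is rejected by a "less than 2048" request, while a path with a 1024 port is accepted. Whether an SM should instead return the first path with a reduced MTU is the point under discussion here.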
--
MST

From halr at voltaire.com Tue Dec 19 05:40:30 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 08:40:30 -0500
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail to report back correct signal
In-Reply-To: <45870A49.1070205@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il>
Message-ID: <1166535535.32666.246023.camel@hal.voltaire.com>

Hi Eitan,

On Mon, 2006-12-18 at 16:38, Eitan Zahavi wrote:
> Hi Hal,
>
> This is a resend as I did not see a bounce of the list of the previous
> posting I did using git-send-email (probably due to misuse).
> The following patch fixes bugs in the ucast manager and pkey manager
> such that they do not report correct signal back.
> In both cases some outstanding SubnSet were ignored.
>
> Signed-off-by: Eitan Zahavi
>
> --------------------------------------------------------------------------------------------
> diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
> index 48837bc..a33aec7 100644
> --- a/osm/opensm/osm_pkey_mgr.c
> +++ b/osm/opensm/osm_pkey_mgr.c

A number of lines in osm_pkey_mgr.c are line wrapped. Please resubmit
this. I am currently working on the osm_ucast_mgr.c changes though.
--
Hal

From eitan at mellanox.co.il Tue Dec 19 05:50:07 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 15:50:07 +0200
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail to report back correct signal
In-Reply-To: <1166535535.32666.246023.camel@hal.voltaire.com>
References: <45870A49.1070205@mellanox.co.il> <1166535535.32666.246023.camel@hal.voltaire.com>
Message-ID: <4587EE0F.8090603@mellanox.co.il>

Hi Hal

Hope this will work

EZ

From 557b0504ab317c470d376f15d7c6d5ed1c9d11f5 Mon Sep 17 00:00:00 2001
From: Eitan Zahavi
Date: Mon, 18 Dec 2006 21:48:45 +0200
Subject: [PATCH] Fix cases where the pkey manager returned OSM_SIGNAL_DONE and not
 OSM_SIGNAL_DONE_PENDING by missing some sent packets
---
 osm/opensm/osm_pkey_mgr.c | 112 +++++++++++++++++++++++++++++++++------------
 1 files changed, 82 insertions(+), 30 deletions(-)

diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index 48837bc..a33aec7 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
 
 /**********************************************************************
  **********************************************************************/
-static ib_api_status_t
+static boolean_t
 pkey_mgr_enforce_partition(
+  IN osm_log_t *p_log,
   IN const osm_req_t *p_req,
   IN const osm_physp_t *p_physp,
   IN const boolean_t enforce)
@@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
   osm_madw_context_t context;
   uint8_t payload[IB_SMP_DATA_SIZE];
   ib_port_info_t *p_pi;
+  ib_api_status_t status;
 
   if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
-    return IB_ERROR;
+  {
+    osm_log( p_log, OSM_LOG_ERROR,
+             "pkey_mgr_enforce_partition: ERR 0507: "
+             "No port info for "
+             "node 0x%016" PRIx64 " port %u\n",
+             cl_ntoh64(
+               osm_node_get_node_guid(
+                 osm_physp_get_node_ptr( p_physp ))),
+             osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
 
-  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
-    return IB_SUCCESS;
+  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
+  {
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "pkey_mgr_enforce_partition: "
+             "No need to update PortInfo for "
+             "node 0x%016" PRIx64 " port %u\n",
+             cl_ntoh64(
+               osm_node_get_node_guid(
+                 osm_physp_get_node_ptr( p_physp ))),
+             osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
 
   memset( payload, 0, IB_SMP_DATA_SIZE );
   memcpy( payload, p_pi, sizeof(ib_port_info_t) );
@@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
   context.pi_context.light_sweep = FALSE;
   context.pi_context.active_transition = FALSE;
 
-  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
-                      payload, sizeof(payload),
-                      IB_MAD_ATTR_PORT_INFO,
-                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
-                      CL_DISP_MSGID_NONE, &context );
+  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
+                        payload, sizeof(payload),
+                        IB_MAD_ATTR_PORT_INFO,
+                        cl_hton32( osm_physp_get_port_num( p_physp ) ),
+                        CL_DISP_MSGID_NONE, &context );
+  if (status != IB_SUCCESS)
+  {
+    osm_log( p_log, OSM_LOG_ERROR,
+             "pkey_mgr_enforce_partition: ERR 0520: "
+             "Failed to set PortInfo for "
+             "node 0x%016" PRIx64 " port %u\n",
+             cl_ntoh64(
+               osm_node_get_node_guid(
+                 osm_physp_get_node_ptr( p_physp ))),
+             osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
+  else
+  {
+    osm_log( p_log, OSM_LOG_DEBUG,
+             "pkey_mgr_enforce_partition: "
+             "Set PortInfo for "
+             "node 0x%016" PRIx64 " port %u\n",
+             cl_ntoh64(
+               osm_node_get_node_guid(
+                 osm_physp_get_node_ptr( p_physp ))),
+             osm_physp_get_port_num( p_physp ) );
+    return TRUE;
+  }
 }
 
 /**********************************************************************
@@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
 
     status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index );
     if (status == IB_SUCCESS)
-      ret_val = TRUE;
+    {
+      osm_log( p_log, OSM_LOG_DEBUG,
+               "pkey_mgr_update_port: "
+               "Updated "
+               "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+               block_index,
+               cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+               osm_physp_get_port_num( p_physp ) );
+      ret_val = TRUE;
+    }
     else
-      osm_log( p_log, OSM_LOG_ERROR,
-               "pkey_mgr_update_port: ERR 0506: "
-               "pkey_mgr_update_pkey_entry() failed to update "
-               "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
-               block_index,
-               cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-               osm_physp_get_port_num( p_physp ) );
+    {
+      osm_log( p_log, OSM_LOG_ERROR,
+               "pkey_mgr_update_port: ERR 0506: "
+               "pkey_mgr_update_pkey_entry() failed to update "
+               "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+               block_index,
+               cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+               osm_physp_get_port_num( p_physp ) );
+    }
   }
 
   return ret_val;
@@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
   uint16_t peer_max_blocks;
   ib_api_status_t status = IB_SUCCESS;
   boolean_t ret_val = FALSE;
+  boolean_t port_info_set = FALSE;
   ib_pkey_table_t empty_block;
-
+
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
   p_physp = osm_port_get_default_phys_ptr( p_port );
@@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
     enforce = FALSE;
   }
 
-  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-             "pkey_mgr_update_peer_port: ERR 0507: "
-             "pkey_mgr_enforce_partition() failed to update "
-             "node 0x%016" PRIx64 " port %u\n",
-             cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-             osm_physp_get_port_num( peer ) );
-  }
+  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
+    port_info_set = TRUE;
 
   if (enforce == FALSE)
-    return FALSE;
+    return port_info_set;
 
   p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
   for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++)
@@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
              osm_physp_get_port_num( peer ) );
   }
 
+  if (port_info_set) return TRUE;
   return ret_val;
 }
 
@@ -541,10 +593,10 @@ osm_pkey_mgr_process(
       signal = OSM_SIGNAL_DONE_PENDING;
     p_node = osm_port_get_parent_node( p_port );
     if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
-         pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
+         pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
                                     &p_osm->subn, p_port,
                                     !p_osm->subn.opt.no_partition_enforcement ) )
-      signal = OSM_SIGNAL_DONE_PENDING;
+      signal = OSM_SIGNAL_DONE_PENDING;
   }
 
 _err:
--
1.4.4.1.GIT

From mst at mellanox.co.il Tue Dec 19 05:52:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 15:52:48 +0200
Subject: [openib-general] out of office Dec 20-23
Message-ID: <20061219135248.GB2075@mellanox.co.il>

I'll be out of office Dec 20-23.
Ciao,

--
MST

From eitan at mellanox.co.il Tue Dec 19 05:59:45 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 15:59:45 +0200
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail to report back correct signal
In-Reply-To: <4587EE0F.8090603@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il> <1166535535.32666.246023.camel@hal.voltaire.com> <4587EE0F.8090603@mellanox.co.il>
Message-ID: <4587F051.2070000@mellanox.co.il>

Seems like it is line wrapped this time too.
I need a new mailer. So I will attach the file and send it.
Sorry about that.
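
For readers skimming the patch under discussion, its intent can be
sketched in a few lines of standalone C. This is only an illustration
under stated assumptions, not the opensm code: ENFORCE_MASK stands in
for the 0xc mask the patch applies to vl_enforce, and
needs_port_info_set, aggregate_signal and the sweep_signal enum are
hypothetical names standing in for the real helpers and OSM_SIGNAL_*
values.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENFORCE_MASK 0x0c /* the two vl_enforce bits the patch tests */

/* Stand-ins for OSM_SIGNAL_DONE / OSM_SIGNAL_DONE_PENDING. */
enum sweep_signal { SIGNAL_DONE, SIGNAL_DONE_PENDING };

/*
 * Hypothetical: true when the enforcement bits differ from the wanted
 * state, i.e. when a PortInfo Set would actually have to be sent.
 */
static bool needs_port_info_set(uint8_t vl_enforce, bool enforce)
{
	uint8_t wanted = enforce ? ENFORCE_MASK : 0;

	return (vl_enforce & ENFORCE_MASK) != wanted;
}

/*
 * Hypothetical: the sweep is only "done" if nothing was sent; any
 * outstanding Set (PortInfo or pkey table block) means "pending".
 */
static enum sweep_signal aggregate_signal(bool port_info_set, bool pkey_blocks_set)
{
	return (port_info_set || pkey_blocks_set) ? SIGNAL_DONE_PENDING : SIGNAL_DONE;
}
```

The point of returning a boolean instead of a status code is that the
caller only cares whether a request is now in flight, so that a sent
PortInfo Set is not lost when no pkey blocks needed updating.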
Eitan Eitan Zahavi wrote: > Hi Hal > > Hope this will work > > EZ > > > From 557b0504ab317c470d376f15d7c6d5ed1c9d11f5 Mon Sep 17 00:00:00 2001 > From: Eitan Zahavi > Date: Mon, 18 Dec 2006 21:48:45 +0200 > Subject: [PATCH] Fix cases where the pkey manager returned > OSM_SIGNAL_DONE and not > OSM_SIGNAL_DONE_PENDING by missing some sent packets > --- > osm/opensm/osm_pkey_mgr.c | 112 > +++++++++++++++++++++++++++++++++------------ > 1 files changed, 82 insertions(+), 30 deletions(-) > > diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c > index 48837bc..a33aec7 100644 > --- a/osm/opensm/osm_pkey_mgr.c > +++ b/osm/opensm/osm_pkey_mgr.c > @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry( > > /********************************************************************** > **********************************************************************/ > -static ib_api_status_t > +static boolean_t > pkey_mgr_enforce_partition( > + IN osm_log_t *p_log, > IN const osm_req_t *p_req, > IN const osm_physp_t *p_physp, > IN const boolean_t enforce) > @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition( > osm_madw_context_t context; > uint8_t payload[IB_SMP_DATA_SIZE]; > ib_port_info_t *p_pi; > + ib_api_status_t status; > > if (!(p_pi = osm_physp_get_port_info_ptr( p_physp ))) > - return IB_ERROR; > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_enforce_partition: ERR 0507: " > + "No port info for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return FALSE; > + } > > - if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) > - return IB_SUCCESS; > + if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_enforce_partition: " > + "No need to update PortInfo for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + 
osm_physp_get_port_num( p_physp ) ); > + return FALSE; > + } > > memset( payload, 0, IB_SMP_DATA_SIZE ); > memcpy( payload, p_pi, sizeof(ib_port_info_t) ); > @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition( > context.pi_context.light_sweep = FALSE; > context.pi_context.active_transition = FALSE; > > - return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), > - payload, sizeof(payload), > - IB_MAD_ATTR_PORT_INFO, > - cl_hton32( osm_physp_get_port_num( p_physp ) ), > - CL_DISP_MSGID_NONE, &context ); > + status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ), > + payload, sizeof(payload), > + IB_MAD_ATTR_PORT_INFO, > + cl_hton32( osm_physp_get_port_num( > p_physp ) ), > + CL_DISP_MSGID_NONE, &context ); > + if (status != IB_SUCCESS) > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_enforce_partition: ERR 0520: " > + "Failed to set PortInfo for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return FALSE; > + } > + else > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_enforce_partition: " > + "Set PortInfo for " > + "node 0x%016" PRIx64 " port %u\n", > + cl_ntoh64( > + osm_node_get_node_guid( > + osm_physp_get_node_ptr( p_physp ))), > + osm_physp_get_port_num( p_physp ) ); > + return TRUE; > + } > } > > /********************************************************************** > @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port( > > status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, > block_index ); > if (status == IB_SUCCESS) > - ret_val = TRUE; > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "pkey_mgr_update_port: " > + "Updated " > + "pkey table block %d for node 0x%016" PRIx64 " > port %u\n", > + block_index, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + ret_val = TRUE; > + } > else > - osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_port: ERR 
0506: " > - "pkey_mgr_update_pkey_entry() failed to update " > - "pkey table block %d for node 0x%016" PRIx64 " port %u\n", > - block_index, > - cl_ntoh64( osm_node_get_node_guid( p_node ) ), > - osm_physp_get_port_num( p_physp ) ); > + { > + osm_log( p_log, OSM_LOG_ERROR, > + "pkey_mgr_update_port: ERR 0506: " > + "pkey_mgr_update_pkey_entry() failed to update " > + "pkey table block %d for node 0x%016" PRIx64 " > port %u\n", > + block_index, > + cl_ntoh64( osm_node_get_node_guid( p_node ) ), > + osm_physp_get_port_num( p_physp ) ); > + } > } > > return ret_val; > @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port( > uint16_t peer_max_blocks; > ib_api_status_t status = IB_SUCCESS; > boolean_t ret_val = FALSE; > + boolean_t port_info_set = FALSE; > ib_pkey_table_t empty_block; > - > + > memset(&empty_block, 0, sizeof(ib_pkey_table_t)); > > p_physp = osm_port_get_default_phys_ptr( p_port ); > @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port( > enforce = FALSE; > } > > - if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS) > - { > - osm_log( p_log, OSM_LOG_ERROR, > - "pkey_mgr_update_peer_port: ERR 0507: " > - "pkey_mgr_enforce_partition() failed to update " > - "node 0x%016" PRIx64 " port %u\n", > - cl_ntoh64( osm_node_get_node_guid( p_node ) ), > - osm_physp_get_port_num( peer ) ); > - } > + if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce)) > + port_info_set = TRUE; > > if (enforce == FALSE) > - return FALSE; > + return port_info_set; > > p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks; > for (block_index = 0; block_index < p_pkey_tbl->used_blocks; > block_index++) > @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port( > osm_physp_get_port_num( peer ) ); > } > > + if (port_info_set) return TRUE; > return ret_val; > } > > @@ -541,10 +593,10 @@ osm_pkey_mgr_process( > signal = OSM_SIGNAL_DONE_PENDING; > p_node = osm_port_get_parent_node( p_port ); > if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) && > - pkey_mgr_update_peer_port( 
&p_osm->log, &p_osm->sm.req,
> +         pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
>                                     &p_osm->subn, p_port,
>                                     !p_osm->subn.opt.no_partition_enforcement ) )
> -      signal = OSM_SIGNAL_DONE_PENDING;
> +      signal = OSM_SIGNAL_DONE_PENDING;
>   }
>
> _err:
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0003-Fix-cases-where-the-pkey-manager-returned-OSM_SIGNAL_DONE-and-not.txt
URL:

From ogerlitz at voltaire.com Tue Dec 19 06:01:04 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 16:01:04 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219131625.GE30743@mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com> <20061219131625.GE30743@mellanox.co.il>
Message-ID: <4587F0A0.1080401@voltaire.com>

Michael S. Tsirkin wrote:
>> 1) there are three patches
>
> Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> It is not 100% but the only work around for proprietary SMs.
> Fixing the SA is a full solution. We (Mellanox) will work with SA vendors to
> get this addressed. But of course this takes time.

cma_tavor_quirk.patch matches the patch you have applied to the opensm,
so there are two patches at least, one at the stack and one at the
opensm. I don't think you can assume that all SA vendors would apply the
opensm approach and hence running with them the fixed cma tavor quirk as
which you have suggested today is useless with them (Specifically before
they even consider to apply it... so if someone runs OFED X.Y they would
not get 1K mtu with Tavor)

> cma-tavor-quirk in OFED 1.1 is broken but not by design -
> the patch I posted recently fixes the bug and should work with any compliant SM.
> I did not look at the opensm code specifically, but the
> "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
>
> MtuSelector 2 432 In a query request:
>                   3-largest MTU available
>                   If MTU is specified (i.e., the ComponentMask bit for
>                   MTU is 1):
>                   0-greater than MTU specified
>                   1-less than MTU specified
>                   2-exactly the MTU specified
>
> So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> we should simply fix it. If there are other broken SMs, we can look at how they
> are broken and how to solve this.

The SM team here don't think our SM is broken b/c it does not return 1K
path mtu where the minimal mtu as reported in the port info along the
path is 2k, and as i told you so does opensm without the quirk

>
>> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
>> and hence it was not pushed upstream, and vice versa an upstream ipoib
>> code is broken with the open-sm running with the quirk!
>
> All this is incorrect. ipoib-selector is completely irrelevant to the MTU
> issue - it's a strict compliance fix for IPoIB. IPoIB also works fine without
> this patch (with or without tavor quirk activated). It does not depend on any
> specific SM. It is not upstream because of style issues only and due to my lack
> of time to fix it.

this reminds me that there is a need to do OFED 1.1 wrapup in the sense
we have to see which patches from the kernel_patches/fixes directory
were ***not*** pushed upstream to 2.6.19-rcX nor 2.6.20-rc1 and then
conduct some sort of discussion on each to decide what to do with it for
OFED 1.2

> IB/ipoib: use appropriate mtu selector for path queries
>
> IPoIB must set mtu selector in path record query according to dev->mtu:
> if we wildcard it, SM can select a path with lower MTU.
> This breaks IPoIB on networks with SM Tavor quirk activated.

mmm, re-reading the open sm code, i think you are right that the
ipoib-selector patch is independent of the open SM tavor quirk, but then
i don't understand what you were trying to say in the above two lines of
the change log, what can break the SM tavor quirk???

Or.

From mst at mellanox.co.il Tue Dec 19 06:17:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 16:17:36 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587F0A0.1080401@voltaire.com>
References: <4587F0A0.1080401@voltaire.com>
Message-ID: <20061219141736.GD2075@mellanox.co.il>

I'm really going off in a hurry, but for now:

> Quoting r. Or Gerlitz :
> Subject: Re: tavor quirks etc (opensm compliance etc)
>
> Michael S. Tsirkin wrote:
> >> 1) there are three patches
> >
> > Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> > It is not 100% but the only work around for proprietary SMs.
> > Fixing the SA is a full solution. We (Mellanox) will work with SA vendors to
> > get this addressed. But of course this takes time.
>
> cma_tavor_quirk.patch matches the patch you have applied to the opensm,
> so there are two patches at least, one at the stack and one at the
> opensm. I don't think you can assume that all SA vendors would apply the
> opensm approach and hence running with them the fixed cma tavor quirk as
> which you have suggested today is useless with them (Specifically before
> they even consider to apply it... so if someone runs OFED X.Y they would
> not get 1K mtu with Tavor)

See below. With (fixed) cma_tavor_quirk.patch we are asking the SA
to give us a path with 1/2K MTU. If such path exists, SA should give it to us,
if it does not exist we should not try using it.

> > cma-tavor-quirk in OFED 1.1 is broken but not by design -
> > the patch I posted recently fixes the bug and should work with any compliant SM.
> > I did not look at the opensm code specifically, but the
> > "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> >
> > MtuSelector 2 432 In a query request:
> >                   3-largest MTU available
> >                   If MTU is specified (i.e., the ComponentMask bit for
> >                   MTU is 1):
> >                   0-greater than MTU specified
> >                   1-less than MTU specified
> >                   2-exactly the MTU specified
> >
> > So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> > we should simply fix it. If there are other broken SMs, we can look at how they
> > are broken and how to solve this.
>
> The SM team here don't think our SM is broken b/c it does not return 1K
> path mtu where the minimal mtu as reported in the port info along the
> path is 2k, and as i told you so does opensm without the quirk

Doesn't make sense to me, and I don't understand how you interpret MtuSelector.
Does the port really report minimal MTU 2K? So how does lower MTU work at all
then? Are you saying some HCA/switch has a broken SMA? Eitan?

> >
> >> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
> >> and hence it was not pushed upstream, and vice versa an upstream ipoib
> >> code is broken with the open-sm running with the quirk!
> >
> > All this is incorrect. ipoib-selector is completely irrelevant to the MTU
> > issue - it's a strict compliance fix for IPoIB. IPoIB also works fine without
> > this patch (with or without tavor quirk activated). It does not depend on any
> > specific SM. It is not upstream because of style issues only and due to my lack
> > of time to fix it.
>
> this reminds me that there is a need to do OFED 1.1 wrapup in the sense
> we have to see which patches from the kernel_patches/fixes directory
> were ***not*** pushed upstream to 2.6.19-rcX nor 2.6.20-rc1 and then
> conduct some sort of discussion on each to decide what to do with it for
> OFED 1.2

Most were pushed, a couple are outstanding.
It's on my TODO, but if you want to start working on it go ahead.

> > IB/ipoib: use appropriate mtu selector for path queries
> >
> > IPoIB must set mtu selector in path record query according to dev->mtu:
> > if we wildcard it, SM can select a path with lower MTU.
> > This breaks IPoIB on networks with SM Tavor quirk activated.
>
> mmm, re-reading the open sm code, i think you are right that the
> ipoib-selector patch is independent of the open SM tavor quirk, but then
> i don't understand what you were trying to say in the above two lines of
> the change log, what can break the SM tavor quirk???
>
> Or.

The change log is wrong there.

--
MST

From halr at voltaire.com Tue Dec 19 06:17:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 09:17:08 -0500
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail to report back correct signal
In-Reply-To: <45870A49.1070205@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il>
Message-ID: <1166537769.32666.247557.camel@hal.voltaire.com>

On Mon, 2006-12-18 at 16:38, Eitan Zahavi wrote:
> Hi Hal,
>
> This is a resend as I did not see a bounce of the list of the previous
> posting I did using git-send-email (probably due to misuse).
> The following patch fixes bugs in the ucast manager and pkey manager
> such that they do not report correct signal back.
> In both cases some outstanding SubnSet were ignored.
>
> Signed-off-by: Eitan Zahavi

> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index e977253..8cfe09e 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c

Thanks! I applied the osm_ucast_mgr.c part of this (and not the
osm_pkey_mgr.c part).

--
Hal

From ogerlitz at voltaire.com Tue Dec 19 06:27:44 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 16:27:44 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219141736.GD2075@mellanox.co.il>
References: <4587F0A0.1080401@voltaire.com> <20061219141736.GD2075@mellanox.co.il>
Message-ID: <4587F6E0.10000@voltaire.com>

Michael S. Tsirkin wrote:
> I'm really going off in a hurry, but for now:

enjoy your vacation, don't worry, let's discuss this next week when you
are back. If you want, you or Eitan or anyone else that wants to jump on
it can send an RFC with the two patches (cma and opensm tavor quirks),
and we can discuss why they are better than my simplified patch, what
are the associated dependencies etc etc

Or.

From kliteyn at dev.mellanox.co.il Tue Dec 19 06:27:03 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 16:27:03 +0200
Subject: [openib-general] OSM: Using lid matrices in ucast manager
In-Reply-To: <1166530077.32666.242175.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com> <1166491856.29306.15.camel@localhost> <1166530077.32666.242175.camel@hal.voltaire.com>
Message-ID: <4587F6B7.7070805@dev.mellanox.co.il>

Hal,

Hal Rosenstock wrote:
> Hi Yevgeny & Sasha,
>
> On Mon, 2006-12-18 at 20:30, Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>>
>> On Tue, 2006-12-19 at 01:33 +0200, Yevgeny Kliteynik wrote:
>>> Hi Hal.
>>>
>>> I have a question about some patch that I want to send regarding lid
>>> matrices usage in osm ucast manager:
>>>
>>> The FatTree routing doesn’t use the min hop tables, so we can skip the
>>> lid matrices building in OSM.
>> The lid matrices are used in mcast_mgr for multicast routes generation.
>
> Good point but fat tree seems to work for multicast (at least in my
> subnet). How could that be ?

The patch that's checked in doesn't disable lid matrices creation, so we have
those matrices created, and then fat-tree routing configures LFTs, ignoring the
lid matrices.

-- Yevgeny

> -- Hal
>
>>> However, ucast manager uses these lid matrices also to get the max lid
>>> that is accessible from each switch, which defines the LFT table size.
>>>
>>> This max lid is obtained by calling osm_switch_get_max_lid_ho()
>>> function, which in turn, calls osm_lid_matrix_get_max_lid_ho() for the
>>> switch’s lid matrix.
>>>
>>> If the lid matrices weren’t built, then the
>>> osm_switch_get_max_lid_ho() function will return 0xFFFF,
>>> and eventually osm will crash.
>>>
>>> Of course, I don’t want to build all the lid matrices just to know the
>>> max lid, so here’s what I’ve done:
>>>
>>>   * I added a field to the osm_switch_t object: max_lid_ho (with
>>>     default value 0xFFFF, should it be 0x0 instead?).
>> Good thing. 0 is fine as default value IMHO.
>>
>>>   * Added three osm_switch_t methods for this new field:
>>>     getter, setter, and is_set that returns true if this field
>>>     has been set.
>> Why those methods? Everything you need is to access structure field and
>> 'if (sw->max_lid_ho)' for "is_set" checks.
>>
>>>   * The original osm_switch_get_max_lid_ho() has been updated to
>>>     return this field value if it’s set.
>>>   * Then in FatTree routing I set this field for each switch (I
>>>     get the max lid ‘for free’ as a byproduct of the algorithm).
>>>   * Now everything in the ucast manager works fine, except for the
>>>     following two dump functions:
>>>       __osm_ucast_mgr_dump_ucast_routes (it uses hops)
>>>       ucast_mgr_dump_lid_matrix (obviously…)
>>>     These two functions check at the beginning whether the
>>>     max_lid_ho was set (using the ‘is_set’ method), and return
>>>     w/o printing anything if the answer is yes.
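
The proposal in the quoted list, combined with Sasha's suggestion of a
plain field with 0 as the "not set" value, can be sketched roughly as
follows. This is a hypothetical illustration, not the opensm structures
or the patch that was eventually sent: struct sw_sketch and both helper
names are invented here, and matrix_max_lid_ho merely stands in for
whatever the lid matrix would report.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of osm_switch_t. */
struct sw_sketch {
	uint16_t max_lid_ho;        /* 0 = not set by the routing engine */
	uint16_t matrix_max_lid_ho; /* stand-in for the lid-matrix value */
};

/*
 * Prefer the value set explicitly by the routing engine (e.g. fat-tree),
 * and fall back to the lid-matrix value otherwise.
 */
static uint16_t sw_get_max_lid_ho(const struct sw_sketch *sw)
{
	return sw->max_lid_ho ? sw->max_lid_ho : sw->matrix_max_lid_ho;
}

/*
 * The dump routines that rely on hop counts would bail out early when
 * the max lid was set explicitly, since no lid matrix was built.
 */
static int sw_can_dump_lid_matrix(const struct sw_sketch *sw)
{
	return sw->max_lid_ho == 0;
}
```

With 0 as the default no separate is_set method is needed, at the cost
of not being able to represent a legitimately set max lid of 0.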
>>>
>>> This way any other routing engine that uses lid matrix is not affected
>>> by this change, and any routing engine that doesn’t use the lid matrix
>>> has a way to set the max lid per switch explicitly.
>> Hope you are adding this for existing code.
>>
>>> This approach works great, but I have a feeling that this is kinda
>>> hack…
>> Moving max_lid(_ho) to switch structure looks like a good idea for me
>> regardless to lid matrix build elimination.
>>
>> The only problem I can see with lid matrices is mcast_mgr which uses
>> this.
>>
>> Sasha
>>
>>> What do you think about this solution?
>>>
>>> Any other suggestions?
>>>
>>> Anyway, just wanted to hear your opinion before sending the patch.
>>>
>>> Regards,
>>>
>>> Yevgeny Kliteynik
>>>
>>> Mellanox Technologies LTD
>>> Tel: +972-4-909-7200 ext: 394
>>> Fax: +972-4-959-3245
>>> P.O. Box 586 Yokneam 20692 ISRAEL
>>>
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From tziporet at dev.mellanox.co.il Tue Dec 19 06:43:22 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 16:43:22 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <20061219122453.GC30743@mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net> <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> <20061219122453.GC30743@mellanox.co.il>
Message-ID: <4587FA8A.5070204@dev.mellanox.co.il>

Michael S. Tsirkin wrote:
>> So after a bit more testing, setting the route path mtu to 1024 before
>> the qp creation (rdma_create_qp()) seems sufficient.
>>
>
> OK, so the following fixes the tavor_quirk flag in cma to actually do something.
> Could you please replace the patch cma_tavor_quirk.patch with this one,
> set tavor_quirk option for cma module, and see if this works as expected?
>
> Unpack OFED 1.1, copy the following to
> OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch
> removing the patch by the same name that is in OFED
> (also remove xxx_cma_tavor_quirk.txt or other patches if you put them there)
> and then pack OFED 1.1 and rebuild.
>
>
> Thanks,
>
>

Hi Or,
Can you update OFED support page on Wiki with this issue?

Thanks,
Tziporet

From ogerlitz at voltaire.com Tue Dec 19 06:46:44 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 16:46:44 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <4587FA8A.5070204@dev.mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net> <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net> <20061219122453.GC30743@mellanox.co.il> <4587FA8A.5070204@dev.mellanox.co.il>
Message-ID: <4587FB54.3050502@voltaire.com>

Tziporet Koren wrote:
> Hi Or,
> Can you update OFED support page on Wiki with this issue?

Basically, yes but actually, not...

We (Michael and myself) do not agree yet on some issues here, also the
cma tavor quirk will not work with some 3rd party SM/SA, so for the time
being i will also put there a note on how to do it in the ULP level (eg
as Philippe was fixing Lustre)

Or.

From tziporet at mellanox.co.il Tue Dec 19 06:53:13 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 16:53:13 +0200
Subject: [openib-general] Status of old and new servers
In-Reply-To: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com>
References: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com>
Message-ID: <4587FCD9.3070000@mellanox.co.il>

Jeff Squyres wrote:
> 3a. Reflecting that most activity is occurring in git, commits will
> be disabled in all SVN trees by default.
> If you want your tree left
> enabled in SVN for commits, please reply to this e-mail indicating
> exactly which tree you want enabled for commits and a specific list
> of usernames that are allowed to commit to the tree.
>
>
Me, Vlad and Hal need permission for check-in for OFED 1.1 support.
So we need check in for: https://openib.org/svn/gen2/branches/1.1/

>
> 7. You are among the elite group who managed to read this entire
> e-mail. Congratulations. Call your local representative to claim your
> fabulous prize.
>
>
I want my prize :-)

Tziporet

From philippe_bernadat at hp.com Tue Dec 19 06:57:00 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Tue, 19 Dec 2006 15:57:00 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)
In-Reply-To: <4587FB54.3050502@voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055B1FCB@idaexc03.emea.cpqcorp.net>

Koren & Or,

I am building and testing as we speak.

But my feeling is that this issue shouldn't require user to set
the tavor_quirk param. The stack should detect this HCA flavor
at the appropriate end (active according to Or) and should
automatically adjust the MTU.

Philippe

> -----Original Message-----
> From: Or Gerlitz [mailto:ogerlitz at voltaire.com]
> Sent: Tuesday, December 19, 2006 3:47 PM
> To: Tziporet Koren
> Cc: Michael S. Tsirkin; Bernadat, Philippe; Roland Dreier;
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with
> OFED v. Voltaire(lustre)
>
> Tziporet Koren wrote:
> > Hi Or,
> > Can you update OFED support page on Wiki with this issue?
>
> Basically, yes but actually, not...
>
> We (Michael and myself) do not agree yet on some issues here, also the
> cma tavor quirk will not work with some 3rd party SM/SA, so for the time
> being i will also put there a note on how to do it in the ULP level (eg
> as Philippe was fixing Lustre)
>
> Or.
> > From jsquyres at cisco.com Tue Dec 19 07:07:47 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 19 Dec 2006 10:07:47 -0500 Subject: [openib-general] Status of old and new servers In-Reply-To: <4587FCD9.3070000@mellanox.co.il> References: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com> <4587FCD9.3070000@mellanox.co.il> Message-ID: <1EFD919D-A491-4238-9957-0D370F52CFBA@cisco.com> On Dec 19, 2006, at 9:53 AM, Tziporet Koren wrote: >> 3a. Reflecting that most activity is occurring in git, commits >> will be disabled in all SVN trees by default. If you want your >> tree left enabled in SVN for commits, please reply to this e-mail >> indicating exactly which tree you want enabled for commits and a >> specific list of usernames that are allowed to commit to the tree. >> > Me, Vlad and Hal need permission for check-in for OFED 1.1 support. > So we need check in for: https://openib.org/svn/gen2/branches/1.1/ It shall be so. >> 7. You are among the elite group who managed to read this entire >> e- mail. Congratulations. Call your local representative to >> claim your fabulous prize. >> > I want my prize :-) I'm sorry ma'am, only your local representative can help you with that. Please hold... ;-) -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From philippe_bernadat at hp.com Tue Dec 19 07:20:07 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Tue, 19 Dec 2006 16:20:07 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <20061219122453.GC30743@mellanox.co.il> Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055B203A@idaexc03.emea.cpqcorp.net> Sorry to say that this still doesn't do it. Are we sure we go this path ? 
I double checked the code I compiled and tried was: static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms, struct cma_work *work) { struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; struct ib_sa_path_rec path_rec; ib_sa_comp_mask mask; memset(&path_rec, 0, sizeof path_rec); ib_addr_get_sgid(addr, &path_rec.sgid); ib_addr_get_dgid(addr, &path_rec.dgid); path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); path_rec.numb_path = 1; if (tavor_quirk) { path_rec.mtu_selector = IB_SA_LT; path_rec.mtu = IB_MTU_2048; mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU; } else mask = 0; id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, id_priv->id.port_num, &path_rec, mask | IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH, timeout_ms, GFP_KERNEL, cma_query_handler, work, &id_priv->query); return (id_priv->query_id < 0) ? id_priv->query_id : 0; } Philippe > -----Original Message----- > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > Sent: Tuesday, December 19, 2006 1:25 PM > To: Bernadat, Philippe > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org > Subject: Re: Performance Degradation with OFED v. Voltaire(lustre) > > > So after a bit more testing, setting the route path mtu to > 1024 before > > the qp creation (rdma_create_qp()) seems sufficient. > > OK, so the following fixes the tavor_quirk flag in cma to > actually do something. > Could you please replace the patch cma_tavor_quirk.patch with > this one, > set tavor_quirk option for cma module, and see if this works > as expected? > > Unpack OFED 1.1, copy the following to > OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch > removing the patch by the same name that is in OFED > (also remove xxx_cma_tavor_quirk.txt or other patches if you > put them there) > and then pack OFED 1.1 and rebuild. > > > Thanks, > > ----------------- > > Tavor systems get better performance with 1K MTU. 
Since there does > not seem to be any way to find out whether the remote system > uses Tavor, > add an option to limit the MTU globally. > > Signed-off-by: Michael S. Tsirkin > > --- > > diff --git a/drivers/infiniband/core/cma.c > b/drivers/infiniband/core/cma.c > index 50150c8..261bf45 100644 > --- a/drivers/infiniband/core/cma.c > +++ b/drivers/infiniband/core/cma.c > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty"); > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > MODULE_LICENSE("Dual BSD/GPL"); > > +static int tavor_quirk = 0; > +module_param_named(tavor_quirk, tavor_quirk, int, 0644); > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: > limit MTU to 1K if > 0"); > + > #define CMA_CM_RESPONSE_TIMEOUT 20 > #define CMA_MAX_CM_RETRIES 3 > > @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct > rdma_id_private *id_priv, int timeout_ms, > { > struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > struct ib_sa_path_rec path_rec; > + ib_sa_comp_mask mask; > > memset(&path_rec, 0, sizeof path_rec); > ib_addr_get_sgid(addr, &path_rec.sgid); > @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct > rdma_id_private *id_priv, int timeout_ms, > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > path_rec.numb_path = 1; > > + if (tavor_quirk) { > + path_rec.mtu_selector = IB_SA_LT; > + path_rec.mtu = IB_MTU_2048; > + mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU; > + } else > + mask = 0; > + > id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > - id_priv->id.port_num, &path_rec, > + id_priv->id.port_num, &path_rec, mask | > IB_SA_PATH_REC_DGID | > IB_SA_PATH_REC_SGID | > IB_SA_PATH_REC_PKEY | > IB_SA_PATH_REC_NUMB_PATH, > timeout_ms, GFP_KERNEL, > > -- > MST > From halr at voltaire.com Tue Dec 19 07:21:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 19 Dec 2006 10:21:50 -0500 Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail to report back correct signal In-Reply-To: 
<4587F051.2070000@mellanox.co.il> References: <45870A49.1070205@mellanox.co.il> <1166535535.32666.246023.camel@hal.voltaire.com> <4587EE0F.8090603@mellanox.co.il> <4587F051.2070000@mellanox.co.il> Message-ID: <1166541690.32666.249865.camel@hal.voltaire.com> On Tue, 2006-12-19 at 08:59, Eitan Zahavi wrote: > Seems like it is line wrapped this time too. > I need a new mailer. > So I will attach the file and send it. Or some different mail options/commands. > Sorry about that. > > Eitan > > Eitan Zahavi wrote: > > Hi Hal > > > > Hope this will work > > > > EZ > > > > > > From 557b0504ab317c470d376f15d7c6d5ed1c9d11f5 Mon Sep 17 00:00:00 2001 > > From: Eitan Zahavi > > Date: Mon, 18 Dec 2006 21:48:45 +0200 > > Subject: [PATCH] Fix cases where the pkey manager returned > > OSM_SIGNAL_DONE and not > > OSM_SIGNAL_DONE_PENDING by missing some sent packets > > --- > > osm/opensm/osm_pkey_mgr.c | 112 Thanks. Applied. -- Hal From philippe_bernadat at hp.com Tue Dec 19 07:33:41 2006 From: philippe_bernadat at hp.com (Bernadat, Philippe) Date: Tue, 19 Dec 2006 16:33:41 +0100 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net> I checked. We apparently never go through this path (with lustre) > -----Original Message----- > From: Bernadat, Philippe > Sent: Tuesday, December 19, 2006 4:20 PM > To: Michael S. Tsirkin > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org > Subject: RE: Performance Degradation with OFED v. Voltaire(lustre) > > Sorry to say that this still doesn't do it. > Are we sure we go this path ? 
> > I double checked the code I compiled and tried was: > > static int cma_query_ib_route(struct rdma_id_private > *id_priv, int timeout_ms, > struct cma_work *work) > { > struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > struct ib_sa_path_rec path_rec; > ib_sa_comp_mask mask; > > memset(&path_rec, 0, sizeof path_rec); > ib_addr_get_sgid(addr, &path_rec.sgid); > ib_addr_get_dgid(addr, &path_rec.dgid); > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > path_rec.numb_path = 1; > > if (tavor_quirk) { > path_rec.mtu_selector = IB_SA_LT; > path_rec.mtu = IB_MTU_2048; > mask = IB_SA_PATH_REC_MTU_SELECTOR | > IB_SA_PATH_REC_MTU; > } else > mask = 0; > > id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > id_priv->id.port_num, > &path_rec, mask | > IB_SA_PATH_REC_DGID | > IB_SA_PATH_REC_SGID | > IB_SA_PATH_REC_PKEY | > IB_SA_PATH_REC_NUMB_PATH, > timeout_ms, GFP_KERNEL, > cma_query_handler, work, > &id_priv->query); > > return (id_priv->query_id < 0) ? id_priv->query_id : 0; > } > > Philippe > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Tuesday, December 19, 2006 1:25 PM > > To: Bernadat, Philippe > > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org > > Subject: Re: Performance Degradation with OFED v. Voltaire(lustre) > > > > > So after a bit more testing, setting the route path mtu to > > 1024 before > > > the qp creation (rdma_create_qp()) seems sufficient. > > > > OK, so the following fixes the tavor_quirk flag in cma to > > actually do something. > > Could you please replace the patch cma_tavor_quirk.patch with > > this one, > > set tavor_quirk option for cma module, and see if this works > > as expected? 
> > > > Unpack OFED 1.1, copy the following to > > OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch > > removing the patch by the same name that is in OFED > > (also remove xxx_cma_tavor_quirk.txt or other patches if you > > put them there) > > and then pack OFED 1.1 and rebuild. > > > > > > Thanks, > > > > ----------------- > > > > Tavor systems get better performance with 1K MTU. Since there does > > not seem to be any way to find out whether the remote system > > uses Tavor, > > add an option to limit the MTU globally. > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > diff --git a/drivers/infiniband/core/cma.c > > b/drivers/infiniband/core/cma.c > > index 50150c8..261bf45 100644 > > --- a/drivers/infiniband/core/cma.c > > +++ b/drivers/infiniband/core/cma.c > > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty"); > > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > > MODULE_LICENSE("Dual BSD/GPL"); > > > > +static int tavor_quirk = 0; > > +module_param_named(tavor_quirk, tavor_quirk, int, 0644); > > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: > > limit MTU to 1K if > 0"); > > + > > #define CMA_CM_RESPONSE_TIMEOUT 20 > > #define CMA_MAX_CM_RETRIES 3 > > > > @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct > > rdma_id_private *id_priv, int timeout_ms, > > { > > struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > > struct ib_sa_path_rec path_rec; > > + ib_sa_comp_mask mask; > > > > memset(&path_rec, 0, sizeof path_rec); > > ib_addr_get_sgid(addr, &path_rec.sgid); > > @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct > > rdma_id_private *id_priv, int timeout_ms, > > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > > path_rec.numb_path = 1; > > > > + if (tavor_quirk) { > > + path_rec.mtu_selector = IB_SA_LT; > > + path_rec.mtu = IB_MTU_2048; > > + mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU; > > + } else > > + mask = 0; > > + > > id_priv->query_id = 
ib_sa_path_rec_get(id_priv->id.device, > > - id_priv->id.port_num, &path_rec, > > + id_priv->id.port_num, &path_rec, mask | > > IB_SA_PATH_REC_DGID | > > IB_SA_PATH_REC_SGID | > > IB_SA_PATH_REC_PKEY | > > IB_SA_PATH_REC_NUMB_PATH, > > timeout_ms, GFP_KERNEL, > > > > -- > > MST > > From mst at mellanox.co.il Tue Dec 19 07:48:00 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 17:48:00 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net> Message-ID: <20061219154800.GB3428@mellanox.co.il> Interesting. So, does lustre actually work on top of rdma_cm? Quoting r. Bernadat, Philippe : Subject: RE: Performance Degradation with OFED v. Voltaire(lustre) I checked. We apparently never go through this path (with lustre) > -----Original Message----- > From: Bernadat, Philippe > Sent: Tuesday, December 19, 2006 4:20 PM > To: Michael S. Tsirkin > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org > Subject: RE: Performance Degradation with OFED v. Voltaire(lustre) > > Sorry to say that this still doesn't do it. > Are we sure we go this path ? 
> > I double checked the code I compiled and tried was: > > static int cma_query_ib_route(struct rdma_id_private > *id_priv, int timeout_ms, > struct cma_work *work) > { > struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > struct ib_sa_path_rec path_rec; > ib_sa_comp_mask mask; > > memset(&path_rec, 0, sizeof path_rec); > ib_addr_get_sgid(addr, &path_rec.sgid); > ib_addr_get_dgid(addr, &path_rec.dgid); > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > path_rec.numb_path = 1; > > if (tavor_quirk) { > path_rec.mtu_selector = IB_SA_LT; > path_rec.mtu = IB_MTU_2048; > mask = IB_SA_PATH_REC_MTU_SELECTOR | > IB_SA_PATH_REC_MTU; > } else > mask = 0; > > id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device, > id_priv->id.port_num, > &path_rec, mask | > IB_SA_PATH_REC_DGID | > IB_SA_PATH_REC_SGID | > IB_SA_PATH_REC_PKEY | > IB_SA_PATH_REC_NUMB_PATH, > timeout_ms, GFP_KERNEL, > cma_query_handler, work, > &id_priv->query); > > return (id_priv->query_id < 0) ? id_priv->query_id : 0; > } > > Philippe > > > -----Original Message----- > > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] > > Sent: Tuesday, December 19, 2006 1:25 PM > > To: Bernadat, Philippe > > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org > > Subject: Re: Performance Degradation with OFED v. Voltaire(lustre) > > > > > So after a bit more testing, setting the route path mtu to > > 1024 before > > > the qp creation (rdma_create_qp()) seems sufficient. > > > > OK, so the following fixes the tavor_quirk flag in cma to > > actually do something. > > Could you please replace the patch cma_tavor_quirk.patch with > > this one, > > set tavor_quirk option for cma module, and see if this works > > as expected? 
> > > > Unpack OFED 1.1, copy the following to > > OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch > > removing the patch by the same name that is in OFED > > (also remove xxx_cma_tavor_quirk.txt or other patches if you > > put them there) > > and then pack OFED 1.1 and rebuild. > > > > > > Thanks, > > > > ----------------- > > > > Tavor systems get better performance with 1K MTU. Since there does > > not seem to be any way to find out whether the remote system > > uses Tavor, > > add an option to limit the MTU globally. > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > diff --git a/drivers/infiniband/core/cma.c > > b/drivers/infiniband/core/cma.c > > index 50150c8..261bf45 100644 > > --- a/drivers/infiniband/core/cma.c > > +++ b/drivers/infiniband/core/cma.c > > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty"); > > MODULE_DESCRIPTION("Generic RDMA CM Agent"); > > MODULE_LICENSE("Dual BSD/GPL"); > > > > +static int tavor_quirk = 0; > > +module_param_named(tavor_quirk, tavor_quirk, int, 0644); > > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: > > limit MTU to 1K if > 0"); > > + > > #define CMA_CM_RESPONSE_TIMEOUT 20 > > #define CMA_MAX_CM_RETRIES 3 > > > > @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct > > rdma_id_private *id_priv, int timeout_ms, > > { > > struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; > > struct ib_sa_path_rec path_rec; > > + ib_sa_comp_mask mask; > > > > memset(&path_rec, 0, sizeof path_rec); > > ib_addr_get_sgid(addr, &path_rec.sgid); > > @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct > > rdma_id_private *id_priv, int timeout_ms, > > path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr)); > > path_rec.numb_path = 1; > > > > + if (tavor_quirk) { > > + path_rec.mtu_selector = IB_SA_LT; > > + path_rec.mtu = IB_MTU_2048; > > + mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU; > > + } else > > + mask = 0; > > + > > id_priv->query_id = 
ib_sa_path_rec_get(id_priv->id.device, > > - id_priv->id.port_num, &path_rec, > > + id_priv->id.port_num, &path_rec, mask | > > IB_SA_PATH_REC_DGID | > > IB_SA_PATH_REC_SGID | > > IB_SA_PATH_REC_PKEY | > > IB_SA_PATH_REC_NUMB_PATH, > > timeout_ms, GFP_KERNEL, > > > > -- > > MST > > -- MST From mst at mellanox.co.il Tue Dec 19 07:50:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 17:50:48 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <4587FB54.3050502@voltaire.com> References: <4587FB54.3050502@voltaire.com> Message-ID: <20061219155048.GC3428@mellanox.co.il> > > Hi Or, > > Can you update OFED support page on Wiki with this issue? > > Basically, yes but actually, not... > > We (Michael and myself) do not agree yet on some issues here, also the > cma tavor quirk will not work with some 3rd party SM/SA, This last issue could be addressed by e.g. forcing MTU if SA does not give us a path we asked for. Any data on which SA is this? What does it do when you set MTU selector to "less than"? > so for the time > being i will also put there a note on how to do it in the ULP level (eg > as Philippe was fixing Lustre) Makes sense. -- MST From mst at mellanox.co.il Tue Dec 19 07:54:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 17:54:18 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055B1FCB@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E055B1FCB@idaexc03.emea.cpqcorp.net> Message-ID: <20061219155418.GD3428@mellanox.co.il> Correct. But endnode does not know what's on the other side of the link, and which path MTU is best. This is SA's job (SA sees all the topology), and you will get exactly this behaviour you ask for if you run opensm with "quirk mode" enabled (you must disable the *other* SM you are running though for this to take effect). 
This mode will be enabled by default in OFED 1.2. Quoting r. Bernadat, Philippe : Subject: RE: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) Koren & Or, I am building and testing as we speak. But my feeling is that this issue shouldn't require user to set the tavor_quirk param. The stack should detect this HCA flavor at the appropriate end (active according to Or) and should automatically adjust the MTU. Philippe > -----Original Message----- > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] > Sent: Tuesday, December 19, 2006 3:47 PM > To: Tziporet Koren > Cc: Michael S. Tsirkin; Bernadat, Philippe; Roland Dreier; > openib-general at openib.org > Subject: Re: [openib-general] Performance Degradation with > OFED v. Voltaire(lustre) > > Tziporet Koren wrote: > > Hi Or, > > Can you update OFED support page on Wiki with this issue? > > Basically, yes but actually, not... > > We (Michael and myself) do not agree yet on some issues here, > also the > cma tavor quirk will not work with some 3rd party SM/SA, so > for the time > being i will also put there a note on how to do it in the ULP > level (eg > as Philippe was fixing Lustre) > > Or. > > -- MST From mst at mellanox.co.il Tue Dec 19 08:02:21 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 18:02:21 +0200 Subject: [openib-general] tavor quirks etc (opensm compliance etc) In-Reply-To: <4587F6E0.10000@voltaire.com> References: <4587F6E0.10000@voltaire.com> Message-ID: <20061219160221.GE3428@mellanox.co.il> > Subject: Re: tavor quirks etc (opensm compliance etc) > > Michael S. 
Tsirkin wrote:
> > I'm really going off in a hurry, but for now:
>
> enjoy your vacation, don't worry, let's discuss this next week when you
> are back, if you want, you or Eitan or anyone else that wants to jump on
> it can send an RFC with the two patches (cma and opensm tavor quirks),
> and we can discuss why they are better than my simplified patch, what
> are the associated dependencies etc etc

Or, thanks.
Note opensm support is already in, and CMA patch was also in OFED 1.1 and were
discussed before OFED 1.1 - it had a trivial typo but I just fixed the missing
comp mask selector and it will be pushed to ofed 1.2 tree at staging in short
order.

I am not yet sure what is best for upstream, so I don't really think we need
any RFCs. We'll need data from SM guys on whether MTU selector actually works
in SMs, and if not what happens when you enable it.

--
MST

From tziporet at mellanox.co.il Tue Dec 19 08:04:54 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 18:04:54 +0200
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
Message-ID: <45880DA6.4040403@mellanox.co.il>

Agenda items:
1. Daily build update
2. OFED 1.2 features status
3. SVN server change - decisions are covered in Jeff's mail

Meeting summary:

*1. Daily build update:*
Daily build is now based on kernel 2.6.20-rc1.
Testing status:
* Voltaire started daily testing based on this build.
* Qlogic - will start next week
* IBM - will start testing this week
* Mellanox - testing runs daily based on daily build
* Need to know what is the git branch for ucma and udapl - Sean and Arlin

*2. Features update:*
We reviewed the features list that is published on the Wiki and in general
most items are on schedule for 31-Jan.
Some updates and AIs:
* Prepare SA cache for OFED 1.2 - Sean/Woody
* VNIC: Qlogic are working according to the How-to explanation. Should be
  ready soon.
* ehca - Interrupt handling for IPoIB NAPI support may miss kernel 2.6.10
  but should be ready for OFED.
* Memory windows may be dropped from libibverbs 1.1
* QoS - coding was not started but should be OK.
* Open MPI: alpha release will include pre-release and the final version
  will be replaced for the beta
* MVAPICH 0.9.9 is on track for the code freeze too.
* iWARP - no iWARP representative joined the meeting. Need an update
  from some iWARP developers regarding their progress in preparing
  iWARP for OFED 1.2.
* Bonding module - Voltaire are working on backport patches for SLES10
  and Redhat EL4. Wish that this module will be part of OFED
* RDS - License issue is still pending Oracle legal department. Maybe as
  an add-on package

Note: There was a discussion about moving Roland's user space git from
kernel.org server to OFA server. We put this discussion on hold since Roland
was not on the meeting.
AI - Roland to participate in the next meeting to close this subject.

Tziporet

From sweitzen at cisco.com Tue Dec 19 08:58:17 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 19 Dec 2006 08:58:17 -0800
Subject: [openib-general] OFED release testing Task Force
Message-ID:

I can represent Cisco.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems

________________________________
From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi
Sent: Wednesday, November 22, 2006 10:30 AM
To: openfabrics-ewg at openib.org
Cc: openib-general at openib.org
Subject: [openib-general] OFED release testing Task Force

Hi,
As a follow-up on the presentation prepared and presented by Amit Krig and
myself in the OFA Meeting during SC06 I'm sending out this e-mail as a call
for participation. The targets of the Ad-hoc task force will be (as agreed
upon in the session we had): unify the test results formats, define release
quality criteria, define/assign ULP verification owners and enhance
interoperability finger-print in the release process.
We would like to have a participant from each contributing company and would appreciate any response sent to me with a name of a person from the company to attend and take action on behalf of this task force. BTW: I've also attached the presentation that was given in the OFA meeting. <> Happy Holidays to every one, Nimrod Gindi Mellanox Technologies Ltd. mail : nimrodg at mellanox.com Cell : +1-408-750-4801 Office: +1-347-342-0011 Fax : +1-212-987-0275 -------------- next part -------------- An HTML attachment was scrubbed... URL: From swise at opengridcomputing.com Tue Dec 19 09:26:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 19 Dec 2006 11:26:55 -0600 Subject: [openib-general] OFED 1.2 18-Dec meeting summary In-Reply-To: <45880DA6.4040403@mellanox.co.il> References: <45880DA6.4040403@mellanox.co.il> Message-ID: <1166549215.31612.18.camel@stevo-desktop> On Tue, 2006-12-19 at 18:04 +0200, Tziporet Koren wrote: > * iWARP - no iWARP representative joined the meeting. Need an > update > form some iWARP developers regarding their progress in preparing > iWARP for OFED 1.2. iWARP support is in 2.6.19. The question is really if any iWARP device drivers/libraries will be in OFED 1.2. Timewise, I'm not in a position to push in the Ammasso device. If somebody wants it in OFED 1.2, then they should drive that. The kernel driver will be in OFED 1.2 because its in 2.6.19. The library would need an owner to do the work to get it into OFED 1.2. I'm focusing now on getting the Chelsio drivers into 2.6.20. If that doesn't happen, will OFED 1.2 still entertain pulling in Chelsio drivers? Either way, I cannot begin to work on OFED 1.2 with Chelsio until the new year. Steve. 
From nimrodg at mellanox.com Tue Dec 19 09:28:41 2006 From: nimrodg at mellanox.com (Nimrod Gindi) Date: Tue, 19 Dec 2006 09:28:41 -0800 Subject: [openib-general] OFED release testing Task Force Message-ID: <1E3DCD1C63492545881FACB6063A57C1AF865F@mtiexch01.mti.com> Thanks - I will send a consolidating e-mail to the task force people and will try to have the kick off meeting 1st week of 2007 Nimrod Gindi Mellanox Technologies Ltd. mail: nimrodg at mellanox.com Cellular: +1-408-750-4801 Office: +1-347-342-0011 Fax: +1-212-987-0275 ----- Original Message ----- From: Scott Weitzenkamp (sweitzen) To: Nimrod Gindi; openfabrics-ewg at openib.org Cc: openib-general at openib.org Sent: Tue Dec 19 08:58:17 2006 Subject: RE: [openib-general] OFED release testing Task Force I can represent Cisco. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi Sent: Wednesday, November 22, 2006 10:30 AM To: openfabrics-ewg at openib.org Cc: openib-general at openib.org Subject: [openib-general] OFED release testing Task Force Hi, As a follow-up on the presentation prepared and presented by Amit Krig and my-self in the OFA Meeting during SC06 I'm sending out this e-mail as a call for participation. The targets of the Ad-hoc task force will be (as agreed upon in the session we had): unify the test results formats, define release quality criteria, define/assign ULP verification owners and enhance interoperability finger-print in the release process. We would like to have a participant from each contributing company and would appreciate any response sent to me with a name of a person from the company to attend and take action on behalf of this task force. BTW: I've also attached the presentation that was given in the OFA meeting. <> Happy Holidays to every one, Nimrod Gindi Mellanox Technologies Ltd. 
mail : nimrodg at mellanox.com
Cell : +1-408-750-4801
Office: +1-347-342-0011
Fax : +1-212-987-0275
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kliteyn at dev.mellanox.co.il Tue Dec 19 09:25:04 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 19:25:04 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219131625.GE30743@mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com> <20061219131625.GE30743@mellanox.co.il>
Message-ID: <45882070.8040101@dev.mellanox.co.il>

Michael,

Michael S. Tsirkin wrote:
>> The problems i see with the current approach are:
>>
>> 1) there are three patches
>
> Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> It is not 100% but the only work around for proprietary SMs.
> Fixing the SA is a full solution. We (Mellanox) will work with SA vendors to
> get this addressed. But of course this takes time.
>
>> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
>> since it assumes the opensm-tavor-quirk and it would not work with
>> opensm that does not have it nor with 3rd party/commercial SMs which do
>> not have similar quirk
>
> cma-tavor-quirk in OFED 1.1 is broken but not by design -
> the patch I posted recently fixes the bug and should work with any compliant SM.
> I did not look at the opensm code specifically, but the
> "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
>
>   MtuSelector 2 432 In a query request:
>                     3-largest MTU available
>                     If MTU is specified (i.e., the ComponentMask bit for
>                     MTU is 1):
>                     0-greater than MTU specified
>                     1-less than MTU specified
>                     2-exactly the MTU specified
>
> So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> we should simply fix it. If there are other broken SMs, we can look at how they
> are broken and how to solve this.

OSM implementation in this case matches the IB spec.
On page 905, table 207, there's an example of such a
request:
  Required MTU = 4 (2048)
  Required MTUSelector = 1 ('less-than')
And then it is explained that the required path records
should have MTU of 1024 or lower.

OSM implementation converts these rules to code AS IS.

Now, what you're actually saying is that the specification
in this case is bad. In our discussion, you said that if
you request MTU of X with MTU selector of 'less-than', you
want to also get any path records that support MTU greater
than X, because they also support MTUs <= X.
The question is, if your understanding of spec is right,
what's the point of having 'less-than' selector at all?
I mean, if selector says 'less-than', but you also accept
MTUs that are 'equal' and 'greater-than', then it looks like
you actually don't care about the MTU, because any MTU would
be OK.

--Yevgeny

>> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
>> and hence it was not pushed upstream, and vice versa an upstream ipoib
>> code is broken with the open-sm running with the quirk!
>
> All this is incorrect. ipoib-selector is completely irrelevant to the MTU
> issue - it's a strict compliance fix for IPoIB. IPoIB also works fine without
> this patch (with or without tavor quirk activated). It does not depend on any
> specific SM. It is not upstream because of style issues only and due to my lack
> of time to fix it.
>

From sweitzen at cisco.com Tue Dec 19 09:36:59 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 19 Dec 2006 09:36:59 -0800
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
Message-ID:

> Meeting summary:
> *1. Daily build update:*
> Daily build is now based on kernel 2.6.20-rc1.

Where is the daily build?

Scott

From mst at mellanox.co.il Tue Dec 19 10:35:22 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 19 Dec 2006 20:35:22 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45882070.8040101@dev.mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com> <20061219131625.GE30743@mellanox.co.il> <45882070.8040101@dev.mellanox.co.il>
Message-ID: <20061219183522.GC8163@mellanox.co.il>

Quoting r. Yevgeny Kliteynik :
Subject: Re: [openib-general] tavor quirks etc (opensm compliance etc)

Michael,

> Michael S. Tsirkin wrote:
> >> The problems i see with the current approach are:
> >>
> >> 1) there are three patches
> >
> > Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> > It is not 100% but the only work around for proprietary SMs.
> > Fixing the SA is a full solution. We (Mellanox) will work with SA vendors to
> > get this addressed. But of course this takes time.
> >
> >> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
> >> since it assumes the opensm-tavor-quirk and it would not work with
> >> opensm that does not have it nor with 3rd party/commercial SMs which do
> >> not have similar quirk
> >
> > cma-tavor-quirk in OFED 1.1 is broken but not by design -
> > the patch I posted recently fixes the bug and should work with any compliant SM.
> > I did not look at the opensm code specifically, but the
> > "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> >
> >   MtuSelector 2 432 In a query request:
> >                     3-largest MTU available
> >                     If MTU is specified (i.e., the ComponentMask bit for
> >                     MTU is 1):
> >                     0-greater than MTU specified
> >                     1-less than MTU specified
> >                     2-exactly the MTU specified
> >
> > So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> > we should simply fix it. If there are other broken SMs, we can look at how they
> > are broken and how to solve this.
>
> OSM implementation in this case matches the IB spec.
> On page 905, table 207, there's an example of such a
> request:
>   Required MTU = 4 (2048)
>   Required MTUSelector = 1 ('less-than')
> And then it is explained that the required path records
> should have MTU of 1024 or lower.
>
> OSM implementation converts these rules to code AS IS.
>
> Now, what you're actually saying is that the specification
> in this case is bad. In our discussion, you said that if
> you request MTU of X with MTU selector of 'less-than', you
> want to also get any path records that support MTU greater
> than X, because they also support MTUs <= X.
> The question is, if your understanding of spec is right,
> what's the point of having 'less-than' selector at all?
> I mean, if selector says 'less-than', but you also accept
> MTUs that are 'equal' and 'greater-than', then it looks like
> you actually don't care about the MTU, because any MTU would
> be OK.
>

I believe you misrepresent what I am saying. I understand the spec in the
following way: if I set MTU selector to less than 1K, and there is a path
that can support MTU of 1/2 K, I expect it is legal for SM to select that
path and return it to me, setting the MTU selector to a value of 1/2K or
less. Whether that path *also* supports higher MTUs need not be relevant -
whether SM will prefer another path in this case is up to SM, but it is
clear that if there are paths that satisfy the request, it does not make
sense to fail the request because paths have more capabilities.

--Yevgeny

--
MST

From ggrundstrom at NetEffect.com Tue Dec 19 11:16:06 2006
From: ggrundstrom at NetEffect.com (Glenn Grundstrom)
Date: Tue, 19 Dec 2006 13:16:06 -0600
Subject: [openib-general] OFED release testing Task Force
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC0681FF42@venom2>

I will represent NetEffect.

Glenn Grundstrom.
________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi Sent: Tuesday, December 19, 2006 11:29 AM To: sweitzen at cisco.com; openfabrics-ewg at openib.org Cc: openib-general at openib.org Subject: Re: [openib-general] OFED release testing Task Force Thanks - I will send a consolidating e-mail to the task force people and will try to have the kick-off meeting in the 1st week of 2007. Nimrod Gindi Mellanox Technologies Ltd. mail: nimrodg at mellanox.com Cellular: +1-408-750-4801 Office: +1-347-342-0011 Fax: +1-212-987-0275 ----- Original Message ----- From: Scott Weitzenkamp (sweitzen) To: Nimrod Gindi; openfabrics-ewg at openib.org Cc: openib-general at openib.org Sent: Tue Dec 19 08:58:17 2006 Subject: RE: [openib-general] OFED release testing Task Force I can represent Cisco. Scott Weitzenkamp SQA and Release Manager Server Virtualization Business Unit Cisco Systems ________________________________ From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi Sent: Wednesday, November 22, 2006 10:30 AM To: openfabrics-ewg at openib.org Cc: openib-general at openib.org Subject: [openib-general] OFED release testing Task Force Hi, As a follow-up on the presentation prepared and presented by Amit Krig and myself in the OFA Meeting during SC06, I'm sending out this e-mail as a call for participation. The targets of the ad hoc task force will be (as agreed upon in the session we had): unify the test results formats, define release quality criteria, define/assign ULP verification owners and enhance interoperability fingerprint in the release process. We would like to have a participant from each contributing company and would appreciate any response sent to me with a name of a person from the company to attend and take action on behalf of this task force. BTW: I've also attached the presentation that was given in the OFA meeting.
Happy Holidays to everyone, Nimrod Gindi Mellanox Technologies Ltd. mail : nimrodg at mellanox.com Cell : +1-408-750-4801 Office: +1-347-342-0011 Fax : +1-212-987-0275 From kliteyn at dev.mellanox.co.il Tue Dec 19 11:35:16 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 21:35:16 +0200 Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t Message-ID: <45883EF4.1050705@dev.mellanox.co.il> Hi Hal Adding max_lid_ho field to osm_switch_t to allow routing engines that don't use lid matrices to explicitly set the max lid (in host order) that is reachable from the switch. Signed-off-by: Yevgeny Kliteynik --- osm/include/opensm/osm_switch.h | 37 +++++++++++++++++++++++++++++++++++++ osm/opensm/osm_switch.c | 2 ++ 2 files changed, 39 insertions(+), 0 deletions(-) diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h index 4570f61..d2089bd 100644 --- a/osm/include/opensm/osm_switch.h +++ b/osm/include/opensm/osm_switch.h @@ -107,6 +107,7 @@ typedef struct _osm_switch ib_switch_info_t switch_info; osm_fwd_tbl_t fwd_tbl; osm_lid_matrix_t lmx; + uint16_t max_lid_ho; osm_port_profile_t *p_prof; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; @@ -129,6 +130,9 @@ typedef struct _osm_switch * LID Matrix for this switch containing the hop count * to every LID from every port. * +* max_lid_ho +* Max LID that is accessible from this switch +* * p_pro * Pointer to array of Port Profile objects for this switch.
* @@ -793,6 +797,8 @@ static inline uint16_t osm_switch_get_max_lid_ho( IN const osm_switch_t* const p_sw ) { + if (p_sw->max_lid_ho != 0) + return p_sw->max_lid_ho; return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) ); } /* @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho( * SEE ALSO *********/ +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho +* NAME +* osm_switch_set_max_lid_ho +* +* DESCRIPTION +* Set the maximum LID (host order) value accessed from this switch +* SYNOPSIS +*/ +static inline void +osm_switch_set_max_lid_ho( + IN osm_switch_t* const p_sw, + IN uint16_t max_lid_ho ) +{ + p_sw->max_lid_ho = max_lid_ho; +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to a switch object. +* +* max_lid_ho +* Max LID (host order) value accessed from this switch +* +* RETURN VALUES +* None. +* +* NOTES +* +* SEE ALSO +*********/ + /****f* OpenSM: Switch/osm_switch_get_num_ports * NAME * osm_switch_get_num_ports diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c index 0dd3de5..4ca713a 100644 --- a/osm/opensm/osm_switch.c +++ b/osm/opensm/osm_switch.c @@ -122,6 +122,8 @@ osm_switch_init( for( port_num = 0; port_num < num_ports; port_num++ ) osm_port_prof_construct( &p_sw->p_prof[port_num] ); + p_sw->max_lid_ho = 0; + Exit: return( status ); } -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Tue Dec 19 11:37:29 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 21:37:29 +0200 Subject: [openib-general] [PATCH] osm: added an option for providing dump function per routing engine Message-ID: <45883F79.6090109@dev.mellanox.co.il> Hi Hal As you suggested, added an option for providing dump function per routing engine. 
Signed-off-by: Yevgeny Kliteynik --- osm/include/opensm/osm_opensm.h | 4 ++++ osm/opensm/osm_ucast_mgr.c | 23 ++++++++++++++--------- 2 files changed, 18 insertions(+), 9 deletions(-) diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h index 653c8ec..16fef37 100644 --- a/osm/include/opensm/osm_opensm.h +++ b/osm/include/opensm/osm_opensm.h @@ -104,6 +104,7 @@ struct osm_routing_engine { void *context; int (*build_lid_matrices)(void *context); int (*ucast_build_fwd_tables)(void *context); + void (*ucast_dump_tables)(void *context); void (*delete)(void *context); }; /* @@ -121,6 +122,9 @@ struct osm_routing_engine { * ucast_build_fwd_tables * The callback for unicast forwarding table generation. * +* ucast_dump_tables +* The callback for dumping unicast routing tables. +* * delete * The delete method, may be used for routing engine * internals cleanup. diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index e051c66..fcf6f72 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -149,7 +149,7 @@ ucast_mgr_dump(osm_ucast_mgr_t *p_mgr, F cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, func, &dump_context); } -static void +void ucast_mgr_dump_to_file(osm_ucast_mgr_t *p_mgr, const char *file_name, void (*func)(cl_map_item_t *, void *)) { @@ -350,7 +350,7 @@ ucast_mgr_dump_lid_matrix(cl_map_item_t /********************************************************************** **********************************************************************/ -static void +void ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt) { osm_switch_t* p_sw = (osm_switch_t *)p_map_item; @@ -1226,6 +1226,7 @@ osm_ucast_mgr_process( struct osm_routing_engine *p_routing_eng; osm_signal_t signal = OSM_SIGNAL_DONE; cl_qmap_t *p_sw_guid_tbl; + boolean_t default_routing = TRUE; OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process ); @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process( build and download the switch forwarding tables. 
*/ - if (!p_routing_eng->ucast_build_fwd_tables || - p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0) - { - cl_qmap_apply_func( p_sw_guid_tbl, - __osm_ucast_mgr_process_tbl, p_mgr ); - } + if ( p_routing_eng->ucast_build_fwd_tables && + (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) ) + default_routing = FALSE; + else + cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr ); /* dump fdb into file: */ if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) - __osm_ucast_mgr_dump_tables( p_mgr ); + { + if ( !default_routing && p_routing_eng->ucast_dump_tables ) + p_routing_eng->ucast_dump_tables(p_routing_eng->context); + else + __osm_ucast_mgr_dump_tables( p_mgr ); + } if (p_mgr->any_change) { -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Tue Dec 19 11:54:46 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 21:54:46 +0200 Subject: [openib-general] [PATCH] osm: Improving FatTree routing engi Message-ID: <45884386.5060106@dev.mellanox.co.il> Hi Hal. FatTree routing engine improvements: 1. Improved building of LFTs 2. Setting max lid on osm switches 3. Using ucast manager LFT dump function 4. Stopped using global variable 'osm' 5. Improved logging 6.
Some cosmetics Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 439 +++++++++++++++++++++++++++--------------- 1 files changed, 281 insertions(+), 158 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index 15e4cd0..0d7188a 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -57,9 +57,6 @@ #include #include -/* This var is predefined and initialized */ -extern osm_opensm_t osm; - /* * FatTree rank is bounded between 2 and 8: * - Tree of rank 1 has only trivial routing pathes, @@ -211,14 +208,16 @@ typedef struct ftree_hca_t_ { typedef struct ftree_fabric_t_ { - cl_qmap_t hca_tbl; - cl_qmap_t sw_tbl; - cl_qmap_t sw_by_tuple_tbl; - uint32_t tree_rank; - ftree_sw_t ** leaf_switches; - uint32_t leaf_switches_num; - uint16_t max_hcas_per_leaf; - cl_pool_t sw_fwd_tbl_pool; + osm_opensm_t * p_osm; + cl_qmap_t hca_tbl; + cl_qmap_t sw_tbl; + cl_qmap_t sw_by_tuple_tbl; + uint32_t tree_rank; + ftree_sw_t ** leaf_switches; + uint32_t leaf_switches_num; + uint16_t max_hcas_per_leaf; + cl_pool_t sw_fwd_tbl_pool; + uint16_t lft_max_lid_ho; } ftree_fabric_t; /*************************************************** @@ -506,6 +505,7 @@ __osm_ftree_port_group_destroy( static void __osm_ftree_port_group_dump( + IN ftree_fabric_t *p_ftree, IN ftree_port_group_t * p_group, IN ftree_direction_t direction) { @@ -517,7 +517,7 @@ __osm_ftree_port_group_dump( if (!p_group) return; - if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG)) return; size = cl_ptr_vector_get_size(&p_group->ports); @@ -533,7 +533,7 @@ __osm_ftree_port_group_dump( sprintf(buff + strlen(buff), "%u", p_port->port_num); } - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_port_group_dump:" " Port Group of size %u, port(s): %s, direction: %s\n" " Local <--> Remote GUID (LID):" @@ -648,16 +648,17 @@ __osm_ftree_sw_destroy( static void 
__osm_ftree_sw_dump( + IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) { uint32_t i; if (!p_sw) return; - if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG)) return; - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_sw_dump: " "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n", __osm_ftree_tuple_to_str(p_sw->tuple), @@ -665,10 +666,14 @@ __osm_ftree_sw_dump( p_sw->down_port_groups_num, p_sw->up_port_groups_num); - for( i = 0; i < p_sw->down_port_groups_num; i++ ) - __osm_ftree_port_group_dump(p_sw->down_port_groups[i], FTREE_DIRECTION_DOWN); - for( i = 0; i < p_sw->up_port_groups_num; i++ ) - __osm_ftree_port_group_dump(p_sw->up_port_groups[i], FTREE_DIRECTION_UP); + for( i = 0; i < p_sw->down_port_groups_num; i++ ) + __osm_ftree_port_group_dump(p_ftree, + p_sw->down_port_groups[i], + FTREE_DIRECTION_DOWN); + for( i = 0; i < p_sw->up_port_groups_num; i++ ) + __osm_ftree_port_group_dump(p_ftree, + p_sw->up_port_groups[i], + FTREE_DIRECTION_UP); } /* __osm_ftree_sw_dump() */ @@ -823,23 +828,26 @@ __osm_ftree_hca_destroy( static void __osm_ftree_hca_dump( + IN ftree_fabric_t * p_ftree, IN ftree_hca_t * p_hca) { uint32_t i; if (!p_hca) return; - if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG)) return; - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_hca_dump: " "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), p_hca->up_port_groups_num); for( i = 0; i < p_hca->up_port_groups_num; i++ ) - __osm_ftree_port_group_dump(p_hca->up_port_groups[i],FTREE_DIRECTION_UP); + __osm_ftree_port_group_dump(p_ftree, + p_hca->up_port_groups[i], + FTREE_DIRECTION_UP); } /***************************************************/ @@ -1050,6 +1058,10 @@ __osm_ftree_fabric_add_sw(ftree_fabric_t 
cl_qmap_insert(&p_ftree->sw_tbl, p_osm_sw->p_node->node_info.node_guid, &p_sw->map_item); + + /* track the max lid (in host order) that exists in the fabric */ + if (cl_ntoh16(p_sw->base_lid) > p_ftree->lft_max_lid_ho) + p_ftree->lft_max_lid_ho = cl_ntoh16(p_sw->base_lid); } /***************************************************/ @@ -1096,38 +1108,38 @@ __osm_ftree_fabric_dump(ftree_fabric_t * ftree_hca_t * p_hca; ftree_sw_t * p_sw; - if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG)) + if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG)) return; - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n" + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n" " |-------------------------------|\n" " |- Full fabric topology dump -|\n" " |-------------------------------|\n\n"); - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_dump: -- HCAs:\n"); for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl); p_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item) ) { - __osm_ftree_hca_dump(p_hca); + __osm_ftree_hca_dump(p_ftree, p_hca); } for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++) { - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_dump: -- Rank %u switches\n", i); for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl); p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) ) { if (p_sw->rank == i) - __osm_ftree_sw_dump(p_sw); + __osm_ftree_sw_dump(p_ftree, p_sw); } } - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n" + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n" " |---------------------------------------|\n" " |- Full fabric topology dump completed -|\n" " |---------------------------------------|\n\n"); @@ -1143,16 +1155,18 @@ __osm_ftree_fabric_dump_general_info( ftree_sw_t 
* p_sw; char * addition_str; - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info:\n"); - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " "General fabric topology info\n"); - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " "============================\n"); - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " " - FatTree rank (switches only): %u\n", p_ftree->tree_rank); - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " " - Fabric has %u HCAs, %u switches\n", cl_qmap_count(&p_ftree->hca_tbl), cl_qmap_count(&p_ftree->sw_tbl)); @@ -1174,13 +1188,15 @@ __osm_ftree_fabric_dump_general_info( addition_str = " (leaf) "; else addition_str = " "; - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: " - " - Fabric has %u rank %u%sswitches\n",j,i,addition_str); + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_dump_general_info: " + " - Fabric has %u rank %u%sswitches\n", + j,i,addition_str); } - if (osm_log_is_active(&osm.log,OSM_LOG_VERBOSE)) + if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_VERBOSE)) { - osm_log(&osm.log, OSM_LOG_VERBOSE, + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_dump_general_info: " " - Root switches:\n"); for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); @@ -1188,7 +1204,7 @@ __osm_ftree_fabric_dump_general_info( p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) ) { if (p_sw->rank == 0) - osm_log(&osm.log, OSM_LOG_VERBOSE, + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_dump_general_info: " " GUID: 
0x%016" PRIx64 ", LID: 0x%x, Index %s\n", cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), @@ -1196,15 +1212,17 @@ __osm_ftree_fabric_dump_general_info( __osm_ftree_tuple_to_str(p_sw->tuple)); } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_dump_general_info: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_dump_general_info: " " - Leaf switches (sorted by index):\n"); for (i = 0; i < p_ftree->leaf_switches_num; i++) { - osm_log(&osm.log, OSM_LOG_VERBOSE, + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_dump_general_info: " " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n", cl_ntoh64(osm_node_get_node_guid( - osm_switch_get_node_ptr(p_ftree->leaf_switches[i]->p_osm_sw))), + osm_switch_get_node_ptr( + p_ftree->leaf_switches[i]->p_osm_sw))), cl_ntoh16(p_ftree->leaf_switches[i]->base_lid), __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple)); } @@ -1229,15 +1247,15 @@ __osm_ftree_fabric_dump_hca_ordering( char * filename = "osm-ftree-ca-order.dump"; snprintf(path, sizeof(path), "%s/%s", - osm.subn.opt.dump_files_dir, filename); + p_ftree->p_osm->subn.opt.dump_files_dir, filename); p_hca_ordering_file = fopen(path, "w"); if (!p_hca_ordering_file) { - osm_log(&osm.log, OSM_LOG_ERROR, + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: " "cannot open file \'%s\': %s\n", filename, strerror(errno)); - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return; } @@ -1383,9 +1401,9 @@ __osm_ftree_fabric_make_indexing( cl_list_t bfs_list; ftree_sw_tbl_element_t * p_sw_tbl_element; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_make_indexing); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_make_indexing); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: " "Starting FatTree indexing\n"); /* create array of leaf switches 
*/ @@ -1411,8 +1429,8 @@ __osm_ftree_fabric_make_indexing( This fuction also adds the switch it into the switch_by_tuple table. */ __osm_ftree_fabric_assign_first_tuple(p_ftree,p_sw); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: " - "Indexing starting point:\n" + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_make_indexing: Indexing starting point:\n" " - Switch rank : %u\n" " - Switch index : %s\n" " - Node LID : 0x%x\n" @@ -1537,7 +1555,7 @@ __osm_ftree_fabric_make_indexing( sizeof(ftree_sw_t *), /* size of each element */ __osm_ftree_compare_switches_by_index); /* comparator */ - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); } /* __osm_ftree_fabric_make_indexing() */ /***************************************************/ @@ -1555,15 +1573,17 @@ __osm_ftree_fabric_validate_topology( boolean_t res = TRUE; uint8_t i; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_validate_topology); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_validate_topology); - osm_log(&osm.log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_validate_topology: " "Validating fabric topology\n"); reference_sw_arr = (ftree_sw_t **)malloc(tree_rank * sizeof(ftree_sw_t *)); if ( reference_sw_arr == NULL ) { - osm_log(&osm.log, OSM_LOG_SYS,"Fat-tree routing: Memory allocation failed\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fat-tree routing: Memory allocation failed\n"); return FALSE; } memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *)); @@ -1587,7 +1607,8 @@ __osm_ftree_fabric_validate_topology( if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != p_sw->up_port_groups_num ) { - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB09: Different number of upward port groups on switches:\n" " GUID 0x%016" 
PRIx64 ", LID 0x%x, Index %s - %u groups\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n", @@ -1607,7 +1628,8 @@ __osm_ftree_fabric_validate_topology( reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num ) { /* we're allowing some hca's to be missing */ - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB0A: Different number of downward port groups on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n", @@ -1631,7 +1653,8 @@ __osm_ftree_fabric_validate_topology( p_group = p_sw->up_port_groups[i]; if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports)) { - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB0B: Different number of ports in an upward port group on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", @@ -1658,7 +1681,8 @@ __osm_ftree_fabric_validate_topology( p_group = p_sw->down_port_groups[0]; if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports)) { - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB0C: Different number of ports in an downward port group on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", @@ -1679,14 +1703,16 @@ __osm_ftree_fabric_validate_topology( } /* end of while */ if (res == TRUE) - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_validate_topology: " - "Fabric topology has been identified as FatTree\n"); + 
osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_validate_topology: " + "Fabric topology has been identified as FatTree\n"); else - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " - "ERR AB0D: Fabric topology hasn't been identified as FatTree\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " + "ERR AB0D: Fabric topology hasn't been identified as FatTree\n"); free(reference_sw_arr); - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_fabric_validate_topology() */ @@ -1699,8 +1725,17 @@ __osm_ftree_set_sw_fwd_table( IN void *context) { ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item; - memcpy(osm.sm.ucast_mgr.lft_buf, p_sw->lft_buf, FTREE_FWD_TBL_LEN); - osm_ucast_mgr_set_fwd_table(&osm.sm.ucast_mgr,p_sw->p_osm_sw); + ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; + + /* calculate lft length rounded up to a multiple of 64 (block length) */ + uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64); + + osm_switch_set_max_lid_ho(p_sw->p_osm_sw, p_ftree->lft_max_lid_ho); + + memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, + p_sw->lft_buf, + lft_len); + osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } /*************************************************** @@ -1746,8 +1781,6 @@ __osm_ftree_fabric_route_upgoing_by_goin if (p_sw->down_port_groups_num == 0) return; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_upgoing_by_going_down); - /* foreach down-going port group (in indexing order) */ for (i = 0; i < p_sw->down_port_groups_num; i++) { @@ -1823,7 +1856,7 @@ __osm_ftree_fabric_route_upgoing_by_goin __osm_ftree_sw_set_fwd_table_block(p_remote_sw, cl_ntoh16(target_lid), p_min_port->remote_port_num); - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_upgoing_by_going_down: " "Switch %s: set path to HCA LID 0x%x through port %u\n", 
__osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -1855,7 +1888,6 @@ __osm_ftree_fabric_route_upgoing_by_goin } /* done scanning all the down-going port groups */ - OSM_LOG_EXIT(&(osm.log)); } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ /***************************************************/ @@ -1892,8 +1924,6 @@ __osm_ftree_fabric_route_downgoing_by_go /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_downgoing_by_going_up); - /* If this switch isn't a leaf switch: Assign upgoing ports by stepping down, starting on THIS switch. */ if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)) @@ -1909,10 +1939,7 @@ __osm_ftree_fabric_route_downgoing_by_go /* recursion stop condition - if it's a root switch, */ if (p_sw->rank == 0) - { - OSM_LOG_EXIT(&(osm.log)); return; - } /* Find the least loaded port of all the upgoing port groups (in indexing order of the remote switches). */ @@ -1982,7 +2009,7 @@ __osm_ftree_fabric_route_downgoing_by_go { if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) { - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n", (is_real_lid)? 
"real" : "DUMMY", @@ -2000,7 +2027,7 @@ __osm_ftree_fabric_route_downgoing_by_go cl_ntoh16(target_lid), p_min_port->remote_port_num); p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = p_min_port->remote_port_num; - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " "Switch %s: set path to HCA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -2020,10 +2047,7 @@ __osm_ftree_fabric_route_downgoing_by_go /* we're done for the third case */ if (!is_real_lid) - { - OSM_LOG_EXIT(&(osm.log)); return; - } /* What's left to do at this point: * @@ -2064,7 +2088,7 @@ __osm_ftree_fabric_route_downgoing_by_go if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) { - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " " - Routing SECONDARY path for LID 0x%x: %s --> %s\n", cl_ntoh16(target_lid), @@ -2087,7 +2111,6 @@ __osm_ftree_fabric_route_downgoing_by_go FALSE); /* whether this is path to HCA that should by tracked by counters */ } - OSM_LOG_EXIT(&(osm.log)); } /* ftree_fabric_route_downgoing_by_going_up() */ /***************************************************/ @@ -2114,7 +2137,7 @@ __osm_ftree_fabric_route_to_hcas( uint32_t j; ib_net16_t remote_lid; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_hcas); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_hcas); /* for each leaf switch (in indexing order) */ for(i = 0; i < p_ftree->leaf_switches_num; i++) @@ -2133,7 +2156,7 @@ __osm_ftree_fabric_route_to_hcas( __osm_ftree_sw_set_fwd_table_block(p_sw, cl_ntoh16(remote_lid), p_port->port_num); - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_to_hcas: " "Switch %s: set path to HCA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_sw->tuple), @@ -2154,7 +2177,7 @@ __osm_ftree_fabric_route_to_hcas( if 
(p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num) { - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " "Routing %u dummy HCAs\n", p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++) @@ -2171,7 +2194,7 @@ __osm_ftree_fabric_route_to_hcas( } } /* done going through all the leaf switches */ - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); } /* __osm_ftree_fabric_route_to_hcas() */ /***************************************************/ @@ -2195,7 +2218,7 @@ __osm_ftree_fabric_route_to_switches( ftree_sw_t * p_sw; ftree_sw_t * p_next_sw; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_switches); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_switches); p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) @@ -2208,7 +2231,8 @@ __osm_ftree_fabric_route_to_switches( cl_ntoh16(p_sw->base_lid), 0); - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_switches: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_to_switches: " "Switch %s (LID 0x%x): routing switch-to-switch pathes\n", __osm_ftree_tuple_to_str(p_sw->tuple), cl_ntoh16(p_sw->base_lid)); @@ -2222,7 +2246,7 @@ __osm_ftree_fabric_route_to_switches( FALSE); /* whether this path should by tracked by counters */ } - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); } /* __osm_ftree_fabric_route_to_switches() */ /*************************************************** @@ -2234,18 +2258,17 @@ __osm_ftree_fabric_populate_switches( { osm_switch_t * p_osm_sw; osm_switch_t * p_next_osm_sw; - osm_opensm_t * p_osm = &osm; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_switches); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_switches); - 
p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_osm->subn.sw_guid_tbl); - while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl) ) + p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_ftree->p_osm->subn.sw_guid_tbl); + while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_ftree->p_osm->subn.sw_guid_tbl) ) { p_osm_sw = p_next_osm_sw; p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item ); __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw); } - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return 0; } /* __osm_ftree_fabric_populate_switches() */ @@ -2258,12 +2281,11 @@ __osm_ftree_fabric_populate_hcas( { osm_node_t * p_osm_node; osm_node_t * p_next_osm_node; - osm_opensm_t * p_osm = &osm; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_hcas); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_hcas); - p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_osm->subn.node_guid_tbl); - while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_osm->subn.node_guid_tbl) ) + p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_ftree->p_osm->subn.node_guid_tbl); + while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_ftree->p_osm->subn.node_guid_tbl) ) { p_osm_node = p_next_osm_node; p_next_osm_node = (osm_node_t *)cl_qmap_next(&p_osm_node->map_item); @@ -2278,16 +2300,17 @@ __osm_ftree_fabric_populate_hcas( /* all the switches added separately */ break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_populate_hcas: ERR AB0E: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_populate_hcas: ERR AB0E: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(osm_node_get_node_guid(p_osm_node)), ib_get_node_type_str(osm_node_get_type(p_osm_node))); - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return -1; } } - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return 0; } /* __osm_ftree_fabric_populate_hcas() */ @@ -2372,7 +2395,7 @@ 
__osm_ftree_rank_switches_from_hca( static uint16_t i = 0; int res = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_rank_switches_from_hca); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_rank_switches_from_hca); for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++) { @@ -2388,7 +2411,8 @@ __osm_ftree_rank_switches_from_hca( { case IB_NODE_TYPE_CA: /* HCA connected directly to another HCA - not FatTree */ - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB0F: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_rank_switches_from_hca: ERR AB0F: " "HCA conected directly to another HCA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), @@ -2405,7 +2429,8 @@ __osm_ftree_rank_switches_from_hca( break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB10: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_rank_switches_from_hca: ERR AB10: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)), ib_get_node_type_str(osm_node_get_type(p_remote_osm_node))); @@ -2423,7 +2448,8 @@ __osm_ftree_rank_switches_from_hca( if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0) continue; - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_rank_switches_from_hca: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_rank_switches_from_hca: " "Marking rank of switch that is directly connected to HCA:\n" " - HCA guid : 0x%016" PRIx64 "\n" " - Switch guid: 0x%016" PRIx64 "\n" @@ -2435,7 +2461,7 @@ __osm_ftree_rank_switches_from_hca( } Exit: - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_rank_switches_from_hca() */ @@ -2495,7 +2521,8 @@ __osm_ftree_fabric_construct_hca_ports( case IB_NODE_TYPE_CA: /* HCA connected directly to another HCA - not FatTree */ - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB11: " + 
osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_construct_hca_ports: ERR AB11: " "HCA conected directly to another HCA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_node)), @@ -2508,7 +2535,8 @@ __osm_ftree_fabric_construct_hca_ports( break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB12: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_construct_hca_ports: ERR AB12: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(remote_node_guid), ib_get_node_type_str(remote_node_type)); @@ -2625,7 +2653,8 @@ __osm_ftree_fabric_construct_sw_ports( break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_sw_ports: ERR AB13: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_construct_sw_ports: ERR AB13: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(remote_node_guid), ib_get_node_type_str(remote_node_type)); @@ -2646,6 +2675,10 @@ __osm_ftree_fabric_construct_sw_ports( remote_node_type, /* remote node type */ p_remote_hca_or_sw, /* remote ftree_hca/sw object */ direction); /* port direction (up or down) */ + + /* Track the max lid (in host order) that exists in the fabric */ + if (cl_ntoh16(remote_base_lid) > p_ftree->lft_max_lid_ho) + p_ftree->lft_max_lid_ho = cl_ntoh16(remote_base_lid); } Exit: @@ -2665,7 +2698,7 @@ __osm_ftree_fabric_perform_ranking( ftree_hca_t * p_next_hca; int res = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_perform_ranking); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking); /* Mark REVERSED rank of all the switches in the subnet. 
Start from switches that are connected to hca's, and @@ -2678,7 +2711,8 @@ __osm_ftree_fabric_perform_ranking( if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0) { res = -1; - osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB14: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_perform_ranking: ERR AB14: " "Subnet ranking failed - subnet is not FatTree"); goto Exit; } @@ -2686,7 +2720,8 @@ __osm_ftree_fabric_perform_ranking( /* calculate and set FatTree rank */ __osm_ftree_fabric_calculate_rank(p_ftree); - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_perform_ranking: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_perform_ranking: " "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree)); /* fix ranking of the switches by reversing the ranking direction */ @@ -2695,7 +2730,8 @@ __osm_ftree_fabric_perform_ranking( if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK ) { - osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_perform_ranking: ERR AB15: " "Tree rank is %u (should be between %u and %u)\n", __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK, @@ -2705,7 +2741,7 @@ __osm_ftree_fabric_perform_ranking( } Exit: - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_fabric_perform_ranking() */ @@ -2722,7 +2758,7 @@ __osm_ftree_fabric_populate_ports( ftree_sw_t * p_next_sw; int res = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_ports); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_ports); p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) ) @@ -2748,7 +2784,7 @@ __osm_ftree_fabric_populate_ports( } } Exit: - OSM_LOG_EXIT(&(osm.log)); + 
OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_fabric_populate_ports() */ @@ -2756,58 +2792,61 @@ __osm_ftree_fabric_populate_ports( ***************************************************/ static int -__osm_ftree_do_routing(void *context) +__osm_ftree_construct_fabric( + IN void * context) { ftree_fabric_t * p_ftree = context; int status = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_do_routing); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_construct_fabric); - if ( cl_qmap_count(&osm.subn.sw_guid_tbl) < 2 ) + if ( cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl) < 2 ) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric has %u switches - topology is not fat-tree.\n" "Falling back to default routing.\n", - cl_qmap_count(&osm.subn.sw_guid_tbl)); + cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)); status = -1; goto Exit; } - if ( (cl_qmap_count(&osm.subn.node_guid_tbl) - - cl_qmap_count(&osm.subn.sw_guid_tbl)) < 2) + if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) - + cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)) < 2) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric has %u nodes (%u switches) - topology is not fat-tree.\n" "Falling back to default routing.\n", - cl_qmap_count(&osm.subn.node_guid_tbl), - cl_qmap_count(&osm.subn.sw_guid_tbl)); + cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl), + cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)); status = -1; goto Exit; } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n" - " |------------------------------|\n" - " |- Starting FatTree Routing -|\n" - " |------------------------------|\n\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_construct_fabric: \n" + " |----------------------------------------|\n" + " |- Starting FatTree fabric construction -|\n" + " |----------------------------------------|\n\n"); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + 
osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " "Populating FatTree switch table\n"); /* ToDo: now that the pointer from node to switch exists, no need to fill the switch table in a separate loop */ if (__osm_ftree_fabric_populate_switches(p_ftree) != 0) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not fat-tree - " "falling back to default routing\n"); status = -1; goto Exit; } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " "Populating FatTree HCA table\n"); if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not fat-tree - " "falling back to default routing\n"); status = -1; @@ -2816,7 +2855,7 @@ __osm_ftree_do_routing(void *context) if (cl_qmap_count(&p_ftree->hca_tbl) < 2) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric has %u HCAa - topology is not fat-tree.\n" "Falling back to default routing.\n", cl_qmap_count(&p_ftree->hca_tbl)); @@ -2825,12 +2864,13 @@ __osm_ftree_do_routing(void *context) } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " - "Ranking FatTree\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: Ranking FatTree\n"); + if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0) { if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK) - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric rank is %u (>%u) - " "fat-tree routing falls back to default routing\n", __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK); @@ -2841,11 +2881,12 @@ __osm_ftree_do_routing(void *context) /* For each hca and switch, construct array of ports. 
This is done after the whole FatTree data structure is ready, because we want the ports to have pointers to ftree_{sw,hca}_t objects.*/ - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " "Populating HCA & switch ports\n"); if (__osm_ftree_fabric_populate_ports(p_ftree) != 0) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not a fat-tree - " "routing falls back to default routing\n"); status = -1; @@ -2863,7 +2904,7 @@ __osm_ftree_do_routing(void *context) __osm_ftree_fabric_dump_general_info(p_ftree); /* dump full tree topology */ - if (osm_log_is_active(&osm.log, OSM_LOG_DEBUG)) + if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG)) __osm_ftree_fabric_dump(p_ftree); if (! __osm_ftree_fabric_validate_topology(p_ftree)) @@ -2872,46 +2913,118 @@ __osm_ftree_do_routing(void *context) goto Exit; } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " + "Max LID in switch LFTs (in host order): 0x%x\n", + p_ftree->lft_max_lid_ho); + + Exit: + if (status != 0) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " + "Clearing FatTree Fabric data structures\n"); + __osm_ftree_fabric_clear(p_ftree); + } + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: \n" + " |--------------------------------------------------|\n" + " |- Done constructing FatTree fabric (status = %d) -|\n" + " |--------------------------------------------------|\n\n", + status); + + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return status; +} + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_do_routing( + IN void * context) +{ + ftree_fabric_t * p_ftree = context; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, 
__osm_ftree_do_routing); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Starting FatTree routing\n"); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " "Filling switch forwarding tables for routes to HCAs\n"); __osm_ftree_fabric_route_to_hcas(p_ftree); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " "Filling switch forwarding tables for switch-to-switch pathes\n"); __osm_ftree_fabric_route_to_switches(p_ftree); /* for each switch, set its fwd table */ - cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, NULL); + cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, (void *)p_ftree); /* write out hca ordering file */ __osm_ftree_fabric_dump_hca_ordering(p_ftree); - Exit: - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " - "Clearing FatTree Fabric data structures\n"); - __osm_ftree_fabric_clear(p_ftree); + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "FatTree routing is done\n"); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n" - " |---------------------------------------|\n" - " |- Done FatTree Routing (status = %d) -|\n" - " |---------------------------------------|\n\n", status); + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return 0; +} - OSM_LOG_EXIT(&(osm.log)); - return status; +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_routing( + IN void * context) +{ + int status = __osm_ftree_construct_fabric(context); + if (status != 0) + return status; + + __osm_ftree_do_routing(context); + return 0; } /*************************************************** ***************************************************/ +void +ucast_mgr_dump_to_file( + IN osm_ucast_mgr_t *p_mgr, + IN const char *file_name, + IN void (*func)(cl_map_item_t *, void *)); + +void 
+ucast_mgr_dump_lfts( + IN cl_map_item_t *p_map_item, + void *cxt); + static void -__osm_ftree_delete(void * context) +__osm_ftree_dump_tables( + IN void * context) { - ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; + ftree_fabric_t * p_ftree = context; if (!p_ftree) return; - __osm_ftree_fabric_destroy(p_ftree); + ucast_mgr_dump_to_file(&p_ftree->p_osm->sm.ucast_mgr, + "opensm-lfts.dump", + ucast_mgr_dump_lfts); +} +/*************************************************** + ***************************************************/ + +static void +__osm_ftree_delete( + IN void * context) +{ + if (!context) + return; + __osm_ftree_fabric_destroy((ftree_fabric_t *)context); } /*************************************************** @@ -2923,11 +3036,21 @@ int osm_ucast_ftree_setup(osm_opensm_t * if (!p_ftree) return -1; + p_ftree->p_osm = p_osm; + p_osm->routing_engine.context = (void *)p_ftree; - p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing; + p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_routing; + /* ToDo: Resolve multicast routing. + * Until then lid matrices are built, despite the + * fact that FatTree routing doesn't need them. 
+ * When the multicast routing will be resolved, + * __osm_ftree_routing() function should be removed, + * and here's how the FatTree routing will be set: + * p_osm->routing_engine.build_lid_matrices = __osm_ftree_construct_fabric; + * p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing; + */ + p_osm->routing_engine.ucast_dump_tables = __osm_ftree_dump_tables; p_osm->routing_engine.delete = __osm_ftree_delete; - /* ToDo: fat-tree routing doesn't use min_hop tables, so we - shouldn't fill them (p_osm->routing_engine.build_lid_matrices) */ return 0; } -- 1.4.4.1.GIT From sashak at voltaire.com Tue Dec 19 12:30:44 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 19 Dec 2006 22:30:44 +0200 Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t In-Reply-To: <45883EF4.1050705@dev.mellanox.co.il> References: <45883EF4.1050705@dev.mellanox.co.il> Message-ID: <20061219203044.GE19795@sashak.voltaire.com> On 21:35 Tue 19 Dec , Yevgeny Kliteynik wrote: > Hi Hal > > Adding max_lid_ho field to osm_switch_t to allow routing > engines that don't use lid matrices to explicitly set the > max lid (in host order) that is reachable from the switch. > > Signed-off-by: Yevgeny Kliteynik > --- Looks fine. Small comments below. > @@ -793,6 +797,8 @@ static inline uint16_t > osm_switch_get_max_lid_ho( > IN const osm_switch_t* const p_sw ) > { > + if (p_sw->max_lid_ho != 0) > + return p_sw->max_lid_ho; > return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) ); > } What do you think about to remove osm_lid_matrix_get_max_lid_ho() at all? 
> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho( > * SEE ALSO > *********/ > > +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho > +* NAME > +* osm_switch_set_max_lid_ho > +* > +* DESCRIPTION > +* Set the maximum LID (host order) value accessed from this switch > +* SYNOPSIS > +*/ > +static inline void > +osm_switch_set_max_lid_ho( > + IN osm_switch_t* const p_sw, > + IN uint16_t max_lid_ho ) > +{ > + p_sw->max_lid_ho = max_lid_ho; > +} > +/* > +* PARAMETERS > +* p_sw > +* [in] Pointer to a switch object. > +* > +* max_lid_ho > +* Max LID (host order) value accessed from this switch > +* > +* RETURN VALUES > +* None. > +* > +* NOTES > +* > +* SEE ALSO > +*********/ > + Do we need those +31 lines of code instead of just p_sw->max_lid_ho = N; ? Sasha From mst at mellanox.co.il Tue Dec 19 13:04:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 19 Dec 2006 23:04:54 +0200 Subject: [openib-general] tavor quirks etc (opensm compliance etc) In-Reply-To: <45882070.8040101@dev.mellanox.co.il> References: <4587DD0B.1030403@voltaire.com> <20061219131625.GE30743@mellanox.co.il> <45882070.8040101@dev.mellanox.co.il> Message-ID: <20061219210454.GB9321@mellanox.co.il> > Michael S. Tsirkin wrote: > >> The problems i see with the current approach are: > >> > >> 1) there are three patches > > > > Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch. > > It is not 100% but the only work around for proprietary SMs. > > Fixing the SA is a full solution. We (Mellanox) will work with SA vendors to > > get this addressed. But of course this takes time. > > > >> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design > >> since it assumes the opensm-tavor-quirk and it would not work with > >> opensm that does not have it nor with 3rd party/commercial SMs which do > >> not have similar quirk > > > > cma-tavor-quirk in OFED 1.1 is broken but not by design - > > the patch I posted recently fixes the bug and should work with any compliant SM. 
> > I did not look at the opensm code specifically, but the > > "15.2.5.16 PATHRECORD" is quite explicit in its requirements: > > > > MtuSelector 2 432 In a query request: > > 3-largest MTU available > > If MTU is specified (i.e., the ComponentMask bit for > > MTU is 1): > > 0-greater than MTU specified > > 1-less than MTU specified > > 2-exactly the MTU specified > > > > So if e.g. opensm does not comply (e.g. it is not returning a path where one exists) > > we should simply fix it. If there are other broken SMs, we can look at how they > > are broken and how to solve this. > > OSM implementation in this case matches the IB spec. > On page 905, table 207, there's an example of such a > request: > Required MTU = 4 (2048) > Required MTUSelector = 1 ('less-than') > And then it is explained that the required path records > should have MTU of 1024 or lower. > > OSM implementation converts these rules to code AS IS. In this example, since everyone must support a 2K MTU, will opensm return a path, or fail the query? If it fails the query it seems opensm violates the spec and needs to be fixed. And of course the MTU in path record query response must be 1K or lower. -- MST From kliteyn at dev.mellanox.co.il Tue Dec 19 13:03:28 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 23:03:28 +0200 Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t In-Reply-To: <20061219203044.GE19795@sashak.voltaire.com> References: <45883EF4.1050705@dev.mellanox.co.il> <20061219203044.GE19795@sashak.voltaire.com> Message-ID: <458853A0.9060909@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 21:35 Tue 19 Dec , Yevgeny Kliteynik wrote: >> Hi Hal >> >> Adding max_lid_ho field to osm_switch_t to allow routing >> engines that don't use lid matrices to explicitly set the >> max lid (in host order) that is reachable from the switch. >> >> Signed-off-by: Yevgeny Kliteynik >> --- > > Looks fine. Small comments below. 
> >> @@ -793,6 +797,8 @@ static inline uint16_t >> osm_switch_get_max_lid_ho( >> IN const osm_switch_t* const p_sw ) >> { >> + if (p_sw->max_lid_ho != 0) >> + return p_sw->max_lid_ho; >> return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) ); >> } > > What do you think about to remove osm_lid_matrix_get_max_lid_ho() at > all? Basically, I have no objection to this. We just have to update the switch.max_lid_ho in the default, updn and file routings. >> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho( >> * SEE ALSO >> *********/ >> >> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho >> +* NAME >> +* osm_switch_set_max_lid_ho >> +* >> +* DESCRIPTION >> +* Set the maximum LID (host order) value accessed from this switch >> +* SYNOPSIS >> +*/ >> +static inline void >> +osm_switch_set_max_lid_ho( >> + IN osm_switch_t* const p_sw, >> + IN uint16_t max_lid_ho ) >> +{ >> + p_sw->max_lid_ho = max_lid_ho; >> +} >> +/* >> +* PARAMETERS >> +* p_sw >> +* [in] Pointer to a switch object. >> +* >> +* max_lid_ho >> +* Max LID (host order) value accessed from this switch >> +* >> +* RETURN VALUES >> +* None. >> +* >> +* NOTES >> +* >> +* SEE ALSO >> +*********/ >> + > > Do we need those +31 lines of code instead of just > p_sw->max_lid_ho = N; ? Since there are access functions for the rest of the fields, I didn't want to make an exception in this case either. -- Yevgeny. 
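For readers following the thread, the getter fallback and the setter under discussion can be sketched in isolation. This is only an illustration under stated assumptions, not the OpenSM code: sketch_switch_t, its matrix_max_lid field and sketch_matrix_max_lid() are hypothetical stand-ins for osm_switch_t and osm_lid_matrix_get_max_lid_ho(). Only the field name max_lid_ho and the "zero means not set" convention are taken from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for osm_switch_t; only max_lid_ho mirrors the patch. */
typedef struct sketch_switch {
	uint16_t matrix_max_lid;	/* assumed: value a LID matrix would report */
	uint16_t max_lid_ho;		/* explicit value; 0 means "not set" */
} sketch_switch_t;

/* Hypothetical stand-in for osm_lid_matrix_get_max_lid_ho(). */
static inline uint16_t
sketch_matrix_max_lid(const sketch_switch_t *p_sw)
{
	return p_sw->matrix_max_lid;
}

/* Setter in the style of the proposed osm_switch_set_max_lid_ho(). */
static inline void
sketch_set_max_lid_ho(sketch_switch_t *p_sw, uint16_t max_lid_ho)
{
	p_sw->max_lid_ho = max_lid_ho;
}

/* Getter: prefer the explicitly set value, else fall back to the matrix. */
static inline uint16_t
sketch_get_max_lid_ho(const sketch_switch_t *p_sw)
{
	if (p_sw->max_lid_ho != 0)
		return p_sw->max_lid_ho;
	return sketch_matrix_max_lid(p_sw);
}
```

With this shape, a routing engine that never fills the LID matrix would call the setter once, and every other routing engine keeps the matrix-derived value. Dropping the fallback, as suggested in the thread, would require every routing engine to set the field itself.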
> Sasha > From kliteyn at dev.mellanox.co.il Tue Dec 19 13:43:48 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 19 Dec 2006 23:43:48 +0200 Subject: [openib-general] [PATCH] osm: Added FatTree routing to the osm manual Message-ID: <45885D14.4090200@dev.mellanox.co.il> Added FatTree routing to the osm manual Signed-off-by: Yevgeny Kliteynik --- osm/man/opensm.8 | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/osm/man/opensm.8 b/osm/man/opensm.8 index 316232d..225918d 100644 --- a/osm/man/opensm.8 +++ b/osm/man/opensm.8 @@ -391,7 +391,7 @@ Examples: .SH ROUTING .PP -OpenSM offers two routing engines: +OpenSM offers three routing engines: 1. Min Hop Algorithm - based on the minimum hops to each node where the path length is optimized. @@ -401,6 +401,12 @@ node, but it is constrained to ranking r if the subnet is not a pure Fat Tree, and deadlock may occur due to a loop in the subnet. +3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing +for congestion-free "shift" communication pattern. +It should be chosen if a subnet is a symmetrical Fat Trees of various types, +not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio. +Similar to UPDN, Fat Tree routing is constrained to ranking rules. + OpenSM also supports a file method which can load routes from a table. See \'Modular Routing Engine\' for more information on this. -- 1.4.4.1.GIT From robert.j.woodruff at intel.com Tue Dec 19 14:26:38 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Tue, 19 Dec 2006 14:26:38 -0800 Subject: [openib-general] OFED 1.2 git tree Message-ID: Hi Tziporet, I took a look at the OFED 1.2 git tree, daily builds and how to wiki. As for the OFED 1.2 tree, I was able to clone it and get it running with the OFA userspace git tree, although I did it manually as the build scripts for OFED are not intuitively obvious on how to use. 
I did look at the How to build OFED 1.2 Wiki and I think it could use
a bit more work, as I was not able to take the daily build tar balls
and easily make a dist tar ball from them, so a little more exact
step by step instructions on your wiki would be helpful.

As we discussed yesterday in the OFED meeting, for the OFED 1.2 kernel
(based on 2.6.20-rc1) you have to check out Sean and Arlin's rdma_cm
branch of the userspace code if you are not doing so already. I also
updated the how to check out OFA code from git, on the wiki.

https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories

As for the local_sa cache and multicast branches of Sean's trees, he is
still based on 2.6.19. I took a quick look at trying to port this to
the OFED_1.2 tree based on 2.6.20-rc1 and it looks like it needs a few
more changes than I want to deal with this week while he is out on
vacation. Probably best to wait for his return to port the code up to
the 2.6.20-rc1 code base, but I see no problem with getting this ready
for the Jan 30 feature freeze date.

woody

From halr at voltaire.com Tue Dec 19 14:27:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 17:27:32 -0500
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t
In-Reply-To: <45883EF4.1050705@dev.mellanox.co.il>
References: <45883EF4.1050705@dev.mellanox.co.il>
Message-ID: <1166567251.4519.442.camel@hal.voltaire.com>

Hi Yevgeny,

On Tue, 2006-12-19 at 14:35, Yevgeny Kliteynik wrote:
> Hi Hal
>
> Adding max_lid_ho field to osm_switch_t to allow routing
> engines that don't use lid matrices to explicitly set the
> max lid (in host order) that is reachable from the switch.

One minor comment below.
> Signed-off-by: Yevgeny Kliteynik > --- > osm/include/opensm/osm_switch.h | 37 +++++++++++++++++++++++++++++++++++++ > osm/opensm/osm_switch.c | 2 ++ > 2 files changed, 39 insertions(+), 0 deletions(-) > > diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h > index 4570f61..d2089bd 100644 > --- a/osm/include/opensm/osm_switch.h > +++ b/osm/include/opensm/osm_switch.h > @@ -107,6 +107,7 @@ typedef struct _osm_switch > ib_switch_info_t switch_info; > osm_fwd_tbl_t fwd_tbl; > osm_lid_matrix_t lmx; > + uint16_t max_lid_ho; > osm_port_profile_t *p_prof; > osm_mcast_tbl_t mcast_tbl; > uint32_t discovery_count; > @@ -129,6 +130,9 @@ typedef struct _osm_switch > * LID Matrix for this switch containing the hop count > * to every LID from every port. > * > +* max_lid_ho > +* Max LID that is accessible from this switch > +* > * p_pro > * Pointer to array of Port Profile objects for this switch. > * > @@ -793,6 +797,8 @@ static inline uint16_t > osm_switch_get_max_lid_ho( > IN const osm_switch_t* const p_sw ) > { > + if (p_sw->max_lid_ho != 0) > + return p_sw->max_lid_ho; > return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) ); > } > /* > @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho( > * SEE ALSO > *********/ > > +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho > +* NAME > +* osm_switch_set_max_lid_ho > +* > +* DESCRIPTION > +* Set the maximum LID (host order) value accessed from this switch > +* SYNOPSIS > +*/ > +static inline void > +osm_switch_set_max_lid_ho( > + IN osm_switch_t* const p_sw, > + IN uint16_t max_lid_ho ) > +{ > + p_sw->max_lid_ho = max_lid_ho; > +} > +/* > +* PARAMETERS > +* p_sw > +* [in] Pointer to a switch object. > +* > +* max_lid_ho > +* Max LID (host order) value accessed from this switch > +* > +* RETURN VALUES > +* None. 
> +* > +* NOTES > +* > +* SEE ALSO > +*********/ > + > /****f* OpenSM: Switch/osm_switch_get_num_ports > * NAME > * osm_switch_get_num_ports > diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c > index 0dd3de5..4ca713a 100644 > --- a/osm/opensm/osm_switch.c > +++ b/osm/opensm/osm_switch.c > @@ -122,6 +122,8 @@ osm_switch_init( > for( port_num = 0; port_num < num_ports; port_num++ ) > osm_port_prof_construct( &p_sw->p_prof[port_num] ); > > + p_sw->max_lid_ho = 0; This isn't really needed, is it ? Doesn't osm_switch_construct clear this ? -- Hal > + > Exit: > return( status ); > } From Ashish.Batwara at lsi.com Tue Dec 19 14:43:07 2006 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Tue, 19 Dec 2006 15:43:07 -0700 Subject: [openib-general] opensm Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159CEA@NAMAIL2.ad.lsil.com> Hi, Here is the info that you have asked. I am seeing the Subnet manager is up now having the port active. But server is not able to discover the target. I am seeing the error "Got failed path rec status -110" on Linux console. Below are the output of different commands. 
I am using following to discover the target:

/etc/init.d/opensmd start
/etc/init.d/openibd start
modprobe ib_srp
echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target

[root at p49 ~]# ibv_devinfo
hca_id: mthca0
    fw_ver: 5.1.400
    node_guid: 0002:c902:0022:cce0
    sys_image_guid: 0002:c902:0022:cce3
    vendor_id: 0x02c9
    vendor_part_id: 25218
    hw_ver: 0xA0
    board_id: MT_0370130002
    phys_port_cnt: 2
        port: 1
            state: PORT_DOWN (1)
            max_mtu: 2048 (4)
            active_mtu: 512 (2)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00

        port: 2
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 1
            port_lid: 1
            port_lmc: 0x00

hca_id: mthca1
    fw_ver: 5.1.400
    node_guid: 0002:c902:0022:cd2c
    sys_image_guid: 0002:c902:0022:cd2f
    vendor_id: 0x02c9
    vendor_part_id: 25218
    hw_ver: 0xA0
    board_id: MT_0370130002
    phys_port_cnt: 2
        port: 1
            state: PORT_DOWN (1)
            max_mtu: 2048 (4)
            active_mtu: 512 (2)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00

        port: 2
            state: PORT_DOWN (1)
            max_mtu: 2048 (4)
            active_mtu: 512 (2)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00

[root at p49 ~]# uname -a
Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

[root at p49 ~]# cat /etc/infiniband/info
#!/bin/bash
echo prefix=/usr/local/ofed
echo Kernel=2.6.9-42.0.3.ELsmp
echo
echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
echo

OFED Version: OFED-1.1

Thanks
Ashish

-----Original Message-----
From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
Sent: Tuesday, December
19, 2006 5:18 AM
To: Batwara, Ashish
Cc: ishai at mellanox.co.il; openib-general at openib.org
Subject: Re: [openib-general] opensm

Hi Ashish,

SRP people say they have no such error message.
OpenSM does. So I take it back.

Ashish,
Please provide more info:

1. ibv_devinfo
2. Version of code you are using
3. Command line you use for starting opensm
4. /var/log/osm.log

Thanks and sorry for the confusion.

EZ

Eitan Zahavi wrote:
> This is not an OpenSM issue.
> Forwarded to the SRP people.
>
> EZ
> Batwara, Ashish wrote:
>
>> Hi,
>> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
>> connected to IB Switch. ibnodes command displays the information about
>> the Switch ports and HCA ports.
>> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
>> for all the 4 ports and immediately after I see "failed srp_daemon" for
>> all the ports and the displays "SM Port is down".
>>
>> I tried several times and even rebooted the server few times but no
>> luck.
>>
>> Does anybody know what this problem is?
>>
>> Thanks
>> Ashish

From sashak at voltaire.com Tue Dec 19 15:05:53 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 20 Dec 2006 01:05:53 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t
In-Reply-To: <458853A0.9060909@dev.mellanox.co.il>
References: <45883EF4.1050705@dev.mellanox.co.il> <20061219203044.GE19795@sashak.voltaire.com> <458853A0.9060909@dev.mellanox.co.il>
Message-ID: <20061219230553.GG19795@sashak.voltaire.com>

On 23:03 Tue 19 Dec , Yevgeny Kliteynik wrote:
>
> >> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
> >> * SEE ALSO
> >> *********/
> >>
> >> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
> >> +* NAME
> >> +* osm_switch_set_max_lid_ho
> >> +*
> >> +* DESCRIPTION
> >> +* Set the maximum LID (host order) value accessed from this switch
> >> +* SYNOPSIS
> >> +*/
> >> +static inline void
> >> +osm_switch_set_max_lid_ho(
> >> + IN osm_switch_t* const p_sw,
> >> + IN uint16_t max_lid_ho )
> >> +{
> >> + p_sw->max_lid_ho = max_lid_ho;
> >> +}
> >> +/*
> >> +* PARAMETERS
> >> +* p_sw
> >> +* [in] Pointer to a switch object.
> >> +*
> >> +* max_lid_ho
> >> +* Max LID (host order) value accessed from this switch
> >> +*
> >> +* RETURN VALUES
> >> +* None.
> >> +*
> >> +* NOTES
> >> +*
> >> +* SEE ALSO
> >> +*********/
> >> +
> >
> > Do we need those +31 lines of code instead of just
> > p_sw->max_lid_ho = N; ?
>
> Since there are access functions for the rest of the fields,
> I didn't want to make an exception in this case either.

I think you did anyway - there is no full set of access methods. I'm
perfectly fine with it. And I'm not asking you to clean up the rest,
just not to add new ones.
Sasha

From halr at voltaire.com Tue Dec 19 15:06:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 18:06:26 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159CEA@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159CEA@NAMAIL2.ad.lsil.com>
Message-ID: <1166569585.4519.2439.camel@hal.voltaire.com>

Ashish,

On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> Hi,
>
> Here is the info that you asked for. I am seeing that the Subnet manager
> is up now, with the port active. But the server is not able to discover
> the target. I am seeing the error "Got failed path rec status -110" on
> the Linux console.

That means the request for an SA PathRecord from the initiator to the
target failed (-110 is ETIMEDOUT). Are you sure the target is up
(ACTIVE) on the subnet? If it is, can you send the opensm log?

-- Hal

> Below is the output of different commands. I am using the following to
> discover the target:
>
> /etc/init.d/opensmd start
> /etc/init.d/openibd start
> modprobe ib_srp
> echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
>
> [root at p49 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:             5.1.400
>         node_guid:          0002:c902:0022:cce0
>         sys_image_guid:     0002:c902:0022:cce3
>         vendor_id:          0x02c9
>         vendor_part_id:     25218
>         hw_ver:             0xA0
>         board_id:           MT_0370130002
>         phys_port_cnt:      2
>                 port:   1
>                         state:          PORT_DOWN (1)
>                         max_mtu:        2048 (4)
>                         active_mtu:     512 (2)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         1
>                         port_lid:       1
>                         port_lmc:       0x00
> hca_id: mthca1
>         fw_ver:             5.1.400
>         node_guid:          0002:c902:0022:cd2c
>         sys_image_guid:     0002:c902:0022:cd2f
>         vendor_id:          0x02c9
>         vendor_part_id:     25218
>         hw_ver:             0xA0
>         board_id:           MT_0370130002
>         phys_port_cnt:      2
>                 port:   1
>                         state:          PORT_DOWN (1)
>                         max_mtu:        2048 (4)
>                         active_mtu:     512 (2)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_DOWN (1)
>                         max_mtu:        2048 (4)
>                         active_mtu:     512 (2)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
>
> [root at p49 ~]# uname -a
> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>
> [root at p49 ~]# cat /etc/infiniband/info
> #!/bin/bash
>
> echo prefix=/usr/local/ofed
> echo Kernel=2.6.9-42.0.3.ELsmp
> echo
> echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
> --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> echo
>
> OFED Version: OFED-1.1
>
> Thanks
> Ashish

From Ashish.Batwara at lsi.com Tue Dec 19 15:22:03 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Tue, 19 Dec 2006 16:22:03 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159CFD@NAMAIL2.ad.lsil.com>

Hi,

Please look towards the end of the attached file.
Thanks
Ashish

-------------- next part --------------
A non-text attachment was scrubbed...
Name: osm.log Type: application/octet-stream Size: 1846569 bytes Desc: osm.log URL: From Ashish.Batwara at lsi.com Tue Dec 19 17:12:18 2006 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Tue, 19 Dec 2006 18:12:18 -0700 Subject: [openib-general] opensm Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159D2F@NAMAIL2.ad.lsil.com> Logs from the end of the osm.log: Dec 19 15:48:26 984523 [43204960] -> SUBNET UP Dec 19 15:48:36 985477 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b1d) -- dropping Dec 19 15:48:36 985538 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:48:36 985560 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:48:36 985643 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b1d attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:48:36 985728 [42803960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:48:36 985754 [42803960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:48:36 986161 [42803960] -> SUBNET UP Dec 19 15:48:46 986814 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b22) -- dropping Dec 19 15:48:46 986868 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:48:46 986895 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:48:46 986935 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b22 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:48:46 987025 [41401960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:48:46 987050 [41401960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:48:46 987459 [41401960] -> SUBNET UP Dec 19 15:48:56 988475 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b27) -- dropping Dec 19 15:48:56 988536 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:48:56 988562 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:48:56 988601 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b27 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:48:56 988681 [41E02960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:48:56 988706 [41E02960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:48:56 989146 [41E02960] -> SUBNET UP Dec 19 15:49:06 990152 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b2c) -- dropping Dec 19 15:49:06 990209 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:49:06 990231 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:49:06 990292 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b2c attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:49:06 990375 [43204960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:49:06 990399 [43204960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:49:06 990815 [43204960] -> SUBNET UP Dec 19 15:49:16 991042 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b31) -- dropping Dec 19 15:49:16 991095 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:49:16 991122 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:49:16 991174 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b31 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:49:16 991281 [41401960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:49:16 991306 [41401960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:49:16 991719 [41401960] -> SUBNET UP Dec 19 15:49:26 992226 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b36) -- dropping Dec 19 15:49:26 992280 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:49:26 992306 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:49:26 992347 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b36 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:49:26 992442 [42803960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:49:26 992468 [42803960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:49:26 993031 [42803960] -> SUBNET UP Dec 19 15:49:36 995288 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b3b) -- dropping Dec 19 15:49:36 995341 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:49:36 995360 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:49:36 995428 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b3b attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:49:36 995515 [43204960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:49:36 995538 [43204960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:49:36 996077 [43204960] -> SUBNET UP Dec 19 15:49:46 995190 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b40) -- dropping Dec 19 15:49:46 995243 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:49:46 995265 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:49:46 995308 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b40 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:49:46 995383 [42803960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:49:46 995407 [42803960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:49:46 995960 [42803960] -> SUBNET UP Dec 19 15:49:56 997558 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b45) -- dropping Dec 19 15:49:56 997609 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:49:56 997624 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:49:56 997663 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b45 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:49:56 997780 [43204960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:49:56 997805 [43204960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:49:56 998216 [43204960] -> SUBNET UP Dec 19 15:50:06 999247 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b4a) -- dropping Dec 19 15:50:06 999296 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:50:06 999311 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:50:06 999351 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b4a attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:50:06 999425 [42803960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:50:06 999487 [42803960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:50:06 999996 [42803960] -> SUBNET UP Dec 19 15:50:17 003083 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b4f) -- dropping Dec 19 15:50:17 003139 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:50:17 003159 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:50:17 003217 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b4f attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:50:17 003297 [41401960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:50:17 003360 [41401960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:50:17 003779 [41401960] -> SUBNET UP Dec 19 15:50:27 002576 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b54) -- dropping Dec 19 15:50:27 002663 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:50:27 002683 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:50:27 002744 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b54 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:50:27 002837 [41E02960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list Dec 19 15:50:27 002891 [41E02960] -> Directed Path Dump of 0 hop path: Path = [0] Dec 19 15:50:27 003312 [41E02960] -> SUBNET UP Dec 19 15:50:37 004082 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b59) -- dropping Dec 19 15:50:37 004139 [45007960] -> umad_receiver: ERR 5411: DR SMP Dec 19 15:50:37 004162 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Dec 19 15:50:37 004205 [45007960] -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1b59 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................0xFFFF dr_dlid.................0xFFFF Initial path: [0][2] Return path: [0][0] Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Dec 19 15:50:37 004290 [42803960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. 
Adding to light sweep sampling list
Dec 19 15:50:37 004315 [42803960] -> Directed Path Dump of 0 hop path: Path = [0]
Dec 19 15:50:37 004730 [42803960] -> SUBNET UP
Dec 19 15:50:46 205115 [42803960] -> SM port is down
Dec 19 15:50:56 206763 [42803960] -> SM port is down
Dec 19 15:50:56 206903 [42803960] -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:06 209285 [42803960] -> SM port is down
Dec 19 15:51:06 209448 [42803960] -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:16 209877 [41E02960] -> SM port is down
Dec 19 15:51:16 210032 [41E02960] -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:26 210935 [41401960] -> SM port is down
Dec 19 15:51:26 211100 [41401960] -> __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:36 214582 [41E02960] -> Entering MASTER state
Dec 19 15:51:36 228305 [42803960] -> SUBNET UP
Dec 19 15:51:36 992447 [41E02960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0009 TID:0x0000000000000003
Dec 19 15:51:36 992663 [41E02960] -> osm_report_notice: Reporting Generic Notice type:4 num:144 from LID:0x0009 GID:0xfe80000000000000,0x0002c9020022cd26
Dec 19 15:51:36 994495 [41401960] -> SUBNET UP
Dec 19 15:51:47 014297 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b89) -- dropping
Dec 19 15:51:47 014371 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:51:47 014386 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:51:47 014426 [45007960] -> SMP dump:
    base_ver................0x1
    mgmt_class..............0x81
    class_ver...............0x1
    method..................0x1 (SubnGet)
    D bit...................0x0
    status..................0x0
    hop_ptr.................0x0
    hop_count...............0x1
    trans_id................0x1b89
    attr_id.................0x11 (NodeInfo)
    resv....................0x0
    attr_mod................0x0
    m_key...................0x0000000000000000
    dr_slid.................0xFFFF
    dr_dlid.................0xFFFF

    Initial path: [0][2]
    Return path: [0][0]
    Reserved: [0][0][0][0][0][0][0]

    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Dec 19 15:51:47 014531 [41E02960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:47 014552 [41E02960] -> Removed port with GUID:0x0002c9020022cd26 LID range [0x9,0x9] of node:Native Infiniband Storage - LSI Logic, Engenio Storage Group
Dec 19 15:51:47 014570 [41E02960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:47 014586 [41E02960] -> Removed port with GUID:0x0002c9020022cce2 LID range [0x1,0x1] of node:p49 HCA-1
Dec 19 15:51:47 014658 [41E02960] -> __osm_lid_mgr_process_our_sm_node: ERR 0308: Can't acquire SM's port object, GUID = 0x0002c9020022cce2
Dec 19 15:51:47 015001 [41E02960] -> SUBNET UP
Dec 19 15:51:51 371737 [41401960] -> osm_pr_rcv_process: ERR 1F16: Cannot find requester physical port
Dec 19 15:51:56 216932 [41401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:56 217034 [41401960] -> Discovered new port with GUID:0x0002c9020022cce2 LID range [0x1,0x1] of node:p49 HCA-1
Dec 19 15:51:56 217045 [41401960] -> osm_report_notice: Reporting Generic Notice type:3 num:64 from LID:0x0001 GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:56 217122 [41401960] -> Discovered new port with GUID:0x0002c9020022cd26 LID range [0x9,0x9] of node:Native Infiniband Storage - LSI Logic, Engenio Storage Group
Dec 19 15:51:56 217432 [41401960] -> SUBNET UP
Dec 19 15:52:06 217884 [43204960] -> SUBNET UP
Dec 19 15:52:16 222523 [42803960] -> SUBNET UP
Dec 19 15:52:26 221109 [42803960] -> SUBNET UP
Dec 19 15:52:36 222369 [42803960] -> SUBNET UP
Dec 19 15:52:46 224523 [41401960] -> SUBNET UP
Dec 19 15:52:52 902536 [95AB6160] -> Exiting SM
Dec 19 15:54:17 354494 [95AB6160] -> OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision
Dec 19 17:09:20 792650 [95AB6160] -> OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision

-----Original Message-----
From: Batwara, Ashish
Sent: Tuesday, December 19, 2006 5:22 PM
To: 'Hal Rosenstock'
Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
Subject: RE: [openib-general] opensm

Hi,
Please look towards the end of the attached file.

Thanks
Ashish

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Tuesday, December 19, 2006 5:06 PM
To: Batwara, Ashish
Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
Subject: Re: [openib-general] opensm

Ashish,

On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> Hi,
>
> Here is the info that you have asked. I am seeing the Subnet manager
> is up now having the port active. But server is not able to discover
> the target. I am seeing the error "Got failed path rec status -110" on
> Linux console.

That means the request for an SA PathRecord from the initiator to the
target failed (-110 is ETIMEDOUT). Are you sure the target is up
(ACTIVE) on the subnet ? If it is, can you send the opensm log ?

-- Hal

> Below are the output of different commands.
> I am using following to
> discover the target:
>
> /etc/init.d/opensmd start
> /etc/init.d/openibd start
> modprobe ib_srp
> echo
> id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000
002c9020022cd26,pkey=ffff,service_id=200300a0b811c847
> /sys/class/infiniband_srp/srp-mthca0-2/add_target
>
> [root at p49 ~]# ibv_devinfo
> hca_id: mthca0
>     fw_ver: 5.1.400
>     node_guid: 0002:c902:0022:cce0
>     sys_image_guid: 0002:c902:0022:cce3
>     vendor_id: 0x02c9
>     vendor_part_id: 25218
>     hw_ver: 0xA0
>     board_id: MT_0370130002
>     phys_port_cnt: 2
>         port: 1
>             state: PORT_DOWN (1)
>             max_mtu: 2048 (4)
>             active_mtu: 512 (2)
>             sm_lid: 0
>             port_lid: 0
>             port_lmc: 0x00
>
>         port: 2
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 1
>             port_lid: 1
>             port_lmc: 0x00
> hca_id: mthca1
>     fw_ver: 5.1.400
>     node_guid: 0002:c902:0022:cd2c
>     sys_image_guid: 0002:c902:0022:cd2f
>     vendor_id: 0x02c9
>     vendor_part_id: 25218
>     hw_ver: 0xA0
>     board_id: MT_0370130002
>     phys_port_cnt: 2
>         port: 1
>             state: PORT_DOWN (1)
>             max_mtu: 2048 (4)
>             active_mtu: 512 (2)
>             sm_lid: 0
>             port_lid: 0
>             port_lmc: 0x00
>
>         port: 2
>             state: PORT_DOWN (1)
>             max_mtu: 2048 (4)
>             active_mtu: 512 (2)
>             sm_lid: 0
>             port_lid: 0
>             port_lmc: 0x00
>
> [root at p49 ~]# uname -a
> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
> EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>
> [root at p49 ~]# cat /etc/infiniband/info
> #!/bin/bash
>
> echo prefix=/usr/local/ofed
> echo Kernel=2.6.9-42.0.3.ELsmp
> echo
> echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
> --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> echo
>
> OFED Version: OFED-1.1
>
> Thanks
> Ashish
>
> -----Original Message-----
> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> Sent: Tuesday, December 19, 2006 5:18 AM
> To: Batwara, Ashish
> Cc: ishai at mellanox.co.il; openib-general at openib.org
> Subject: Re: [openib-general] opensm
>
> Hi Ashish,
>
> SRP people say they have no such error message.
> OpenSM does. So I take it back.
>
> Ashish,
> Please provide more into:
>
> 1. ibv_devinfo
> 2. Version of code you are using
> 3. Command line you use for starting opensm
> 4. /var/log/osm.log
>
> Thanks and sorry for the confusion.
>
> EZ
>
> Eitan Zahavi wrote:
> > This is not an OpenSM issue.
> > Forwarded to the SRP people.
> >
> > EZ
> > Batwara, Ashish wrote:
> >
> >> Hi,
> >> I am trying to run opensm on Linux server. It has two HCAs
> (4-ports) and
> >> connected to IB Switch. ibnodes command displays the information
> about
> >> the Switch ports and HCA ports.
> >> When I start opensm, I see in /var/log/messages "Starting
> srp_daemon"
> >> for all the 4 ports and immediately after I see "failed srp_daemon"
> for
> >> all the ports and the displays "SM Port is down".
> >>
> >> I tried several times and even rebooted the server few times but no
> >> luck.
> >>
> >> Does anybody know what this problem is?
> >>
> >> Thanks
> >> Ashish

From vishal at endace.com Tue Dec 19 18:03:17 2006
From: vishal at endace.com (vishal)
Date: Wed, 20 Dec 2006 15:03:17 +1300
Subject: [openib-general] iSER target
Message-ID: <1166580197.6798.2.camel@julia.et.endace.com>

Hi,

I would like to confirm if the iSER target code in the gen2 branch
is functional. If yes, is there a readme/installation guide
available...

Thanks a lot!

Vishal

From halr at voltaire.com Tue Dec 19 20:35:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 23:35:00 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159CFD@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159CFD@NAMAIL2.ad.lsil.com>
Message-ID: <1166589299.4519.18010.camel@hal.voltaire.com>

On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote:
> Hi,
> Please look towards the end of the attached file.

What options are you starting opensm with ? What is the command line ?

Also, it looks like (at least at one point) you have another SM on the
subnet.
What is the make (vendor) for your switch ?

I see many SM port is DOWN. What is going on with this port ? Why is the
physical link not LinkUp and stable ? That is the main issue and is
likely why the SubnGet of NodeInfo is not being responded to.

-- Hal

> Thanks
> Ashish
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Tuesday, December 19, 2006 5:06 PM
> To: Batwara, Ashish
> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> Subject: Re: [openib-general] opensm
>
> Ashish,
>
> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> > Hi,
> >
> > Here is the info that you have asked. I am seeing the Subnet manager
> > is up now having the port active. But server is not able to discover
> > the target. I am seeing the error "Got failed path rec status -110" on
> > Linux console.
>
> That means the request for an SA PathRecord from the initiator to the
> target failed (-110 is ETIMEDOUT). Are you sure the target is up
> (ACTIVE) on the subnet ? If it is, can you send the opensm log ?
>
> -- Hal
>
> > Below are the output of different commands.
> > I am using following to
> > discover the target:
> >
> > /etc/init.d/opensmd start
> > /etc/init.d/openibd start
> > modprobe ib_srp
> > echo
> >
> id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000
> 002c9020022cd26,pkey=ffff,service_id=200300a0b811c847
> > /sys/class/infiniband_srp/srp-mthca0-2/add_target
> >
> > [root at p49 ~]# ibv_devinfo
> > hca_id: mthca0
> >     fw_ver: 5.1.400
> >     node_guid: 0002:c902:0022:cce0
> >     sys_image_guid: 0002:c902:0022:cce3
> >     vendor_id: 0x02c9
> >     vendor_part_id: 25218
> >     hw_ver: 0xA0
> >     board_id: MT_0370130002
> >     phys_port_cnt: 2
> >         port: 1
> >             state: PORT_DOWN (1)
> >             max_mtu: 2048 (4)
> >             active_mtu: 512 (2)
> >             sm_lid: 0
> >             port_lid: 0
> >             port_lmc: 0x00
> >
> >         port: 2
> >             state: PORT_ACTIVE (4)
> >             max_mtu: 2048 (4)
> >             active_mtu: 2048 (4)
> >             sm_lid: 1
> >             port_lid: 1
> >             port_lmc: 0x00
> > hca_id: mthca1
> >     fw_ver: 5.1.400
> >     node_guid: 0002:c902:0022:cd2c
> >     sys_image_guid: 0002:c902:0022:cd2f
> >     vendor_id: 0x02c9
> >     vendor_part_id: 25218
> >     hw_ver: 0xA0
> >     board_id: MT_0370130002
> >     phys_port_cnt: 2
> >         port: 1
> >             state: PORT_DOWN (1)
> >             max_mtu: 2048 (4)
> >             active_mtu: 512 (2)
> >             sm_lid: 0
> >             port_lid: 0
> >             port_lmc: 0x00
> >
> >         port: 2
> >             state: PORT_DOWN (1)
> >             max_mtu: 2048 (4)
> >             active_mtu: 512 (2)
> >             sm_lid: 0
> >             port_lid: 0
> >             port_lmc: 0x00
> >
> > [root at p49 ~]# uname -a
> > Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
> > EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> >
> > [root at p49 ~]# cat /etc/infiniband/info
> > #!/bin/bash
> >
> > echo prefix=/usr/local/ofed
> > echo Kernel=2.6.9-42.0.3.ELsmp
> > echo
> > echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
> > --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
> > --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> > --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> > --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> > --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> > --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> > echo
> >
> > OFED Version: OFED-1.1
> >
> > Thanks
> > Ashish
> >
> > -----Original Message-----
> > From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> > Sent: Tuesday, December 19, 2006 5:18 AM
> > To: Batwara, Ashish
> > Cc: ishai at mellanox.co.il; openib-general at openib.org
> > Subject: Re: [openib-general] opensm
> >
> > Hi Ashish,
> >
> > SRP people say they have no such error message.
> > OpenSM does. So I take it back.
> >
> > Ashish,
> > Please provide more into:
> >
> > 1. ibv_devinfo
> > 2. Version of code you are using
> > 3. Command line you use for starting opensm
> > 4. /var/log/osm.log
> >
> > Thanks and sorry for the confusion.
> >
> > EZ
> >
> > Eitan Zahavi wrote:
> > > This is not an OpenSM issue.
> > > Forwarded to the SRP people.
> > >
> > > EZ
> > > Batwara, Ashish wrote:
> > >
> > >> Hi,
> > >> I am trying to run opensm on Linux server. It has two HCAs
> > (4-ports) and
> > >> connected to IB Switch. ibnodes command displays the information
> > about
> > >> the Switch ports and HCA ports.
> > >> When I start opensm, I see in /var/log/messages "Starting
> > srp_daemon"
> > >> for all the 4 ports and immediately after I see "failed srp_daemon"
> > for
> > >> all the ports and the displays "SM Port is down".
> > >>
> > >> I tried several times and even rebooted the server few times but no
> > >> luck.
> > >>
> > >> Does anybody know what this problem is?
> > >>
> > >> Thanks
> > >> Ashish

From ogerlitz at voltaire.com Tue Dec 19 23:48:09 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 20 Dec 2006 09:48:09 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219160221.GE3428@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il>
Message-ID: <4588EAB9.6080106@voltaire.com>

Michael S. Tsirkin wrote:
> I am not yet sure what is best for upstream, so I don't really think we need
> any RFCs.
> We'll need data from SM guys on whether MTU selector actually works
> in SMs, and if not what happens when you enable it.

Eitan,

Can you please post here the tavor-quirk patch which was integrated into
opensm? i can see the ***code*** of the opensm but might make some wrong
assumptions or get into wrong understandings as i am not able to see the
patch as is.

Or.
From kliteyn at dev.mellanox.co.il Wed Dec 20 00:48:53 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:48:53 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t
In-Reply-To: <1166567251.4519.442.camel@hal.voltaire.com>
References: <45883EF4.1050705@dev.mellanox.co.il> <1166567251.4519.442.camel@hal.voltaire.com>
Message-ID: <4588F8F5.70007@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Hi Yevgeny,
>
> On Tue, 2006-12-19 at 14:35, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> Adding max_lid_ho field to osm_switch_t to allow routing
>> engines that don't use lid matrices to explicitly set the
>> max lid (in host order) that is reachable from the switch.
>
> One minor comment below.
>
>> Signed-off-by: Yevgeny Kliteynik
>> ---
>>  osm/include/opensm/osm_switch.h | 37 +++++++++++++++++++++++++++++++++++++
>>  osm/opensm/osm_switch.c | 2 ++
>>  2 files changed, 39 insertions(+), 0 deletions(-)
>>
>> diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
>> index 4570f61..d2089bd 100644
>> --- a/osm/include/opensm/osm_switch.h
>> +++ b/osm/include/opensm/osm_switch.h
>> @@ -107,6 +107,7 @@ typedef struct _osm_switch
>>      ib_switch_info_t switch_info;
>>      osm_fwd_tbl_t fwd_tbl;
>>      osm_lid_matrix_t lmx;
>> +    uint16_t max_lid_ho;
>>      osm_port_profile_t *p_prof;
>>      osm_mcast_tbl_t mcast_tbl;
>>      uint32_t discovery_count;
>> @@ -129,6 +130,9 @@ typedef struct _osm_switch
>>  *     LID Matrix for this switch containing the hop count
>>  *     to every LID from every port.
>>  *
>> +* max_lid_ho
>> +*     Max LID that is accessible from this switch
>> +*
>>  * p_pro
>>  *     Pointer to array of Port Profile objects for this switch.
>>  *
>> @@ -793,6 +797,8 @@ static inline uint16_t
>>  osm_switch_get_max_lid_ho(
>>      IN const osm_switch_t* const p_sw )
>>  {
>> +    if (p_sw->max_lid_ho != 0)
>> +        return p_sw->max_lid_ho;
>>      return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
>>  }
>>  /*
>> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>>  * SEE ALSO
>>  *********/
>>
>> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
>> +* NAME
>> +*     osm_switch_set_max_lid_ho
>> +*
>> +* DESCRIPTION
>> +*     Set the maximum LID (host order) value accessed from this switch
>> +* SYNOPSIS
>> +*/
>> +static inline void
>> +osm_switch_set_max_lid_ho(
>> +    IN osm_switch_t* const p_sw,
>> +    IN uint16_t max_lid_ho )
>> +{
>> +    p_sw->max_lid_ho = max_lid_ho;
>> +}
>> +/*
>> +* PARAMETERS
>> +*     p_sw
>> +*         [in] Pointer to a switch object.
>> +*
>> +*     max_lid_ho
>> +*         Max LID (host order) value accessed from this switch
>> +*
>> +* RETURN VALUES
>> +*     None.
>> +*
>> +* NOTES
>> +*
>> +* SEE ALSO
>> +*********/
>> +
>>  /****f* OpenSM: Switch/osm_switch_get_num_ports
>>  * NAME
>>  *     osm_switch_get_num_ports
>> diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
>> index 0dd3de5..4ca713a 100644
>> --- a/osm/opensm/osm_switch.c
>> +++ b/osm/opensm/osm_switch.c
>> @@ -122,6 +122,8 @@ osm_switch_init(
>>      for( port_num = 0; port_num < num_ports; port_num++ )
>>          osm_port_prof_construct( &p_sw->p_prof[port_num] );
>>
>> +    p_sw->max_lid_ho = 0;
>
> This isn't really needed, is it ?
>
> Doesn't osm_switch_construct clear this ?

Right, it does.
I will issue a V2 series of patches that will address this and Sasha's comments.
> -- Hal
>
>> +
>>  Exit:
>>      return( status );
>>  }
>

From kliteyn at dev.mellanox.co.il Wed Dec 20 00:49:12 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:49:12 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to osm_switch_t
In-Reply-To: <20061219230553.GG19795@sashak.voltaire.com>
References: <45883EF4.1050705@dev.mellanox.co.il> <20061219203044.GE19795@sashak.voltaire.com> <458853A0.9060909@dev.mellanox.co.il> <20061219230553.GG19795@sashak.voltaire.com>
Message-ID: <4588F908.8050306@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 23:03 Tue 19 Dec , Yevgeny Kliteynik wrote:
>>
>>>> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>>>>  * SEE ALSO
>>>>  *********/
>>>>
>>>> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
>>>> +* NAME
>>>> +*     osm_switch_set_max_lid_ho
>>>> +*
>>>> +* DESCRIPTION
>>>> +*     Set the maximum LID (host order) value accessed from this switch
>>>> +* SYNOPSIS
>>>> +*/
>>>> +static inline void
>>>> +osm_switch_set_max_lid_ho(
>>>> +    IN osm_switch_t* const p_sw,
>>>> +    IN uint16_t max_lid_ho )
>>>> +{
>>>> +    p_sw->max_lid_ho = max_lid_ho;
>>>> +}
>>>> +/*
>>>> +* PARAMETERS
>>>> +*     p_sw
>>>> +*         [in] Pointer to a switch object.
>>>> +*
>>>> +*     max_lid_ho
>>>> +*         Max LID (host order) value accessed from this switch
>>>> +*
>>>> +* RETURN VALUES
>>>> +*     None.
>>>> +*
>>>> +* NOTES
>>>> +*
>>>> +* SEE ALSO
>>>> +*********/
>>>> +
>>> Do we need those +31 lines of code instead of just
>>> p_sw->max_lid_ho = N; ?
>> Since there are access functions for the rest of the fields,
>> I didn't want to make an exception in this case either.
>
> I think you did anyway - there is no full set of access methods. I'm
> perfectly fine with it. And don't call you to cleanup the rest, just to
> not add new ones.

You're right - setter is not needed.
I will issue a V2 series of patches that will address this and Hal's comments.
-- Yevgeny

> Sasha
>

From danb at voltaire.com Wed Dec 20 00:54:16 2006
From: danb at voltaire.com (Dan Bar Dov)
Date: Wed, 20 Dec 2006 10:54:16 +0200
Subject: [openib-general] iSER target
Message-ID: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com>

The iser target code in the gen2 branch is functional over kdapl.
It requires iscsi target code above it; however, that iscsi code is
not open.

It was opened as a precursor for an open-source iscsi/iser-target project.
That project is still in its early stages, and the plan is to add
iser-target support, loosely based on the open-iser-target code, to the
stgt project.

Due to the above, there is no readme/installation guide.

Dan

> -----Original Message-----
> From: openib-general-bounces at openib.org
> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal
> Sent: Wednesday, December 20, 2006 4:03 AM
> To: openib-general at openib.org
> Subject: [openib-general] iSER target
>
> Hi,
>
> I would like to confirm if the iSER target code in the gen2 branch
> is functional. If yes, is there a readme/installation guide
> available...
>
> Thanks a lot!
>
> Vishal

From kliteyn at dev.mellanox.co.il Wed Dec 20 00:51:56 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:51:56 +0200
Subject: [openib-general] [PATCHv2] osm: adding max_lid_ho field to osm_switch_t
Message-ID: <4588F9AC.5040401@dev.mellanox.co.il>

Hi Hal

[V2 of the patch - removed setter and unnecessary initialization]

Adding max_lid_ho field to osm_switch_t to allow routing
engines that don't use lid matrices to explicitly set the
max lid (in host order) that is reachable from the switch.
Signed-off-by: Yevgeny Kliteynik
---
 osm/include/opensm/osm_switch.h | 6 ++++++
 1 file changed, 6 insertions(+), 0 deletions(-)

diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
index 4570f61..d2089bd 100644
--- a/osm/include/opensm/osm_switch.h
+++ b/osm/include/opensm/osm_switch.h
@@ -107,6 +107,7 @@ typedef struct _osm_switch
     ib_switch_info_t switch_info;
     osm_fwd_tbl_t fwd_tbl;
     osm_lid_matrix_t lmx;
+    uint16_t max_lid_ho;
     osm_port_profile_t *p_prof;
     osm_mcast_tbl_t mcast_tbl;
     uint32_t discovery_count;
@@ -129,6 +130,9 @@ typedef struct _osm_switch
 *     LID Matrix for this switch containing the hop count
 *     to every LID from every port.
 *
+* max_lid_ho
+*     Max LID that is accessible from this switch
+*
 * p_pro
 *     Pointer to array of Port Profile objects for this switch.
 *
@@ -793,6 +797,8 @@ static inline uint16_t
 osm_switch_get_max_lid_ho(
     IN const osm_switch_t* const p_sw )
 {
+    if (p_sw->max_lid_ho != 0)
+        return p_sw->max_lid_ho;
     return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
 }
 /*
-- 
1.4.4.1.GIT

From kliteyn at dev.mellanox.co.il Wed Dec 20 00:54:50 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:54:50 +0200
Subject: [openib-general] [PATCHv2] osm: improving FatTree routing engine
Message-ID: <4588FA5A.1070802@dev.mellanox.co.il>

Hi Hal

[V2 of the patch - not using max_lid_ho setter]

FatTree routing engine improvements:
 1. Improved building of LFTs
 2. Setting max lid on osm switches
 3. Using ucast manager LFT dump function
 4. Stopped using global variable 'osm'
 5. Improved logging
 6. Some cosmetics

Signed-off-by: Yevgeny Kliteynik
---
 osm/opensm/osm_ucast_ftree.c | 439 +++++++++++++++++++++++++++---------------
 1 files changed, 281 insertions(+), 158 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 15e4cd0..0d7188a 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -57,9 +57,6 @@
 #include
 #include

-/* This var is predefined and initialized */
-extern osm_opensm_t osm;
-
 /*
  * FatTree rank is bounded between 2 and 8:
  * - Tree of rank 1 has only trivial routing pathes,
@@ -211,14 +208,16 @@ typedef struct ftree_hca_t_ {

 typedef struct ftree_fabric_t_
 {
-    cl_qmap_t hca_tbl;
-    cl_qmap_t sw_tbl;
-    cl_qmap_t sw_by_tuple_tbl;
-    uint32_t tree_rank;
-    ftree_sw_t ** leaf_switches;
-    uint32_t leaf_switches_num;
-    uint16_t max_hcas_per_leaf;
-    cl_pool_t sw_fwd_tbl_pool;
+    osm_opensm_t * p_osm;
+    cl_qmap_t hca_tbl;
+    cl_qmap_t sw_tbl;
+    cl_qmap_t sw_by_tuple_tbl;
+    uint32_t tree_rank;
+    ftree_sw_t ** leaf_switches;
+    uint32_t leaf_switches_num;
+    uint16_t max_hcas_per_leaf;
+    cl_pool_t sw_fwd_tbl_pool;
+    uint16_t lft_max_lid_ho;
 } ftree_fabric_t;

 /***************************************************
@@ -506,6 +505,7 @@ __osm_ftree_port_group_destroy(

 static void
 __osm_ftree_port_group_dump(
+    IN ftree_fabric_t *p_ftree,
     IN ftree_port_group_t * p_group,
     IN ftree_direction_t direction)
 {
@@ -517,7 +517,7 @@ __osm_ftree_port_group_dump(
     if (!p_group)
         return;

-    if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+    if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
         return;

     size = cl_ptr_vector_get_size(&p_group->ports);
@@ -533,7 +533,7 @@ __osm_ftree_port_group_dump(
         sprintf(buff + strlen(buff), "%u", p_port->port_num);
     }

-    osm_log(&osm.log, OSM_LOG_DEBUG,
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
             "__osm_ftree_port_group_dump:"
             " Port Group of size %u, port(s): %s, direction: %s\n"
             " Local <--> Remote GUID (LID):"
@@ -648,16 +648,17 @@ __osm_ftree_sw_destroy(

 static void
 __osm_ftree_sw_dump(
+    IN ftree_fabric_t * p_ftree,
     IN ftree_sw_t * p_sw)
 {
     uint32_t i;

     if (!p_sw)
         return;

-    if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+    if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
         return;

-    osm_log(&osm.log, OSM_LOG_DEBUG,
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
             "__osm_ftree_sw_dump: "
             "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n",
             __osm_ftree_tuple_to_str(p_sw->tuple),
@@ -665,10 +666,14 @@ __osm_ftree_sw_dump(
             p_sw->down_port_groups_num,
             p_sw->up_port_groups_num);

-    for( i = 0; i < p_sw->down_port_groups_num; i++ )
-        __osm_ftree_port_group_dump(p_sw->down_port_groups[i], FTREE_DIRECTION_DOWN);
-    for( i = 0; i < p_sw->up_port_groups_num; i++ )
-        __osm_ftree_port_group_dump(p_sw->up_port_groups[i], FTREE_DIRECTION_UP);
+    for( i = 0; i < p_sw->down_port_groups_num; i++ )
+        __osm_ftree_port_group_dump(p_ftree,
+                                    p_sw->down_port_groups[i],
+                                    FTREE_DIRECTION_DOWN);
+    for( i = 0; i < p_sw->up_port_groups_num; i++ )
+        __osm_ftree_port_group_dump(p_ftree,
+                                    p_sw->up_port_groups[i],
+                                    FTREE_DIRECTION_UP);

 } /* __osm_ftree_sw_dump() */

@@ -823,23 +828,26 @@ __osm_ftree_hca_destroy(

 static void
 __osm_ftree_hca_dump(
+    IN ftree_fabric_t * p_ftree,
     IN ftree_hca_t * p_hca)
 {
     uint32_t i;

     if (!p_hca)
         return;

-    if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+    if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG))
         return;

-    osm_log(&osm.log, OSM_LOG_DEBUG,
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
             "__osm_ftree_hca_dump: "
             "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
             cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
             p_hca->up_port_groups_num);

     for( i = 0; i < p_hca->up_port_groups_num; i++ )
-        __osm_ftree_port_group_dump(p_hca->up_port_groups[i],FTREE_DIRECTION_UP);
+        __osm_ftree_port_group_dump(p_ftree,
+                                    p_hca->up_port_groups[i],
+                                    FTREE_DIRECTION_UP);
 }

 /***************************************************/
@@ -1050,6 +1058,10 @@ __osm_ftree_fabric_add_sw(ftree_fabric_t
     cl_qmap_insert(&p_ftree->sw_tbl,
                    p_osm_sw->p_node->node_info.node_guid,
                    &p_sw->map_item);
+
+    /* track the max lid (in host order) that exists in the fabric */
+    if (cl_ntoh16(p_sw->base_lid) > p_ftree->lft_max_lid_ho)
+        p_ftree->lft_max_lid_ho = cl_ntoh16(p_sw->base_lid);
 }

 /***************************************************/
@@ -1096,38 +1108,38 @@ __osm_ftree_fabric_dump(ftree_fabric_t *
     ftree_hca_t * p_hca;
     ftree_sw_t * p_sw;

-    if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+    if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG))
         return;

-    osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
             " |-------------------------------|\n"
             " |- Full fabric topology dump -|\n"
             " |-------------------------------|\n\n");

-    osm_log(&osm.log, OSM_LOG_DEBUG,
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
             "__osm_ftree_fabric_dump: -- HCAs:\n");

     for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
           p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl);
           p_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item) )
     {
-        __osm_ftree_hca_dump(p_hca);
+        __osm_ftree_hca_dump(p_ftree, p_hca);
     }

     for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
     {
-        osm_log(&osm.log, OSM_LOG_DEBUG,
+        osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                 "__osm_ftree_fabric_dump: -- Rank %u switches\n", i);
         for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
               p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl);
               p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
         {
             if (p_sw->rank == i)
-                __osm_ftree_sw_dump(p_sw);
+                __osm_ftree_sw_dump(p_ftree, p_sw);
         }
     }

-    osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
             " |---------------------------------------|\n"
             " |- Full fabric topology dump completed -|\n"
             " |---------------------------------------|\n\n");
@@ -1143,16 +1155,18 @@ __osm_ftree_fabric_dump_general_info(
     ftree_sw_t * p_sw;
     char * addition_str;

-    osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info:\n");
-    osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+            "__osm_ftree_fabric_dump_general_info: "
             "General fabric topology info\n");
-    osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
             "============================\n");

-    osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+            "__osm_ftree_fabric_dump_general_info: "
             " - FatTree rank (switches only): %u\n", p_ftree->tree_rank);
-    osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+            "__osm_ftree_fabric_dump_general_info: "
             " - Fabric has %u HCAs, %u switches\n",
             cl_qmap_count(&p_ftree->hca_tbl),
             cl_qmap_count(&p_ftree->sw_tbl));
@@ -1174,13 +1188,15 @@ __osm_ftree_fabric_dump_general_info(
             addition_str = " (leaf) ";
         else
             addition_str = " ";
-        osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
-                " - Fabric has %u rank %u%sswitches\n",j,i,addition_str);
+        osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+                "__osm_ftree_fabric_dump_general_info: "
+                " - Fabric has %u rank %u%sswitches\n",
+                j,i,addition_str);
     }

-    if (osm_log_is_active(&osm.log,OSM_LOG_VERBOSE))
+    if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_VERBOSE))
     {
-        osm_log(&osm.log, OSM_LOG_VERBOSE,
+        osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                 "__osm_ftree_fabric_dump_general_info: "
                 " - Root switches:\n");
         for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
@@ -1188,7 +1204,7 @@ __osm_ftree_fabric_dump_general_info(
               p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
         {
             if (p_sw->rank == 0)
-                osm_log(&osm.log, OSM_LOG_VERBOSE,
+                osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                         "__osm_ftree_fabric_dump_general_info: "
                         " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
                         cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
@@ -1196,15 +1212,17 @@ __osm_ftree_fabric_dump_general_info(
                         __osm_ftree_tuple_to_str(p_sw->tuple));
         }

-        osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_dump_general_info: "
+        osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+                "__osm_ftree_fabric_dump_general_info: "
                 " - Leaf switches (sorted by index):\n");
         for (i = 0; i < p_ftree->leaf_switches_num; i++)
         {
-            osm_log(&osm.log, OSM_LOG_VERBOSE,
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                     "__osm_ftree_fabric_dump_general_info: "
                     " GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
                     cl_ntoh64(osm_node_get_node_guid(
-                        osm_switch_get_node_ptr(p_ftree->leaf_switches[i]->p_osm_sw))),
+                        osm_switch_get_node_ptr(
+                            p_ftree->leaf_switches[i]->p_osm_sw))),
                     cl_ntoh16(p_ftree->leaf_switches[i]->base_lid),
                     __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple));
         }
@@ -1229,15 +1247,15 @@ __osm_ftree_fabric_dump_hca_ordering(
     char * filename = "osm-ftree-ca-order.dump";

     snprintf(path, sizeof(path), "%s/%s",
-             osm.subn.opt.dump_files_dir, filename);
+             p_ftree->p_osm->subn.opt.dump_files_dir, filename);
     p_hca_ordering_file = fopen(path, "w");
     if (!p_hca_ordering_file)
     {
-        osm_log(&osm.log, OSM_LOG_ERROR,
+        osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
                 "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: "
                 "cannot open file \'%s\': %s\n",
                 filename, strerror(errno));
-        OSM_LOG_EXIT(&(osm.log));
+        OSM_LOG_EXIT(&p_ftree->p_osm->log);
         return;
     }

@@ -1383,9 +1401,9 @@ __osm_ftree_fabric_make_indexing(
     cl_list_t bfs_list;
     ftree_sw_tbl_element_t * p_sw_tbl_element;

-    OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_make_indexing);
+    OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_make_indexing);

-    osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
             "Starting FatTree indexing\n");

     /* create array of leaf switches */
@@ -1411,8 +1429,8 @@ __osm_ftree_fabric_make_indexing(
        This fuction also adds the switch it into the switch_by_tuple table. */
     __osm_ftree_fabric_assign_first_tuple(p_ftree,p_sw);

-    osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
-            "Indexing starting point:\n"
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+            "__osm_ftree_fabric_make_indexing: Indexing starting point:\n"
             " - Switch rank : %u\n"
             " - Switch index : %s\n"
             " - Node LID : 0x%x\n"
@@ -1537,7 +1555,7 @@ __osm_ftree_fabric_make_indexing(
             sizeof(ftree_sw_t *), /* size of each element */
             __osm_ftree_compare_switches_by_index); /* comparator */

-    OSM_LOG_EXIT(&(osm.log));
+    OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_make_indexing() */

 /***************************************************/
@@ -1555,15 +1573,17 @@ __osm_ftree_fabric_validate_topology(
     boolean_t res = TRUE;
     uint8_t i;

-    OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_validate_topology);
+    OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_validate_topology);

-    osm_log(&osm.log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_validate_topology: "
+    osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+            "__osm_ftree_fabric_validate_topology: "
             "Validating fabric topology\n");

     reference_sw_arr = (ftree_sw_t **)malloc(tree_rank * sizeof(ftree_sw_t *));
     if ( reference_sw_arr == NULL )
     {
-        osm_log(&osm.log, OSM_LOG_SYS,"Fat-tree routing: Memory allocation failed\n");
+        osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+                "Fat-tree routing: Memory allocation failed\n");
         return FALSE;
     }
     memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *));
@@ -1587,7 +1607,8 @@ __osm_ftree_fabric_validate_topology(

         if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != p_sw->up_port_groups_num )
         {
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_validate_topology: "
                     "ERR AB09: Different number of upward port groups on switches:\n"
                     " GUID 0x%016"
PRIx64 ", LID 0x%x, Index %s - %u groups\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n", @@ -1607,7 +1628,8 @@ __osm_ftree_fabric_validate_topology( reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num ) { /* we're allowing some hca's to be missing */ - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB0A: Different number of downward port groups on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n", @@ -1631,7 +1653,8 @@ __osm_ftree_fabric_validate_topology( p_group = p_sw->up_port_groups[i]; if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports)) { - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB0B: Different number of ports in an upward port group on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", @@ -1658,7 +1681,8 @@ __osm_ftree_fabric_validate_topology( p_group = p_sw->down_port_groups[0]; if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports)) { - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " "ERR AB0C: Different number of ports in an downward port group on switches:\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n" " GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n", @@ -1679,14 +1703,16 @@ __osm_ftree_fabric_validate_topology( } /* end of while */ if (res == TRUE) - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_validate_topology: " - "Fabric topology has been identified as FatTree\n"); + 
osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_fabric_validate_topology: " + "Fabric topology has been identified as FatTree\n"); else - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: " - "ERR AB0D: Fabric topology hasn't been identified as FatTree\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_validate_topology: " + "ERR AB0D: Fabric topology hasn't been identified as FatTree\n"); free(reference_sw_arr); - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_fabric_validate_topology() */ @@ -1699,8 +1725,17 @@ __osm_ftree_set_sw_fwd_table( IN void *context) { ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item; - memcpy(osm.sm.ucast_mgr.lft_buf, p_sw->lft_buf, FTREE_FWD_TBL_LEN); - osm_ucast_mgr_set_fwd_table(&osm.sm.ucast_mgr,p_sw->p_osm_sw); + ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; + + /* calculate lft length rounded up to a multiple of 64 (block length) */ + uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64); + + p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho; + + memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, + p_sw->lft_buf, + lft_len); + osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } /*************************************************** @@ -1746,8 +1781,6 @@ __osm_ftree_fabric_route_upgoing_by_goin if (p_sw->down_port_groups_num == 0) return; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_upgoing_by_going_down); - /* foreach down-going port group (in indexing order) */ for (i = 0; i < p_sw->down_port_groups_num; i++) { @@ -1823,7 +1856,7 @@ __osm_ftree_fabric_route_upgoing_by_goin __osm_ftree_sw_set_fwd_table_block(p_remote_sw, cl_ntoh16(target_lid), p_min_port->remote_port_num); - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_upgoing_by_going_down: " "Switch %s: set path to HCA LID 0x%x through port %u\n", 
__osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -1855,7 +1888,6 @@ __osm_ftree_fabric_route_upgoing_by_goin } /* done scanning all the down-going port groups */ - OSM_LOG_EXIT(&(osm.log)); } /* __osm_ftree_fabric_route_upgoing_by_going_down() */ /***************************************************/ @@ -1892,8 +1924,6 @@ __osm_ftree_fabric_route_downgoing_by_go /* we shouldn't enter here if both real_lid and main_path are false */ CL_ASSERT(is_real_lid || is_main_path); - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_downgoing_by_going_up); - /* If this switch isn't a leaf switch: Assign upgoing ports by stepping down, starting on THIS switch. */ if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1)) @@ -1909,10 +1939,7 @@ __osm_ftree_fabric_route_downgoing_by_go /* recursion stop condition - if it's a root switch, */ if (p_sw->rank == 0) - { - OSM_LOG_EXIT(&(osm.log)); return; - } /* Find the least loaded port of all the upgoing port groups (in indexing order of the remote switches). */ @@ -1982,7 +2009,7 @@ __osm_ftree_fabric_route_downgoing_by_go { if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) { - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n", (is_real_lid)? 
"real" : "DUMMY", @@ -2000,7 +2027,7 @@ __osm_ftree_fabric_route_downgoing_by_go cl_ntoh16(target_lid), p_min_port->remote_port_num); p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = p_min_port->remote_port_num; - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " "Switch %s: set path to HCA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -2020,10 +2047,7 @@ __osm_ftree_fabric_route_downgoing_by_go /* we're done for the third case */ if (!is_real_lid) - { - OSM_LOG_EXIT(&(osm.log)); return; - } /* What's left to do at this point: * @@ -2064,7 +2088,7 @@ __osm_ftree_fabric_route_downgoing_by_go if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1)) { - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_downgoing_by_going_up: " " - Routing SECONDARY path for LID 0x%x: %s --> %s\n", cl_ntoh16(target_lid), @@ -2087,7 +2111,6 @@ __osm_ftree_fabric_route_downgoing_by_go FALSE); /* whether this is path to HCA that should by tracked by counters */ } - OSM_LOG_EXIT(&(osm.log)); } /* ftree_fabric_route_downgoing_by_going_up() */ /***************************************************/ @@ -2114,7 +2137,7 @@ __osm_ftree_fabric_route_to_hcas( uint32_t j; ib_net16_t remote_lid; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_hcas); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_hcas); /* for each leaf switch (in indexing order) */ for(i = 0; i < p_ftree->leaf_switches_num; i++) @@ -2133,7 +2156,7 @@ __osm_ftree_fabric_route_to_hcas( __osm_ftree_sw_set_fwd_table_block(p_sw, cl_ntoh16(remote_lid), p_port->port_num); - osm_log(&osm.log, OSM_LOG_DEBUG, + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "__osm_ftree_fabric_route_to_hcas: " "Switch %s: set path to HCA LID 0x%x through port %u\n", __osm_ftree_tuple_to_str(p_sw->tuple), @@ -2154,7 +2177,7 @@ __osm_ftree_fabric_route_to_hcas( if 
(p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num) { - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " "Routing %u dummy HCAs\n", p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++) @@ -2171,7 +2194,7 @@ __osm_ftree_fabric_route_to_hcas( } } /* done going through all the leaf switches */ - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); } /* __osm_ftree_fabric_route_to_hcas() */ /***************************************************/ @@ -2195,7 +2218,7 @@ __osm_ftree_fabric_route_to_switches( ftree_sw_t * p_sw; ftree_sw_t * p_next_sw; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_switches); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_switches); p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl); while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) ) @@ -2208,7 +2231,8 @@ __osm_ftree_fabric_route_to_switches( cl_ntoh16(p_sw->base_lid), 0); - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_switches: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_fabric_route_to_switches: " "Switch %s (LID 0x%x): routing switch-to-switch pathes\n", __osm_ftree_tuple_to_str(p_sw->tuple), cl_ntoh16(p_sw->base_lid)); @@ -2222,7 +2246,7 @@ __osm_ftree_fabric_route_to_switches( FALSE); /* whether this path should by tracked by counters */ } - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); } /* __osm_ftree_fabric_route_to_switches() */ /*************************************************** @@ -2234,18 +2258,17 @@ __osm_ftree_fabric_populate_switches( { osm_switch_t * p_osm_sw; osm_switch_t * p_next_osm_sw; - osm_opensm_t * p_osm = &osm; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_switches); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_switches); - 
p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_osm->subn.sw_guid_tbl); - while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl) ) + p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_ftree->p_osm->subn.sw_guid_tbl); + while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_ftree->p_osm->subn.sw_guid_tbl) ) { p_osm_sw = p_next_osm_sw; p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item ); __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw); } - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return 0; } /* __osm_ftree_fabric_populate_switches() */ @@ -2258,12 +2281,11 @@ __osm_ftree_fabric_populate_hcas( { osm_node_t * p_osm_node; osm_node_t * p_next_osm_node; - osm_opensm_t * p_osm = &osm; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_hcas); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_hcas); - p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_osm->subn.node_guid_tbl); - while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_osm->subn.node_guid_tbl) ) + p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_ftree->p_osm->subn.node_guid_tbl); + while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_ftree->p_osm->subn.node_guid_tbl) ) { p_osm_node = p_next_osm_node; p_next_osm_node = (osm_node_t *)cl_qmap_next(&p_osm_node->map_item); @@ -2278,16 +2300,17 @@ __osm_ftree_fabric_populate_hcas( /* all the switches added separately */ break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_populate_hcas: ERR AB0E: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_populate_hcas: ERR AB0E: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(osm_node_get_node_guid(p_osm_node)), ib_get_node_type_str(osm_node_get_type(p_osm_node))); - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return -1; } } - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return 0; } /* __osm_ftree_fabric_populate_hcas() */ @@ -2372,7 +2395,7 @@ 
__osm_ftree_rank_switches_from_hca( static uint16_t i = 0; int res = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_rank_switches_from_hca); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_rank_switches_from_hca); for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++) { @@ -2388,7 +2411,8 @@ __osm_ftree_rank_switches_from_hca( { case IB_NODE_TYPE_CA: /* HCA connected directly to another HCA - not FatTree */ - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB0F: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_rank_switches_from_hca: ERR AB0F: " "HCA conected directly to another HCA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), @@ -2405,7 +2429,8 @@ __osm_ftree_rank_switches_from_hca( break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB10: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_rank_switches_from_hca: ERR AB10: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)), ib_get_node_type_str(osm_node_get_type(p_remote_osm_node))); @@ -2423,7 +2448,8 @@ __osm_ftree_rank_switches_from_hca( if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0) continue; - osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_rank_switches_from_hca: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG, + "__osm_ftree_rank_switches_from_hca: " "Marking rank of switch that is directly connected to HCA:\n" " - HCA guid : 0x%016" PRIx64 "\n" " - Switch guid: 0x%016" PRIx64 "\n" @@ -2435,7 +2461,7 @@ __osm_ftree_rank_switches_from_hca( } Exit: - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_rank_switches_from_hca() */ @@ -2495,7 +2521,8 @@ __osm_ftree_fabric_construct_hca_ports( case IB_NODE_TYPE_CA: /* HCA connected directly to another HCA - not FatTree */ - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB11: " + 
osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_construct_hca_ports: ERR AB11: " "HCA conected directly to another HCA: " "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_node)), @@ -2508,7 +2535,8 @@ __osm_ftree_fabric_construct_hca_ports( break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB12: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_construct_hca_ports: ERR AB12: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(remote_node_guid), ib_get_node_type_str(remote_node_type)); @@ -2625,7 +2653,8 @@ __osm_ftree_fabric_construct_sw_ports( break; default: - osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_sw_ports: ERR AB13: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_construct_sw_ports: ERR AB13: " "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n", cl_ntoh64(remote_node_guid), ib_get_node_type_str(remote_node_type)); @@ -2646,6 +2675,10 @@ __osm_ftree_fabric_construct_sw_ports( remote_node_type, /* remote node type */ p_remote_hca_or_sw, /* remote ftree_hca/sw object */ direction); /* port direction (up or down) */ + + /* Track the max lid (in host order) that exists in the fabric */ + if (cl_ntoh16(remote_base_lid) > p_ftree->lft_max_lid_ho) + p_ftree->lft_max_lid_ho = cl_ntoh16(remote_base_lid); } Exit: @@ -2665,7 +2698,7 @@ __osm_ftree_fabric_perform_ranking( ftree_hca_t * p_next_hca; int res = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_perform_ranking); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking); /* Mark REVERSED rank of all the switches in the subnet. 
Start from switches that are connected to hca's, and @@ -2678,7 +2711,8 @@ __osm_ftree_fabric_perform_ranking( if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0) { res = -1; - osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB14: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_perform_ranking: ERR AB14: " "Subnet ranking failed - subnet is not FatTree"); goto Exit; } @@ -2686,7 +2720,8 @@ __osm_ftree_fabric_perform_ranking( /* calculate and set FatTree rank */ __osm_ftree_fabric_calculate_rank(p_ftree); - osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_perform_ranking: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO, + "__osm_ftree_fabric_perform_ranking: " "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree)); /* fix ranking of the switches by reversing the ranking direction */ @@ -2695,7 +2730,8 @@ __osm_ftree_fabric_perform_ranking( if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK ) { - osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, + "__osm_ftree_fabric_perform_ranking: ERR AB15: " "Tree rank is %u (should be between %u and %u)\n", __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK, @@ -2705,7 +2741,7 @@ __osm_ftree_fabric_perform_ranking( } Exit: - OSM_LOG_EXIT(&(osm.log)); + OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_fabric_perform_ranking() */ @@ -2722,7 +2758,7 @@ __osm_ftree_fabric_populate_ports( ftree_sw_t * p_next_sw; int res = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_ports); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_ports); p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl); while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) ) @@ -2748,7 +2784,7 @@ __osm_ftree_fabric_populate_ports( } } Exit: - OSM_LOG_EXIT(&(osm.log)); + 
OSM_LOG_EXIT(&p_ftree->p_osm->log); return res; } /* __osm_ftree_fabric_populate_ports() */ @@ -2756,58 +2792,61 @@ __osm_ftree_fabric_populate_ports( ***************************************************/ static int -__osm_ftree_do_routing(void *context) +__osm_ftree_construct_fabric( + IN void * context) { ftree_fabric_t * p_ftree = context; int status = 0; - OSM_LOG_ENTER(&(osm.log), __osm_ftree_do_routing); + OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_construct_fabric); - if ( cl_qmap_count(&osm.subn.sw_guid_tbl) < 2 ) + if ( cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl) < 2 ) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric has %u switches - topology is not fat-tree.\n" "Falling back to default routing.\n", - cl_qmap_count(&osm.subn.sw_guid_tbl)); + cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)); status = -1; goto Exit; } - if ( (cl_qmap_count(&osm.subn.node_guid_tbl) - - cl_qmap_count(&osm.subn.sw_guid_tbl)) < 2) + if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) - + cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)) < 2) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric has %u nodes (%u switches) - topology is not fat-tree.\n" "Falling back to default routing.\n", - cl_qmap_count(&osm.subn.node_guid_tbl), - cl_qmap_count(&osm.subn.sw_guid_tbl)); + cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl), + cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)); status = -1; goto Exit; } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n" - " |------------------------------|\n" - " |- Starting FatTree Routing -|\n" - " |------------------------------|\n\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_construct_fabric: \n" + " |----------------------------------------|\n" + " |- Starting FatTree fabric construction -|\n" + " |----------------------------------------|\n\n"); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + 
osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " "Populating FatTree switch table\n"); /* ToDo: now that the pointer from node to switch exists, no need to fill the switch table in a separate loop */ if (__osm_ftree_fabric_populate_switches(p_ftree) != 0) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not fat-tree - " "falling back to default routing\n"); status = -1; goto Exit; } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " "Populating FatTree HCA table\n"); if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not fat-tree - " "falling back to default routing\n"); status = -1; @@ -2816,7 +2855,7 @@ __osm_ftree_do_routing(void *context) if (cl_qmap_count(&p_ftree->hca_tbl) < 2) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric has %u HCAa - topology is not fat-tree.\n" "Falling back to default routing.\n", cl_qmap_count(&p_ftree->hca_tbl)); @@ -2825,12 +2864,13 @@ __osm_ftree_do_routing(void *context) } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " - "Ranking FatTree\n"); + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: Ranking FatTree\n"); + if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0) { if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK) - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric rank is %u (>%u) - " "fat-tree routing falls back to default routing\n", __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK); @@ -2841,11 +2881,12 @@ __osm_ftree_do_routing(void *context) /* For each hca and switch, construct array of ports. 
This is done after the whole FatTree data structure is ready, because we want the ports to have pointers to ftree_{sw,hca}_t objects.*/ - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " "Populating HCA & switch ports\n"); if (__osm_ftree_fabric_populate_ports(p_ftree) != 0) { - osm_log(&osm.log, OSM_LOG_SYS, + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, "Fabric topology is not a fat-tree - " "routing falls back to default routing\n"); status = -1; @@ -2863,7 +2904,7 @@ __osm_ftree_do_routing(void *context) __osm_ftree_fabric_dump_general_info(p_ftree); /* dump full tree topology */ - if (osm_log_is_active(&osm.log, OSM_LOG_DEBUG)) + if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG)) __osm_ftree_fabric_dump(p_ftree); if (! __osm_ftree_fabric_validate_topology(p_ftree)) @@ -2872,46 +2913,118 @@ __osm_ftree_do_routing(void *context) goto Exit; } - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " + "Max LID in switch LFTs (in host order): 0x%x\n", + p_ftree->lft_max_lid_ho); + + Exit: + if (status != 0) + { + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: " + "Clearing FatTree Fabric data structures\n"); + __osm_ftree_fabric_clear(p_ftree); + } + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, + "__osm_ftree_construct_fabric: \n" + " |--------------------------------------------------|\n" + " |- Done constructing FatTree fabric (status = %d) -|\n" + " |--------------------------------------------------|\n\n", + status); + + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return status; +} + +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_do_routing( + IN void * context) +{ + ftree_fabric_t * p_ftree = context; + + OSM_LOG_ENTER(&p_ftree->p_osm->log, 
__osm_ftree_do_routing); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "Starting FatTree routing\n"); + + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " "Filling switch forwarding tables for routes to HCAs\n"); __osm_ftree_fabric_route_to_hcas(p_ftree); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " "Filling switch forwarding tables for switch-to-switch pathes\n"); __osm_ftree_fabric_route_to_switches(p_ftree); /* for each switch, set its fwd table */ - cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, NULL); + cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, (void *)p_ftree); /* write out hca ordering file */ __osm_ftree_fabric_dump_hca_ordering(p_ftree); - Exit: - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " - "Clearing FatTree Fabric data structures\n"); - __osm_ftree_fabric_clear(p_ftree); + osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: " + "FatTree routing is done\n"); - osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n" - " |---------------------------------------|\n" - " |- Done FatTree Routing (status = %d) -|\n" - " |---------------------------------------|\n\n", status); + OSM_LOG_EXIT(&p_ftree->p_osm->log); + return 0; +} - OSM_LOG_EXIT(&(osm.log)); - return status; +/*************************************************** + ***************************************************/ + +static int +__osm_ftree_routing( + IN void * context) +{ + int status = __osm_ftree_construct_fabric(context); + if (status != 0) + return status; + + __osm_ftree_do_routing(context); + return 0; } /*************************************************** ***************************************************/ +void +ucast_mgr_dump_to_file( + IN osm_ucast_mgr_t *p_mgr, + IN const char *file_name, + IN void (*func)(cl_map_item_t *, void *)); + +void 
+ucast_mgr_dump_lfts( + IN cl_map_item_t *p_map_item, + void *cxt); + static void -__osm_ftree_delete(void * context) +__osm_ftree_dump_tables( + IN void * context) { - ftree_fabric_t * p_ftree = (ftree_fabric_t *)context; + ftree_fabric_t * p_ftree = context; if (!p_ftree) return; - __osm_ftree_fabric_destroy(p_ftree); + ucast_mgr_dump_to_file(&p_ftree->p_osm->sm.ucast_mgr, + "opensm-lfts.dump", + ucast_mgr_dump_lfts); +} +/*************************************************** + ***************************************************/ + +static void +__osm_ftree_delete( + IN void * context) +{ + if (!context) + return; + __osm_ftree_fabric_destroy((ftree_fabric_t *)context); } /*************************************************** @@ -2923,11 +3036,21 @@ int osm_ucast_ftree_setup(osm_opensm_t * if (!p_ftree) return -1; + p_ftree->p_osm = p_osm; + p_osm->routing_engine.context = (void *)p_ftree; - p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing; + p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_routing; + /* ToDo: Resolve multicast routing. + * Until then lid matrices are built, despite the + * fact that FatTree routing doesn't need them. 
+ * When the multicast routing will be resolved, + * __osm_ftree_routing() function should be removed, + * and here's how the FatTree routing will be set: + * p_osm->routing_engine.build_lid_matrices = __osm_ftree_construct_fabric; + * p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing; + */ + p_osm->routing_engine.ucast_dump_tables = __osm_ftree_dump_tables; p_osm->routing_engine.delete = __osm_ftree_delete; - /* ToDo: fat-tree routing doesn't use min_hop tables, so we - shouldn't fill them (p_osm->routing_engine.build_lid_matrices) */ return 0; } -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Wed Dec 20 01:09:09 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 20 Dec 2006 11:09:09 +0200 Subject: [openib-general] Routing patches Message-ID: <4588FDB5.6080909@dev.mellanox.co.il> Hi Hal. Just wanted to put some order in the routing-related patches. There are four patches that are waiting to be reviewed and applied: 1. Added an option for providing dump function per routing engine 2. [v2] Adding max_lid_ho field to osm_switch_t 3. [v2] Improving FatTree routing engine 4. Added FatTree routing to the osm manual Thanks. -- Yevgeny From ogerlitz at voltaire.com Wed Dec 20 01:35:49 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 20 Dec 2006 11:35:49 +0200 Subject: [openib-general] Performance Degradation with OFED v. Voltaire(lustre) In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net> References: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net> Message-ID: <458903F5.8030905@voltaire.com> Bernadat, Philippe wrote: > I checked. We apparently never go through this path (with lustre) Philippe, Lustre's openib nld (o2ibnld) always go through rdma_resolve_route --> cma_resolve_ib_route --> cma_query_ib_route !!! 
please add a sanity check printk to the __init code of the rdma_cm module @ drivers/infiniband/core/cma.c to see that the code you are working on is actually loaded into the kernel. But this will not help you: the Voltaire SM/SA that you are using will not return a 1K MTU based on the fixed cma-tavor-quirk patch that Michael has sent. This also holds for the Open SM/SA when it does not apply a Tavor quirk of its own... So basically, for the time being please patch Lustre o2ibnal to set the MTU to 1K (either always or under some mod param whose default is true), until the issue is discussed and decided on this list. To the best of my knowledge (Mellanox people please correct me if I am wrong): basically if you use 2K MTU for IB RC with MLX/Tavor you get a 50% BW drop, and if you use 2K MTU for IB/RC with MLX/Arbel or Sinai you get a 5% BW increase. And the BW drop problem holds if either of the parties is Tavor. Thanks for pointing out the problem and raising the issue! Or. From halr at voltaire.com Wed Dec 20 04:33:10 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 07:33:10 -0500 Subject: [openib-general] [PATCH] osm: added an option for providing dump function per routing engine In-Reply-To: <45883F79.6090109@dev.mellanox.co.il> References: <45883F79.6090109@dev.mellanox.co.il> Message-ID: <1166617989.4519.40648.camel@hal.voltaire.com> Hi Yevgeny, On Tue, 2006-12-19 at 14:37, Yevgeny Kliteynik wrote: > Hi Hal > > As you suggested, added an option for providing dump > function per routing engine. > > Signed-off-by: Yevgeny Kliteynik Thanks, Applied. One minor question below: > osm/include/opensm/osm_opensm.h | 4 ++++ > osm/opensm/osm_ucast_mgr.c | 23 ++++++++++++++--------- > 2 files changed, 18 insertions(+), 9 deletions(-) [snip...] > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > index e051c66..fcf6f72 100644 > --- a/osm/opensm/osm_ucast_mgr.c > +++ b/osm/opensm/osm_ucast_mgr.c [snip...] 
> @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process( > build and download the switch forwarding tables. > */ > > - if (!p_routing_eng->ucast_build_fwd_tables || > - p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0) > - { > - cl_qmap_apply_func( p_sw_guid_tbl, > - __osm_ucast_mgr_process_tbl, p_mgr ); > - } > + if ( p_routing_eng->ucast_build_fwd_tables && > + (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) ) > + default_routing = FALSE; > + else > + cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr ); > > /* dump fdb into file: */ > if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) ) > - __osm_ucast_mgr_dump_tables( p_mgr ); > + { > + if ( !default_routing && p_routing_eng->ucast_dump_tables ) > + p_routing_eng->ucast_dump_tables(p_routing_eng->context); > + else > + __osm_ucast_mgr_dump_tables( p_mgr ); > + } Not sure if this is best going forward. Should it be like this: if ( default_routing ) __osm_ucast_mgr_dump_tables( p_mgr ); else { if ( p_routing_eng->ucast_dump_tables != 0 ) p_routing_eng->ucast_dump_tables(p_routing_eng->context); } -- Hal From halr at voltaire.com Wed Dec 20 04:49:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 07:49:23 -0500 Subject: [openib-general] [PATCHv2] osm: adding max_lid_ho field to osm_switch_t In-Reply-To: <4588F9AC.5040401@dev.mellanox.co.il> References: <4588F9AC.5040401@dev.mellanox.co.il> Message-ID: <1166618113.4519.40761.camel@hal.voltaire.com> On Wed, 2006-12-20 at 03:51, Yevgeny Kliteynik wrote: > Hi Hal > > [V2 of the patch - removed setter and unnecessary initialization] > > Adding max_lid_ho field to osm_switch_t to allow routing > engines that don't use lid matrices to explicitly set the > max lid (in host order) that is reachable from the switch. > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. 

-- Hal

From kliteyn at dev.mellanox.co.il Wed Dec 20 05:49:52 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 15:49:52 +0200
Subject: [openib-general] [PATCH] osm: added an option for providing dump function per routing engine
In-Reply-To: <1166617989.4519.40648.camel@hal.voltaire.com>
References: <45883F79.6090109@dev.mellanox.co.il> <1166617989.4519.40648.camel@hal.voltaire.com>
Message-ID: <45893F80.6060901@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
>
> On Tue, 2006-12-19 at 14:37, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> As you suggested, added an option for providing dump
>> function per routing engine.
>>
>> Signed-off-by: Yevgeny Kliteynik
>
> Thanks, Applied.
>
> One minor question below:
>
>> osm/include/opensm/osm_opensm.h | 4 ++++
>> osm/opensm/osm_ucast_mgr.c | 23 ++++++++++++++---------
>> 2 files changed, 18 insertions(+), 9 deletions(-)
>
> [snip...]
>
>> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
>> index e051c66..fcf6f72 100644
>> --- a/osm/opensm/osm_ucast_mgr.c
>> +++ b/osm/opensm/osm_ucast_mgr.c
>
> [snip...]
>
>> @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process(
>>      build and download the switch forwarding tables.
>>    */
>>
>> -  if (!p_routing_eng->ucast_build_fwd_tables ||
>> -      p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0)
>> -  {
>> -    cl_qmap_apply_func( p_sw_guid_tbl,
>> -                        __osm_ucast_mgr_process_tbl, p_mgr );
>> -  }
>> +  if ( p_routing_eng->ucast_build_fwd_tables &&
>> +       (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) )
>> +    default_routing = FALSE;
>> +  else
>> +    cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
>>
>>    /* dump fdb into file: */
>>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
>> -    __osm_ucast_mgr_dump_tables( p_mgr );
>> +  {
>> +    if ( !default_routing && p_routing_eng->ucast_dump_tables )
>> +      p_routing_eng->ucast_dump_tables(p_routing_eng->context);
>> +    else
>> +      __osm_ucast_mgr_dump_tables( p_mgr );
>> +  }
>
> Not sure if this is best going forward. Should it be like this:
>
>   if ( default_routing )
>     __osm_ucast_mgr_dump_tables( p_mgr );
>   else
>   {
>     if ( p_routing_eng->ucast_dump_tables != 0 )
>       p_routing_eng->ucast_dump_tables(p_routing_eng->context);
>   }

But then what if I have some routing engine that wants to use
the default dump functions, like updn?

So my approach is as follows:
 - If a routing engine wants to use the default dump functions,
   it should *not* define any dump function of its own.
 - If a routing engine does *not* want to dump anything, it
   should define a dummy dump function of its own.

You're suggesting the following:
 - If a routing engine wants to use the default dump functions,
   it should define a dump function that calls the default function.
 - If a routing engine does *not* want to dump anything, it
   should *not* define any dump function of its own.

I'm OK with both approaches - your call.

-- Yevgeny.
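
The two policies compared above can be sketched in C roughly as follows. This is only an illustration: the struct, the enum and the function names are hypothetical stand-ins, not the actual OpenSM definitions; only the decision logic follows the applied patch and the alternative Hal proposed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the routing engine callbacks discussed in
 * this thread; this is NOT the real OpenSM structure. */
typedef struct routing_engine {
    void *context;
    void (*dump_tables)(void *context);   /* optional, may be NULL */
} routing_engine_t;

typedef enum { DUMP_NONE, DUMP_DEFAULT, DUMP_ENGINE } dump_choice_t;

/* A dummy dump function: under the patch policy an engine installs
 * something like this when it wants nothing dumped at all. */
static void noop_dump(void *context)
{
    (void)context;
}

/* Policy of the applied patch: the engine's own dump function is used
 * only if the engine built the tables and provides one; in every other
 * case the default dump is used. */
static dump_choice_t choose_dump_patch(const routing_engine_t *eng,
                                       int default_routing)
{
    if (!default_routing && eng->dump_tables != NULL)
        return DUMP_ENGINE;
    return DUMP_DEFAULT;
}

/* Alternative policy: the default dump is used only with default
 * routing; an engine without a dump function dumps nothing. */
static dump_choice_t choose_dump_alt(const routing_engine_t *eng,
                                     int default_routing)
{
    if (default_routing)
        return DUMP_DEFAULT;
    return eng->dump_tables != NULL ? DUMP_ENGINE : DUMP_NONE;
}
```

The two policies differ only for an engine that built the tables itself but has no dump function: the patch falls back to the default dump, the alternative dumps nothing.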
> -- Hal > From vlad at dev.mellanox.co.il Wed Dec 20 06:22:06 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 20 Dec 2006 16:22:06 +0200 Subject: [openib-general] OFED 1.2 18-Dec meeting summary In-Reply-To: References: Message-ID: <4589470E.1020105@dev.mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: >> Meeting summary: >> *1. Daily build update:* >> Daily build is now based on kernel 2.6.20-rc1. >> > > Where is the daily build? > > Scott > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > User: http://staging.openfabrics.org/builds/ofa_1_2_user/ Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/ But I see that staging website is down for some reason... Regards, Vladimir From halr at voltaire.com Wed Dec 20 06:29:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 09:29:19 -0500 Subject: [openib-general] [PATCHv2] osm: improving FatTree routing engine In-Reply-To: <4588FA5A.1070802@dev.mellanox.co.il> References: <4588FA5A.1070802@dev.mellanox.co.il> Message-ID: <1166624959.4519.46181.camel@hal.voltaire.com> Hi Yevgeny, On Wed, 2006-12-20 at 03:54, Yevgeny Kliteynik wrote: > Hi Hal > > [V2 of the patch - not using max_lid_ho setter] > > FatTree routing engine improvemets: > 1. Improved building of LFTs > 2. Setting max lid on osm switches > 3. Using ucast manager LFT dump function > 4. Stoped using global variable 'osm' > 5. Improved logging > 6. Some cosmetics In general, it should be one "thought" per patch but since this is so new I will incorporate this all in one patch. > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. One minor comment below. 
> --- > osm/opensm/osm_ucast_ftree.c | 439 +++++++++++++++++++++++++++--------------- > 1 files changed, 281 insertions(+), 158 deletions(-) > > diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c > index 15e4cd0..0d7188a 100644 > --- a/osm/opensm/osm_ucast_ftree.c > +++ b/osm/opensm/osm_ucast_ftree.c [snip...] > +void > +ucast_mgr_dump_to_file( > + IN osm_ucast_mgr_t *p_mgr, > + IN const char *file_name, > + IN void (*func)(cl_map_item_t *, void *)); > + > +void > +ucast_mgr_dump_lfts( > + IN cl_map_item_t *p_map_item, > + void *cxt); > + Rather than declaring these here, should these go into osm_ucast_mgr.h ? -- Hal From halr at voltaire.com Wed Dec 20 06:33:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 09:33:41 -0500 Subject: [openib-general] Routing patches In-Reply-To: <4588FDB5.6080909@dev.mellanox.co.il> References: <4588FDB5.6080909@dev.mellanox.co.il> Message-ID: <1166624963.4519.46185.camel@hal.voltaire.com> Hi Yevgeny, On Wed, 2006-12-20 at 04:09, Yevgeny Kliteynik wrote: > Hi Hal. > > Just wanted to put some order in the routing-related patches. > > There are four patches that are waiting to be reviewed and applied: > > 1. Added an option for providing dump function per routing engine > 2. [v2] Adding max_lid_ho field to osm_switch_t > 3. [v2] Improving FatTree routing engine > 4. Added FatTree routing to the osm manual All completed now. Thanks. -- Hal > Thanks. > > -- Yevgeny From halr at voltaire.com Wed Dec 20 06:33:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 09:33:36 -0500 Subject: [openib-general] [PATCH] osm: Added FatTree routing to the osm manual In-Reply-To: <45885D14.4090200@dev.mellanox.co.il> References: <45885D14.4090200@dev.mellanox.co.il> Message-ID: <1166624961.4519.46183.camel@hal.voltaire.com> On Tue, 2006-12-19 at 16:43, Yevgeny Kliteynik wrote: > Added FatTree routing to the osm manual > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. 
> osm/man/opensm.8 | 8 +++++++- > 1 files changed, 7 insertions(+), 1 deletions(-) > > diff --git a/osm/man/opensm.8 b/osm/man/opensm.8 > index 316232d..225918d 100644 > --- a/osm/man/opensm.8 > +++ b/osm/man/opensm.8 > @@ -391,7 +391,7 @@ Examples: > > .SH ROUTING > .PP > -OpenSM offers two routing engines: > +OpenSM offers three routing engines: > > 1. Min Hop Algorithm - based on the minimum hops to each node where the > path length is optimized. > @@ -401,6 +401,12 @@ node, but it is constrained to ranking r > if the subnet is not a pure Fat Tree, and deadlock may occur due to a > loop in the subnet. > > +3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing > +for congestion-free "shift" communication pattern. > +It should be chosen if a subnet is a symmetrical Fat Trees of various types, > +not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio. > +Similar to UPDN, Fat Tree routing is constrained to ranking rules. Is there a reference or a more complete writeup of what it does ? See the descriptions of the other algorithms for what I'm referring to here. -- Hal > OpenSM also supports a file method which can load routes from a table. See > \'Modular Routing Engine\' for more information on this. 

From halr at voltaire.com Wed Dec 20 06:36:55 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:36:55 -0500
Subject: [openib-general] [PATCH] osm: added an option for providing dump function per routing engine
In-Reply-To: <45893F80.6060901@dev.mellanox.co.il>
References: <45883F79.6090109@dev.mellanox.co.il> <1166617989.4519.40648.camel@hal.voltaire.com> <45893F80.6060901@dev.mellanox.co.il>
Message-ID: <1166625414.4519.46593.camel@hal.voltaire.com>

Hi Yevgeny,

On Wed, 2006-12-20 at 08:49, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> >
> > On Tue, 2006-12-19 at 14:37, Yevgeny Kliteynik wrote:
> >> Hi Hal
> >>
> >> As you suggested, added an option for providing dump
> >> function per routing engine.
> >>
> >> Signed-off-by: Yevgeny Kliteynik
> >
> > Thanks, Applied.
> >
> > One minor question below:
> >
> >> osm/include/opensm/osm_opensm.h | 4 ++++
> >> osm/opensm/osm_ucast_mgr.c | 23 ++++++++++++++---------
> >> 2 files changed, 18 insertions(+), 9 deletions(-)
> >
> > [snip...]
> >
> >> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> >> index e051c66..fcf6f72 100644
> >> --- a/osm/opensm/osm_ucast_mgr.c
> >> +++ b/osm/opensm/osm_ucast_mgr.c
> >
> > [snip...]
> >
> >> @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process(
> >>      build and download the switch forwarding tables.
> >>    */
> >>
> >> -  if (!p_routing_eng->ucast_build_fwd_tables ||
> >> -      p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0)
> >> -  {
> >> -    cl_qmap_apply_func( p_sw_guid_tbl,
> >> -                        __osm_ucast_mgr_process_tbl, p_mgr );
> >> -  }
> >> +  if ( p_routing_eng->ucast_build_fwd_tables &&
> >> +       (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) )
> >> +    default_routing = FALSE;
> >> +  else
> >> +    cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
> >>
> >>    /* dump fdb into file: */
> >>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
> >> -    __osm_ucast_mgr_dump_tables( p_mgr );
> >> +  {
> >> +    if ( !default_routing && p_routing_eng->ucast_dump_tables )
> >> +      p_routing_eng->ucast_dump_tables(p_routing_eng->context);
> >> +    else
> >> +      __osm_ucast_mgr_dump_tables( p_mgr );
> >> +  }
> >
> > Not sure if this is best going forward. Should it be like this:
> >
> >   if ( default_routing )
> >     __osm_ucast_mgr_dump_tables( p_mgr );
> >   else
> >   {
> >     if ( p_routing_eng->ucast_dump_tables != 0 )
> >       p_routing_eng->ucast_dump_tables(p_routing_eng->context);
> >   }
>
> But then what if I have some routing engine that wants to use
> default dump functions, like updn?
>
> So in my approach is as follows:
> - If a routing engine wants to use default dump functions,
>   it should *not* define any dump function of its own.
> - If a routing engine does *not* want to dump anything, it
>   should define a dummy dump function of its own.
>
> You're suggesting the following:
> - If a routing engine wants to use default dump functions,
>   it should define dump function that will call default function.
> - If a routing engine does *not* want to dump anything, it
>   should *not* define any dump function of its own.
>
> I'm OK with both approaches - your call.

You're right. It's 6 of one half a dozen of another. Let's leave it
alone.

-- Hal

> -- Yevgeny.
> > > -- Hal > > > From halr at voltaire.com Wed Dec 20 06:38:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 09:38:18 -0500 Subject: [openib-general] OFED 1.2 18-Dec meeting summary In-Reply-To: <4589470E.1020105@dev.mellanox.co.il> References: <4589470E.1020105@dev.mellanox.co.il> Message-ID: <1166625497.4519.46653.camel@hal.voltaire.com> On Wed, 2006-12-20 at 09:22, Vladimir Sokolovsky wrote: > But I see that staging website is down for some reason... http/https appear to be not working. "staging" is up though. -- Hal From kliteyn at dev.mellanox.co.il Wed Dec 20 06:42:29 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 20 Dec 2006 16:42:29 +0200 Subject: [openib-general] [PATCHv2] osm: improving FatTree routing engine In-Reply-To: <1166624959.4519.46181.camel@hal.voltaire.com> References: <4588FA5A.1070802@dev.mellanox.co.il> <1166624959.4519.46181.camel@hal.voltaire.com> Message-ID: <45894BD5.7040808@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > On Wed, 2006-12-20 at 03:54, Yevgeny Kliteynik wrote: >> Hi Hal >> >> [V2 of the patch - not using max_lid_ho setter] >> >> FatTree routing engine improvemets: >> 1. Improved building of LFTs >> 2. Setting max lid on osm switches >> 3. Using ucast manager LFT dump function >> 4. Stoped using global variable 'osm' >> 5. Improved logging >> 6. Some cosmetics > > In general, it should be one "thought" per patch > but since this is so new This is the reason why all these changes are in one patch - they are not appearing in the code in an incremental manner, but rather as a bunch of changes all over the code. > I will incorporate this all in one patch. > >> Signed-off-by: Yevgeny Kliteynik > > Thanks. Applied. > > One minor comment below. 
> >> --- >> osm/opensm/osm_ucast_ftree.c | 439 +++++++++++++++++++++++++++--------------- >> 1 files changed, 281 insertions(+), 158 deletions(-) >> >> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c >> index 15e4cd0..0d7188a 100644 >> --- a/osm/opensm/osm_ucast_ftree.c >> +++ b/osm/opensm/osm_ucast_ftree.c > > [snip...] > >> +void >> +ucast_mgr_dump_to_file( >> + IN osm_ucast_mgr_t *p_mgr, >> + IN const char *file_name, >> + IN void (*func)(cl_map_item_t *, void *)); >> + >> +void >> +ucast_mgr_dump_lfts( >> + IN cl_map_item_t *p_map_item, >> + void *cxt); >> + > > Rather than declaring these here, should these go into osm_ucast_mgr.h ? I thought about it, but was reluctant to do it because osm_ucast_mgr.h contains only "important" functions. But now that the dump function is one of the routing engine capabilities, I guess you're right - it's better to declare these functions in the header file. -- Yevgeny > -- Hal > From halr at voltaire.com Wed Dec 20 06:47:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 09:47:31 -0500 Subject: [openib-general] Minor question on fat tree routing Message-ID: <1166626050.4519.47111.camel@hal.voltaire.com> Hi Yevgeny, Minor question on fat tree routing: osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code: if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0) { if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK) Should < FAT_TREE_MIN_RANK also be checked there too ? Does it fallback to default routing for this case too ? Thanks. 

-- Hal

From kliteyn at dev.mellanox.co.il Wed Dec 20 07:02:34 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 17:02:34 +0200
Subject: [openib-general] Minor question on fat tree routing
In-Reply-To: <1166626050.4519.47111.camel@hal.voltaire.com>
References: <1166626050.4519.47111.camel@hal.voltaire.com>
Message-ID: <4589508A.30901@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
>
> Minor question on fat tree routing:
>
> osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:
>
>   if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
>   {
>     if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
>
> Should < FAT_TREE_MIN_RANK also be checked there too ? Does it fallback
> to default routing for this case too ?

This is also checked, but as part of earlier checks in the same function:
FatTree routing will abort even before ranking the tree and fall back to the
default routing if a fabric has fewer than 2 switches.

-- Yevgeny

>
> Thanks.
>
> -- Hal
>

From halr at voltaire.com Wed Dec 20 07:15:31 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 10:15:31 -0500
Subject: [openib-general] Minor question on fat tree routing
In-Reply-To: <4589508A.30901@dev.mellanox.co.il>
References: <1166626050.4519.47111.camel@hal.voltaire.com> <4589508A.30901@dev.mellanox.co.il>
Message-ID: <1166627731.4519.48417.camel@hal.voltaire.com>

On Wed, 2006-12-20 at 10:02, Yevgeny Kliteynik wrote:
> Hi Hal,
>
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> >
> > Minor question on fat tree routing:
> >
> > osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:
> >
> >   if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
> >   {
> >     if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
> >
> > Should < FAT_TREE_MIN_RANK also be checked there too ? Does it fallback
> > to default routing for this case too ?
>
> This is also checked, but as part of more earlier checks in the same function:
> FatTree routing will abort even before ranking the tree and fallback to the default
> routing if a fabric has less than 2 switches.

What about 2 or more switches but rank is 1 ? Isn't that possible too ?

-- Hal

>
> -- Yevgeny
> >
> > Thanks.
> >
> > -- Hal
> >

From kliteyn at dev.mellanox.co.il Wed Dec 20 07:32:23 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 17:32:23 +0200
Subject: [openib-general] Minor question on fat tree routing
In-Reply-To: <1166627731.4519.48417.camel@hal.voltaire.com>
References: <1166626050.4519.47111.camel@hal.voltaire.com> <4589508A.30901@dev.mellanox.co.il> <1166627731.4519.48417.camel@hal.voltaire.com>
Message-ID: <45895787.1080800@dev.mellanox.co.il>

Hal Rosenstock wrote:
> On Wed, 2006-12-20 at 10:02, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> Hal Rosenstock wrote:
>>> Hi Yevgeny,
>>>
>>> Minor question on fat tree routing:
>>>
>>> osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:
>>>
>>>   if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
>>>   {
>>>     if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
>>>
>>> Should < FAT_TREE_MIN_RANK also be checked there too ? Does it fallback
>>> to default routing for this case too ?
>> This is also checked, but as part of more earlier checks in the same function:
>> FatTree routing will abort even before ranking the tree and fallback to the default
>> routing if a fabric has less than 2 switches.
>
> What about 2 or more switches but rank is 1 ? Isn't that possible too ?

2 or more switches and tree rank 1 means that all the switches are leaf
switches, which means that they are all connected directly to HCAs.
So either these switches are not connected to each other, which means that
we actually have several disconnected subnets, or they are connected to each
other, which means that they have connections within the same rank of the
tree, which is illegal and is discovered by indexing.
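
The explicit range check under discussion could be sketched in C as below. This is an illustration only: the SKETCH_MIN_RANK and SKETCH_MAX_RANK names and their values are assumptions made for the example, not the real FAT_TREE_MIN_RANK and FAT_TREE_MAX_RANK definitions in osm_ucast_ftree.c, and rank_is_supported is a hypothetical helper.

```c
#include <assert.h>

/* Assumed bounds, for illustration only; the real limits are defined
 * in osm_ucast_ftree.c and may differ. */
#define SKETCH_MIN_RANK 2u
#define SKETCH_MAX_RANK 8u

/* Returns 1 if the tree rank lies inside the supported range, 0 if the
 * caller should abandon fat-tree routing and fall back to the default
 * routing. Checking both bounds in one place documents the lower limit
 * even if earlier checks already rule it out. */
static int rank_is_supported(unsigned int rank)
{
    return rank >= SKETCH_MIN_RANK && rank <= SKETCH_MAX_RANK;
}
```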

But I agree - adding the (< FAT_TREE_MIN_RANK) check will improve readability.

-- Yevgeny

> -- Hal
>
>> -- Yevgeny
>>> Thanks.
>>>
>>> -- Hal
>>>
>

From sweitzen at cisco.com Wed Dec 20 08:29:14 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 20 Dec 2006 08:29:14 -0800
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
Message-ID:

> > Where is the daily build?
>
> User: http://staging.openfabrics.org/builds/ofa_1_2_user/
> Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/

How do I compile this daily build?

Can I get a daily build that is packaged like the release candidates,
with install.sh?

Scott

From steve.apo at googlemail.com Wed Dec 20 08:46:43 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Wed, 20 Dec 2006 16:46:43 +0000
Subject: [openib-general] RDMA to shared memory causing corruption
Message-ID: <2cfcf21e0612200846t41231b45qec26d6f9f9a01a8@mail.gmail.com>

Hi,

I need some advice on a problem I've got RDMAing some data into a shared
memory segment. Everything works great until I try to transfer a message of
294 Kbytes or larger in size. There is some management info in the top end of
the shared memory segment (we're using the Boost shm library). This management
area gets corrupted after the RDMA transfer has occurred.

I've tried various things to try and debug this: allocating more memory than
I need from the shared memory segment for the landing buffer, making the whole
shared memory segment larger, and making the management area smaller. But I'm
always hit by this 294K limit. I don't know whether it's a problem with
Boost shmem or with RDMA writing to memory areas that it shouldn't.

Any help or clues would be great.

Thanks a lot.

Steve.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From chrise at sgi.com Wed Dec 20 08:56:24 2006 From: chrise at sgi.com (Chris Elmquist) Date: Wed, 20 Dec 2006 10:56:24 -0600 Subject: [openib-general] building and running IBMgtsim? Message-ID: <20061220165624.GL31149@sgi.com> Folks, I am trying to build and run IBMgtsim so that I can explore some different topologies and system sizes. But I am having a lot of trouble getting OpenSM to work with the simulator. I pulled down Eitan's ibutils git tree (to get the simulator) and am otherwise using the OFED 1.1 tarball for the rest of the stuff. I suspect I have a problem with OpenSM not being built correctly to use the simulator. Does anyone have a recipe on how to build and install all of these pieces (ie, openib, openSM and ibmgtsim) so that they will work together? I have been just trying to run one of the tests provided with the simulator like this: % cd ~/ibutils/ibmgtsim/tests % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm but we get this sort of output: -I- Using random seed:43204 -I- Simulation directory is: /tmp/ibmgtsim.29716 -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log -I- Simulator Ready -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726 5 -I- Connected to the simulator control server -I- Defined 51 guids -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a 2} -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009 ... -I- Waiting for OpenSM subnet up ... 
-I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: ERR 5422: Unable to find requested CA guid 0x2c90000000009 -I- New 1 events of /tmp/ibmgtsim.29716/osm.log -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5 424: Unable to Open Port 0x2c90000000009 -I- New 1 events of /tmp/ibmgtsim.29716/osm.log -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed -I- New 1 events of /tmp/ibmgtsim.29716/osm.log -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR) -I- New 1 events of /tmp/ibmgtsim.29716/osm.log -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind : ERR 1A11: No previous bind -I- New 1 events of /tmp/ibmgtsim.29716/osm.log Thank you. Chris SGI Network Engineering -- Chris Elmquist mailto:chrise at sgi.com (651)683-3093 Silicon Graphics, Inc. Eagan, MN From vlad at dev.mellanox.co.il Wed Dec 20 09:27:19 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 20 Dec 2006 19:27:19 +0200 Subject: [openib-general] OFED 1.2 18-Dec meeting summary In-Reply-To: References: Message-ID: <45897277.50406@dev.mellanox.co.il> Scott Weitzenkamp (sweitzen) wrote: >>> Where is the daily build? >>> >> User: http://staging.openfabrics.org/builds/ofa_1_2_user/ >> Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/ >> > > How do I compile this daily build? > > Can I get a daily build that is packaged like the release candidates, > with install.sh? > > Scott > Download and open tgz files for user and kernel Execute ./configure ... 
(see --help)
make
make install

Example for userspace:
./configure --with-libibverbs --with-libmthca --with-libipathverbs --with-libibcm --with-libsdp --with-librdmacm --with-opensm --with-openib-diags --with-perftest --with-mstflint --with-srptools --with-ipoibtools
make
make install

Example for kernel:
./configure --with-ipoib-mod --with-sdp-mod --with-srp-mod --with-user_mad-mod --with-user_access-mod --with-mthca-mod --with-core-mod --with-addr_trans-mod
make
make install

Updated:
https://openib.org/tiki/tiki-index.php?page=OFED+1.2+release+plan+and+features

Regards,
Vladimir

From Ashish.Batwara at lsi.com Wed Dec 20 10:06:08 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Wed, 20 Dec 2006 11:06:08 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com>

Hi,
Please see the information below.

This is what I did:
/etc/init.d/openibd start
/etc/init.d/opensmd start
modprobe ib_srp

I issued the command /usr/local/ofed/sbin/ibsrpdm -c to get the
information about the target and used it in:

echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

Yes, earlier I had a SilverStorm switch which was running an SM, but now I
have taken that out and am connecting the target and host directly.

I have only one port connected between the host and the target.
The reason the link is not stable is that I am restarting and stopping it
again and again, as this does not seem to be working. I did not know the
issue until I looked at the console log, which was indicating "Got failed
path rec status -110". After seeing that I searched on Google and found
"https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html";
it seems to be a bug with 64-bit machines.
BTW, my Linux server is 64-bit.
When I hooked up 32-bit server running OFED-1.1, I see my target discovered with the same procedure. So, whole question is that what is the fix for issue "Got failed path rec status -110" on 64-bit machine. Thanks Ashish -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Tuesday, December 19, 2006 10:35 PM To: Batwara, Ashish Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org Subject: RE: [openib-general] opensm On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote: > Hi, > Please look towards the end of the attached file. What options are you starting opensm with ? What is the command line ? Also, it looks like (at least at one point) you have another SM on the subnet. What is the make (vendor) for your switch ? I see many SM port is DOWN. What is going on with this port ? Why is the physical link not LinkUp and stable ? That is the main issue and is likely why the SubnGet of NodeInfo is not being responded to. -- Hal > Thanks > Ashish > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 19, 2006 5:06 PM > To: Batwara, Ashish > Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org > Subject: Re: [openib-general] opensm > > Ashish, > > On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote: > > Hi, > > > > Here is the info that you have asked. I am seeing the Subnet manager > > is up now having the port active. But server is not able to discover > > the target. I am seeing the error "Got failed path rec status -110" on > > Linux console. > > That means the request for an SA PathRecord from the initiator to the > target failed (-110 is ETIMEDOUT). Are you sure the target is up > (ACTIVE) on the subnet ? If it is, can you send the opensm log ? > > -- Hal > > > Below are the output of different commands. 
I am using following to > > discover the target: > > > > > > > > /etc/init.d/opensmd start > > > > /etc/init.d/openibd start > > > > modprobe ib_srp > > > > echo > > > id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000 > 002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > > /sys/class/infiniband_srp/srp-mthca0-2/add_target > > > > > > > > > > > > [root at p49 ~]# ibv_devinfo > > > > hca_id: mthca0 > > > > fw_ver: 5.1.400 > > > > node_guid: 0002:c902:0022:cce0 > > > > sys_image_guid: 0002:c902:0022:cce3 > > > > vendor_id: 0x02c9 > > > > vendor_part_id: 25218 > > > > hw_ver: 0xA0 > > > > board_id: MT_0370130002 > > > > phys_port_cnt: 2 > > > > port: 1 > > > > state: PORT_DOWN (1) > > > > max_mtu: 2048 (4) > > > > active_mtu: 512 (2) > > > > sm_lid: 0 > > > > port_lid: 0 > > > > port_lmc: 0x00 > > > > > > > > port: 2 > > > > state: PORT_ACTIVE (4) > > > > max_mtu: 2048 (4) > > > > active_mtu: 2048 (4) > > > > sm_lid: 1 > > > > port_lid: 1 > > > > port_lmc: 0x00 > > hca_id: mthca1 > > > > fw_ver: 5.1.400 > > > > node_guid: 0002:c902:0022:cd2c > > > > sys_image_guid: 0002:c902:0022:cd2f > > > > vendor_id: 0x02c9 > > > > vendor_part_id: 25218 > > > > hw_ver: 0xA0 > > > > board_id: MT_0370130002 > > > > phys_port_cnt: 2 > > > > port: 1 > > > > state: PORT_DOWN (1) > > > > max_mtu: 2048 (4) > > > > active_mtu: 512 (2) > > > > sm_lid: 0 > > > > port_lid: 0 > > > > port_lmc: 0x00 > > > > > > > > port: 2 > > > > state: PORT_DOWN (1) > > > > max_mtu: 2048 (4) > > > > active_mtu: 512 (2) > > > > sm_lid: 0 > > > > port_lid: 0 > > > > port_lmc: 0x00 > > > > > > > > > > > > [root at p49 ~]# uname -a > > > > Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 > > EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > > > > > > > > [root at p49 ~]# cat /etc/infiniband/info > > > > #!/bin/bash > > > > > > > > echo prefix=/usr/local/ofed > > > > echo Kernel=2.6.9-42.0.3.ELsmp > > > > echo > > > > echo "Configure options: --with-dapl --with-ipoibtools 
--with-libibcm > > --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs > > --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm > > --with-libsdp --with-openib-diags --with-srptools --with-mstflint > > --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod > > --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod > > --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod" > > > > echo > > > > > > > > OFED Version: OFED-1.1 > > > > > > > Thanks > > > > Ashish > > > > -----Original Message----- > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > > Sent: Tuesday, December 19, 2006 5:18 AM > > To: Batwara, Ashish > > Cc: ishai at mellanox.co.il; openib-general at openib.org > > Subject: Re: [openib-general] opensm > > > > > > > > Hi Ashish, > > > > > > > > SRP people say they have no such error message. > > > > OpenSM does. So I take it back. > > > > > > > > Ashish, > > > > Please provide more into: > > > > > > > > 1. ibv_devinfo > > > > 2. Version of code you are using > > > > 3. Command line you use for starting opensm > > > > 4. /var/log/osm.log > > > > > > > > Thanks and sorry for the confusion. > > > > > > > > EZ > > > > > > > > Eitan Zahavi wrote: > > > > > This is not an OpenSM issue. > > > > > Forwarded to the SRP people. > > > > > > > > > > EZ > > > > > Batwara, Ashish wrote: > > > > > > > > > >> Hi, > > > > >> I am trying to run opensm on Linux server. It has two HCAs > > (4-ports) and > > > > >> connected to IB Switch. ibnodes command displays the information > > about > > > > >> the Switch ports and HCA ports. > > > > >> When I start opensm, I see in /var/log/messages "Starting > > srp_daemon" > > > > >> for all the 4 ports and immediately after I see "failed srp_daemon" > > for > > > > >> all the ports and the displays "SM Port is down". > > > > >> > > > > >> I tried several times and even rebooted the server few times but no > > > > >> luck. 
> > > > >> > > > > >> Does anybody know what this problem is? > > > > >> > > > > >> Thanks > > > > >> Ashish > > > > >> > > > > >> _______________________________________________ > > > > >> openib-general mailing list > > > > >> openib-general at openib.org > > > > >> http://openib.org/mailman/listinfo/openib-general > > > > >> > > > > >> To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > >> > > > > >> > > > > > > > > > > > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > openib-general at openib.org > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Wed Dec 20 10:21:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 13:21:49 -0500 Subject: [openib-general] opensm In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com> References: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com> Message-ID: <1166638908.4519.57147.camel@hal.voltaire.com> Hi, On Wed, 2006-12-20 at 13:06, Batwara, Ashish wrote: > Hi, > Please see the information below > > This is what I did: > /etc/init.d/openibd start > /etc/init.d/opensmd start From where does OpenSM get its parameters ? What are they ?
> modprobe ib_srp > > Issued the command /usr/local/ofed/sbin/ibsrpdm -c to get the > information about target and used them in > > echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target > > Yes, earlier I had a SilverStorm switch which was running SM but now I > have taken that out and directly connecting the target and host. > > I have only one port connected between the host and the target. > The reason behind link is not stable is that I am restarting and > stopping again and again, as this does not seem to be working and I did > not know the issue until I looked at the console log which was > indicating "Got failed path rec status -110" and after seeing that I > searched on Google and found that > "https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html" That's pretty old email. > it seems to be a bug with 64-bit machine. > BTW, my Linux server is 64-bit. > When I hooked up 32-bit server running OFED-1.1, I see my target > discovered with the same procedure. OpenSM has run successfully on 64 bit servers (as part of OFED 1.1). > So, whole question is that what is the fix for issue "Got failed path > rec status -110" on 64-bit machine. I'm not sure what the problem is and I'm not sufficiently familiar with building it from the OFED distribution on a 64 bit machine. -- Hal > Thanks > Ashish > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 19, 2006 10:35 PM > To: Batwara, Ashish > Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org > Subject: RE: [openib-general] opensm > > On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote: > > Hi, > > Please look towards the end of the attached file. > > What options are you starting opensm with ? What is the command line ?
> > Also, it looks like (at least at one point) you have another SM on the > subnet. What is the make (vendor) for your switch ? > > I see many SM port is DOWN. What is going on with this port ? Why is the > physical link not LinkUp and stable ? That is the main issue and is > likely why the SubnGet of NodeInfo is not being responded to. > > -- Hal > > > Thanks > > Ashish > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Tuesday, December 19, 2006 5:06 PM > > To: Batwara, Ashish > > Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org > > Subject: Re: [openib-general] opensm > > > > Ashish, > > > > On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote: > > > Hi, > > > > > > Here is the info that you have asked. I am seeing the Subnet manager > > > is up now having the port active. But server is not able to discover > > > the target. I am seeing the error "Got failed path rec status -110" > on > > > Linux console. > > > > That means the request for an SA PathRecord from the initiator to the > > target failed (-110 is ETIMEDOUT). Are you sure the target is up > > (ACTIVE) on the subnet ? If it is, can you send the opensm log ? > > > > -- Hal > > > > > Below are the output of different commands. 
I am using following to > > > discover the target: > > > > > > > > > > > > /etc/init.d/opensmd start > > > > > > /etc/init.d/openibd start > > > > > > modprobe ib_srp > > > > > > echo > > > > > > id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000 > > 002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > > > /sys/class/infiniband_srp/srp-mthca0-2/add_target > > > > > > > > > > > > > > > > > > [root at p49 ~]# ibv_devinfo > > > > > > hca_id: mthca0 > > > > > > fw_ver: 5.1.400 > > > > > > node_guid: 0002:c902:0022:cce0 > > > > > > sys_image_guid: 0002:c902:0022:cce3 > > > > > > vendor_id: 0x02c9 > > > > > > vendor_part_id: 25218 > > > > > > hw_ver: 0xA0 > > > > > > board_id: MT_0370130002 > > > > > > phys_port_cnt: 2 > > > > > > port: 1 > > > > > > state: PORT_DOWN (1) > > > > > > max_mtu: 2048 (4) > > > > > > active_mtu: 512 (2) > > > > > > sm_lid: 0 > > > > > > port_lid: 0 > > > > > > port_lmc: 0x00 > > > > > > > > > > > > port: 2 > > > > > > state: PORT_ACTIVE (4) > > > > > > max_mtu: 2048 (4) > > > > > > active_mtu: 2048 (4) > > > > > > sm_lid: 1 > > > > > > port_lid: 1 > > > > > > port_lmc: 0x00 > > > hca_id: mthca1 > > > > > > fw_ver: 5.1.400 > > > > > > node_guid: 0002:c902:0022:cd2c > > > > > > sys_image_guid: 0002:c902:0022:cd2f > > > > > > vendor_id: 0x02c9 > > > > > > vendor_part_id: 25218 > > > > > > hw_ver: 0xA0 > > > > > > board_id: MT_0370130002 > > > > > > phys_port_cnt: 2 > > > > > > port: 1 > > > > > > state: PORT_DOWN (1) > > > > > > max_mtu: 2048 (4) > > > > > > active_mtu: 512 (2) > > > > > > sm_lid: 0 > > > > > > port_lid: 0 > > > > > > port_lmc: 0x00 > > > > > > > > > > > > port: 2 > > > > > > state: PORT_DOWN (1) > > > > > > max_mtu: 2048 (4) > > > > > > active_mtu: 512 (2) > > > > > > sm_lid: 0 > > > > > > port_lid: 0 > > > > > > port_lmc: 0x00 > > > > > > > > > > > > > > > > > > [root at p49 ~]# uname -a > > > > > > Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 > > > EDT 2006 x86_64 x86_64 x86_64 
GNU/Linux > > > > > > > > > > > > [root at p49 ~]# cat /etc/infiniband/info > > > > > > #!/bin/bash > > > > > > > > > > > > echo prefix=/usr/local/ofed > > > > > > echo Kernel=2.6.9-42.0.3.ELsmp > > > > > > echo > > > > > > echo "Configure options: --with-dapl --with-ipoibtools > --with-libibcm > > > --with-libibcommon --with-libibmad --with-libibumad > --with-libibverbs > > > --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm > > > --with-libsdp --with-openib-diags --with-srptools --with-mstflint > > > --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod > > > --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod > > > --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod" > > > > > > echo > > > > > > > > > > > > OFED Version: OFED-1.1 > > > > > > > > > > > > Thanks > > > > > > Ashish > > > > > > -----Original Message----- > > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] > > > Sent: Tuesday, December 19, 2006 5:18 AM > > > To: Batwara, Ashish > > > Cc: ishai at mellanox.co.il; openib-general at openib.org > > > Subject: Re: [openib-general] opensm > > > > > > > > > > > > Hi Ashish, > > > > > > > > > > > > SRP people say they have no such error message. > > > > > > OpenSM does. So I take it back. > > > > > > > > > > > > Ashish, > > > > > > Please provide more info: > > > > > > > > > > > > 1. ibv_devinfo > > > > > > 2. Version of code you are using > > > > > > 3. Command line you use for starting opensm > > > > > > 4. /var/log/osm.log > > > > > > > > > > > > Thanks and sorry for the confusion. > > > > > > > > > > > > EZ > > > > > > > > > > > > Eitan Zahavi wrote: > > > > > > > This is not an OpenSM issue. > > > > > > > Forwarded to the SRP people. > > > > > > > > > > > > > > EZ > > > > > > > Batwara, Ashish wrote: > > > > > > > > > > > > > >> Hi, > > > > > > >> I am trying to run opensm on Linux server. It has two HCAs > > > (4-ports) and > > > > > > >> connected to IB Switch.
ibnodes command displays the information > > > about > > > > > > >> the Switch ports and HCA ports. > > > > > > >> When I start opensm, I see in /var/log/messages "Starting > > > srp_daemon" > > > > > > >> for all the 4 ports and immediately after I see "failed > srp_daemon" > > > for > > > > > > >> all the ports and the displays "SM Port is down". > > > > > > >> > > > > > > >> I tried several times and even rebooted the server few times but > no > > > > > > >> luck. > > > > > > >> > > > > > > >> Does anybody know what this problem is? > > > > > > >> > > > > > > >> Thanks > > > > > > >> Ashish > > > > > > >> > > > > > > >> _______________________________________________ > > > > > > >> openib-general mailing list > > > > > > >> openib-general at openib.org > > > > > > >> http://openib.org/mailman/listinfo/openib-general > > > > > > >> > > > > > > >> To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > openib-general mailing list > > > > > > > openib-general at openib.org > > > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > To unsubscribe, please visit > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > > > _______________________________________________ > > > openib-general mailing list > > > openib-general at openib.org > > > http://openib.org/mailman/listinfo/openib-general > > > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > From halr at voltaire.com Wed Dec 20 11:03:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 14:03:30 -0500 Subject: [openib-general] [PATCH 1/2]: OpenSM/osm_sa_informinfo.c: Fix InformInfoRecord searches Message-ID: 
<1166641409.4519.59078.camel@hal.voltaire.com> OpenSM/osm_sa_informinfo.c: Fix InformInfoRecord searches Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index 06ea90c..374b61d 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -368,8 +368,6 @@ __osm_sa_inform_info_rec_by_comp_mask( osm_port_t * p_subscriber_port; osm_physp_t * p_subscriber_physp; const osm_physp_t* p_req_physp; - osm_infr_t* p_infr_rec = NULL; - ib_inform_info_record_t inform_info_rec; osm_iir_item_t* p_rec_item; OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_inform_info_rec_by_comp_mask ); @@ -378,72 +376,58 @@ __osm_sa_inform_info_rec_by_comp_mask( comp_mask = p_ctxt->comp_mask; p_req_physp = p_ctxt->p_req_physp; - /* Both subscriber GID and enum specified */ - if ((comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) && - (comp_mask & IB_IIR_COMPMASK_ENUM)) - { - inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid; - inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum; - p_infr_rec = osm_infr_get_by_rid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec); - goto Done; - } - if (comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) { - inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid; - p_infr_rec = osm_infr_get_by_gid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec); - goto Done; + if (memcmp(&p_infr->inform_record.subscriber_gid, + &p_ctxt->subscriber_gid, + sizeof(p_infr->inform_record.subscriber_gid))) + goto Exit; } if (comp_mask & IB_IIR_COMPMASK_ENUM) { - inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum; - p_infr_rec = osm_infr_get_by_enum(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec); - goto Done; + if (p_infr->inform_record.subscriber_enum != p_ctxt->subscriber_enum) + goto Exit; } /* Implement any other needed search cases */ -Done: - if (p_infr_rec) + /* Ensure pkey is shared before returning any records */ + portguid = p_infr->inform_record.subscriber_gid.unicast.interface_id; + p_subscriber_port = 
osm_get_port_by_guid( p_rcv->p_subn, portguid ); + if ( p_subscriber_port == NULL ) { - /* Ensure pkey is shared before returning any records */ - portguid = p_infr_rec->inform_record.subscriber_gid.unicast.interface_id; - p_subscriber_port = osm_get_port_by_guid( p_rcv->p_subn, portguid ); - if ( p_subscriber_port == NULL ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_sa_inform_info_rec_by_comp_mask: ERR 430D: " - "Invalid subscriber port guid: 0x%016" PRIx64 "\n", - cl_ntoh64(portguid) ); - goto Exit; - } + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sa_inform_info_rec_by_comp_mask: ERR 430D: " + "Invalid subscriber port guid: 0x%016" PRIx64 "\n", + cl_ntoh64(portguid) ); + goto Exit; + } - /* get the subscriber InformInfo physical port */ - p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port); - /* make sure that the requester and subscriber port can access each other - according to the current partitioning. */ - if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp )) - { - osm_log( p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_sa_inform_info_rec_by_comp_mask: " - "requester and subscriber ports don't share pkey\n" ); - goto Exit; - } + /* get the subscriber InformInfo physical port */ + p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port); + /* make sure that the requester and subscriber port can access each other + according to the current partitioning. */ + if (! 
osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp )) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sa_inform_info_rec_by_comp_mask: " + "requester and subscriber ports don't share pkey\n" ); + goto Exit; + } - p_rec_item = (osm_iir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); - if( p_rec_item == NULL ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_sa_inform_info_rec_by_comp_mask: ERR 430E: " - "cl_qlock_pool_get failed\n" ); - goto Exit; - } - - memcpy((void *)&p_rec_item->rec, (void *)&p_infr_rec->inform_record, sizeof(ib_inform_info_record_t)); - cl_qlist_insert_tail( p_ctxt->p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + p_rec_item = (osm_iir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sa_inform_info_rec_by_comp_mask: ERR 430E: " + "cl_qlock_pool_get failed\n" ); + goto Exit; } + memcpy((void *)&p_rec_item->rec, (void *)&p_infr->inform_record, sizeof(ib_inform_info_record_t)); + cl_qlist_insert_tail( p_ctxt->p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + Exit: OSM_LOG_EXIT( p_rcv->p_log ); } From halr at voltaire.com Wed Dec 20 11:03:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 14:03:45 -0500 Subject: [openib-general] [PATCH 2/2] OpenSM: Eliminate no longer needed routines in osm_inform.c Message-ID: <1166641411.4519.59080.camel@hal.voltaire.com> OpenSM: Eliminate no longer needed routines in osm_inform.c Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h index 0bc8810..3e8e122 100644 --- a/osm/include/opensm/osm_inform.h +++ b/osm/include/opensm/osm_inform.h @@ -223,103 +223,6 @@ osm_infr_destroy( * Inform Record, osm_infr_construct, osm_infr_destroy *********/ -/****f* OpenSM: Inform Record/osm_infr_get_by_rid -* NAME -* osm_infr_get_by_rid -* -* DESCRIPTION -* Find a matching osm_infr_t in the subnet DB by inform_info_record RID -* -* SYNOPSIS -*/ 
-osm_infr_t* -osm_infr_get_by_rid( - IN osm_subn_t const *p_subn, - IN osm_log_t *p_log, - IN ib_inform_info_record_t* const p_inf_rec ); -/* -* PARAMETERS -* p_subn -* [in] Pointer to the subnet object -* -* p_log -* [in] Pointer to the log object -* -* p_inf_rec -* [in] Pointer to an inform_info record with the search RID -* -* RETURN -* The matching osm_infr_t -* SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy -*********/ - -/****f* OpenSM: Inform Record/osm_infr_get_by_gid -* NAME -* osm_infr_get_by_gid -* -* DESCRIPTION -* Find a matching osm_infr_t in the subnet DB by inform_info_record -* subscriber GID -* -* SYNOPSIS -*/ -osm_infr_t* -osm_infr_get_by_gid( - IN osm_subn_t const *p_subn, - IN osm_log_t *p_log, - IN ib_inform_info_record_t* const p_inf_rec ); -/* -* PARAMETERS -* p_subn -* [in] Pointer to the subnet object -* -* p_log -* [in] Pointer to the log object -* -* p_inf_rec -* [in] Pointer to an inform_info record with the search -* subscriber GID -* -* RETURN -* The matching osm_infr_t -* SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy -*********/ - -/****f* OpenSM: Inform Record/osm_infr_get_by_enum -* NAME -* osm_infr_get_by_enum -* -* DESCRIPTION -* Find a matching osm_infr_t in the subnet DB by inform_info_record -* subscriber enum -* -* SYNOPSIS -*/ -osm_infr_t* -osm_infr_get_by_enum( - IN osm_subn_t const *p_subn, - IN osm_log_t *p_log, - IN ib_inform_info_record_t* const p_inf_rec ); -/* -* PARAMETERS -* p_subn -* [in] Pointer to the subnet object -* -* p_log -* [in] Pointer to the log object -* -* p_inf_rec -* [in] Pointer to an inform_info record with the search -* subscriber enum -* -* RETURN -* The matching osm_infr_t -* SEE ALSO -* Inform Record, osm_infr_construct, osm_infr_destroy -*********/ - /****f* OpenSM: Inform Record/osm_infr_get_by_rec * NAME * osm_infr_get_by_rec diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c index 074a3f9..98b7ec4 100644 --- a/osm/opensm/osm_inform.c +++ 
b/osm/opensm/osm_inform.c @@ -117,148 +117,6 @@ osm_infr_new( } /********************************************************************** - * Match an infr by the RID of the stored inform_info_record - **********************************************************************/ -static cl_status_t -__match_rid_of_inf_rec( - IN const cl_list_item_t* const p_list_item, - IN void* context ) -{ - ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t *)context; - osm_infr_t* p_infr = (osm_infr_t*)p_list_item; - int32_t count; - - count = memcmp( - &p_infr->inform_record, - p_infr_rec, - sizeof(p_infr_rec->subscriber_gid) + - sizeof(p_infr_rec->subscriber_enum) ); - - if(count == 0) - return CL_SUCCESS; - else - return CL_NOT_FOUND; -} - -/********************************************************************** - * Match an infr by the subscriber GID of the stored inform_info_record - **********************************************************************/ -static cl_status_t -__match_gid_of_inf_rec( - IN const cl_list_item_t* const p_list_item, - IN void* context ) -{ - ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t *)context; - osm_infr_t* p_infr = (osm_infr_t*)p_list_item; - int32_t count; - - count = memcmp( - &p_infr->inform_record, - p_infr_rec, - sizeof(p_infr_rec->subscriber_gid) ); - - if(count == 0) - return CL_SUCCESS; - else - return CL_NOT_FOUND; -} - -/********************************************************************** - * Match an infr by the subscriber enum of the stored inform_info_record - **********************************************************************/ -static cl_status_t -__match_enum_of_inf_rec( - IN const cl_list_item_t* const p_list_item, - IN void* context ) -{ - ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t *)context; - osm_infr_t* p_infr = (osm_infr_t*)p_list_item; - int32_t count; - - count = memcmp( - &p_infr->inform_record.subscriber_enum, - &p_infr_rec->subscriber_enum, - 
sizeof(p_infr_rec->subscriber_enum) ); - - if(count == 0) - return CL_SUCCESS; - else - return CL_NOT_FOUND; -} - -/********************************************************************** - **********************************************************************/ -osm_infr_t* -osm_infr_get_by_rid( - IN osm_subn_t const *p_subn, - IN osm_log_t *p_log, - IN ib_inform_info_record_t* const p_infr_rec ) -{ - cl_list_item_t* p_list_item; - - OSM_LOG_ENTER( p_log, osm_infr_get_by_rid ); - - p_list_item = cl_qlist_find_from_head( - &p_subn->sa_infr_list, - __match_rid_of_inf_rec, - p_infr_rec ); - - if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) ) - p_list_item = NULL; - - OSM_LOG_EXIT( p_log ); - return (osm_infr_t*)p_list_item; -} - -/********************************************************************** - **********************************************************************/ -osm_infr_t* -osm_infr_get_by_gid( - IN osm_subn_t const *p_subn, - IN osm_log_t *p_log, - IN ib_inform_info_record_t* const p_infr_rec ) -{ - cl_list_item_t* p_list_item; - - OSM_LOG_ENTER( p_log, osm_infr_get_by_gid ); - - p_list_item = cl_qlist_find_from_head( - &p_subn->sa_infr_list, - __match_gid_of_inf_rec, - p_infr_rec ); - - if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) ) - p_list_item = NULL; - - OSM_LOG_EXIT( p_log ); - return (osm_infr_t*)p_list_item; -} - -/********************************************************************** - **********************************************************************/ -osm_infr_t* -osm_infr_get_by_enum( - IN osm_subn_t const *p_subn, - IN osm_log_t *p_log, - IN ib_inform_info_record_t* const p_infr_rec ) -{ - cl_list_item_t* p_list_item; - - OSM_LOG_ENTER( p_log, osm_infr_get_by_enum ); - - p_list_item = cl_qlist_find_from_head( - &p_subn->sa_infr_list, - __match_enum_of_inf_rec, - p_infr_rec ); - - if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) ) - p_list_item = NULL; - - OSM_LOG_EXIT( p_log ); - return 
(osm_infr_t*)p_list_item; -} - -/********************************************************************** **********************************************************************/ void __dump_all_informs( From swise at opengridcomputing.com Wed Dec 20 11:17:54 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:17:54 -0600 Subject: [openib-general] [PATCH v5 00/13] iw_cxgb3 - Chelsio T3 RDMA Driver Message-ID: <20061220191754.19316.4914.stgit@dell3.ogc.int> Roland, I think this is ready to go once the ethernet driver is pulled in by Jeff. Also: I'm gone after today returning Wednesday Jan 3rd. I'll address any new issues when I return. Cheers! Steve. ---- Version 5 changes: - BugFix: fixed broken endpoint state serialization - Merged up to linus's tree as of 12/18/2006 (2.6.20-rc1) - Removed all blank characters at the end of lines The following series implements the Chelsio T3 iWARP/RDMA Driver to be considered for inclusion in 2.6.20. It depends on the Chelsio T3 Ethernet driver which is also under review now for 2.6.20. The latest Chelsio T3 Ethernet driver patch can be pulled from: http://service.chelsio.com/kernel.org/cxgb3.patch.bz2 This T3 iWARP/RDMA Driver patch series can be pulled from: http://www.opengridcomputing.com/downloads/iw_cxgb3_patches_v5.tar.bz2 A complete GIT kernel tree with all the T3 drivers can be pulled from: git://staging.openfabrics.org/~swise/cxgb3.git From swise at opengridcomputing.com Wed Dec 20 11:18:24 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:18:24 -0600 Subject: [openib-general] [PATCH v5 01/13] iw_cxgb3 Linux RDMA Core Changes In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220191824.19316.93248.stgit@dell3.ogc.int> Support provider-specific data in ib_uverbs_cmd_req_notify_cq(). 
The Chelsio iwarp provider library needs to pass information to the kernel verb for re-arming the CQ. Signed-off-by: Steve Wise --- drivers/infiniband/core/uverbs_cmd.c | 9 +++++++-- drivers/infiniband/hw/amso1100/c2.h | 2 +- drivers/infiniband/hw/amso1100/c2_cq.c | 3 ++- drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 ++- drivers/infiniband/hw/ehca/ehca_reqs.c | 3 ++- drivers/infiniband/hw/ipath/ipath_cq.c | 4 +++- drivers/infiniband/hw/ipath/ipath_verbs.h | 3 ++- drivers/infiniband/hw/mthca/mthca_cq.c | 6 ++++-- drivers/infiniband/hw/mthca/mthca_dev.h | 4 ++-- include/rdma/ib_verbs.h | 5 +++-- 10 files changed, 28 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 743247e..5dd1de9 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_udata udata; struct ib_cq *cq; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i if (!cq) return -EINVAL; - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + INIT_UDATA(&udata, buf + sizeof cmd, 0, + in_len - sizeof cmd, 0); + + cq->device->req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP, + &udata); put_cq_read(cq); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h index 04a9db5..9a76869 100644 --- a/drivers/infiniband/hw/amso1100/c2.h +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata); /* CM */ extern int c2_llp_connect(struct iw_cm_id *cm_id, diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 05c9154..7ce8bca 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n return npolled; } -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct c2_mq_shared __iomem *shared; struct c2_cq *cq; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 3720e30..566b30c 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata); struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index b46bda1..3ed6992 100644 --- 
a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -634,7 +634,8 @@ poll_cq_exit0: return ret; } -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 87462e0..27ba4db 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq) * ipath_req_notify_cq - change the notification type for a completion queue * @ibcq: the completion queue * @notify: the type of notification to request + * @udata: user data * * Returns 0 for success. * * This may be called from interrupt context. Also called by * ib_req_notify_cq() in the generic verbs code. */ -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); unsigned long flags; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index c0c8d5b..7db01ae 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_ int ipath_destroy_cq(struct ib_cq *ibcq); -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata); int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 283d50b..15cbd49 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -722,7 +722,8 @@ 
repoll: return err == 0 || err == -EAGAIN ? npolled : err; } -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, + struct ib_udata *udata) { __be32 doorbell[2]; @@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, return 0; } -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct mthca_cq *cq = to_mcq(ibcq); __be32 doorbell[2]; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index fe5cecf..6b9ccf6 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 0bfa332..4dc771f 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -986,7 +986,8 @@ struct ib_device { struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); int (*req_notify_cq)(struct ib_cq *cq, - enum ib_cq_notify cq_notify); + enum ib_cq_notify cq_notify, + struct ib_udata *udata); int (*req_ncomp_notif)(struct ib_cq *cq, int wc_cnt); struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, @@ -1420,7 +1421,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_ static inline int ib_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) { - return 
cq->device->req_notify_cq(cq, cq_notify); + return cq->device->req_notify_cq(cq, cq_notify, NULL); } /** From swise at opengridcomputing.com Wed Dec 20 11:18:54 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:18:54 -0600 Subject: [openib-general] [PATCH v5 02/13] iw_cxgb3 Device Discovery and ULLD Linkage In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220191854.19316.18353.stgit@dell3.ogc.int> Code to discover all the T3 devices and register them with the T3 RDMA Core and the Linux RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 189 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 175 +++++++++++++++++++++++++++++++++ 2 files changed, 364 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c new file mode 100644 index 0000000..0c95f2c --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -0,0 +1,189 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define DRV_VERSION "1.1" + +MODULE_AUTHOR("Boyd Faulkner, Steve Wise"); +MODULE_DESCRIPTION("Chelsio T3 RDMA Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; + +static void open_rnic_dev(struct t3cdev *); +static void close_rnic_dev(struct t3cdev *); + +struct cxgb3_client t3c_client = { + .name = "iw_cxgb3", + .add = open_rnic_dev, + .remove = close_rnic_dev, + .handlers = t3c_handlers, + .redirect = iwch_ep_redirect +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(dev_mutex); + +static void rnic_init(struct iwch_dev *rnicp) +{ + PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); + idr_init(&rnicp->cqidr); + idr_init(&rnicp->qpidr); + idr_init(&rnicp->mmidr); + spin_lock_init(&rnicp->lock); + + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL 
<< 24) - 1; + rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev); + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 8; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1; + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + rnicp->attr.cq_overflow_detection = 1; + return; +} + +static void open_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + static int vers_printed; + + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + if (!vers_printed++) + printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + DRV_VERSION); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR MOD "Cannot allocate ib device\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + printk(KERN_ERR MOD "Unable to open CXIO rdev\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + rnic_init(rnicp); + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR MOD "Unable to register device\n"); + close_rnic_dev(tdev); + } + printk(KERN_INFO MOD "Initialized device %s\n", + pci_name(rnicp->rdev.rnic_info.pdev)); + return; +} + +static void close_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + mutex_lock(&dev_mutex); + 
list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + idr_destroy(&dev->cqidr); + idr_destroy(&dev->qpidr); + idr_destroy(&dev->mmidr); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + cxgb3_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + cxgb3_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..8b11198 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include +#include +#include +#include + +#include + +#include "cxio_hal.h" +#include "cxgb3_offload.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. + * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. 
+ */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct idr cqidr; + struct idr qpidr; + struct idr mmidr; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +static inline int t3b_device(const struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3B); +} + +static inline int t3a_device(const struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3A); +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid) +{ + return idr_find(&rhp->cqidr, cqid); +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid) +{ + return idr_find(&rhp->qpidr, qpid); +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) +{ + return idr_find(&rhp->mmidr, mmid); +} + +static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, + void *handle, u32 id) +{ + int ret; + u32 newid; + + do { + if (!idr_pre_get(idr, GFP_KERNEL)) { + return -ENOMEM; + } + spin_lock_irq(&rhp->lock); + ret = idr_get_new_above(idr, handle, id, &newid); + BUG_ON(newid != id); + spin_unlock_irq(&rhp->lock); + } while (ret == -EAGAIN); + + return ret; +} + +static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id) +{ + spin_lock_irq(&rhp->lock); + idr_remove(idr, id); + spin_unlock_irq(&rhp->lock); +} + +extern struct cxgb3_client t3c_client; +extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif From swise at opengridcomputing.com Wed 
Dec 20 11:19:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:19:25 -0600 Subject: [openib-general] [PATCH v5 03/13] iw_cxgb3 Provider Methods and Data Structures In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220191925.19316.38974.stgit@dell3.ogc.int> Provider methods to support the Linux RDMA verbs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 363 ++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 68 ++ 3 files changed, 1602 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..ab99202 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1171 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + struct iwch_dev *rhp = to_iwch_dev(context->device); + struct iwch_ucontext *ucontext = to_iwch_ucontext(context); + PDBG("%s context %p\n", __FUNCTION__, context); + cxio_release_ucontext(&rhp->rdev, &ucontext->uctx); + kfree(ucontext); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext 
*context; + struct iwch_dev *rhp = to_iwch_dev(ibdev); + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + cxio_init_ucontext(&rhp->rdev, &context->uctx); + INIT_LIST_HEAD(&context->mmaps); + spin_lock_init(&context->mmap_lock); + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq); + chp = to_iwch_cq(ib_cq); + + remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid); + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + if (t3a_device(rhp)) { + + /* + * T3A: Add some fluff to handle extra CQEs inserted + * for various errors. + * Additional CQE possibilities: + * TERMINATE, + * incoming RDMA WRITE Failures + * incoming RDMA READ REQUEST FAILUREs + * NOTE: We cannot ensure the CQ won't overflow. 
+ */ + entries += 16; + } + entries = roundup_pow_of_two(entries); + chp->cq.size_log2 = ilog2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid); + + if (context) { + struct iwch_mm_entry *mm; + + mm = kmalloc(sizeof *mm, GFP_KERNEL); + if (!mm) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-ENOMEM); + } + uresp.cqid = chp->cq.cqid; + uresp.size_log2 = chp->cq.size_log2; + uresp.physaddr = virt_to_phys(chp->cq.queue); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm); + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + mm->addr = uresp.physaddr; + mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * + sizeof (struct t3_cqe)); + insert_mmap(to_iwch_ucontext(context), mm); + } + PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = ilog2(cqe); + + /* Dont allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + unsigned long flag; + struct iwch_req_notify_cq ucmd; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + if (udata && t3b_device(rhp)) { + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + spin_lock_irqsave(&chp->lock, flag); + chp->cq.rptr = ucmd.rptr; + } else + spin_lock_irqsave(&chp->lock, flag); + PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flag); + if (err) + printk(KERN_ERR MOD "Error %d 
rearming CQID 0x%x\n", err, + chp->cq.cqid); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT; + struct cxio_rdev *rdev_p; + int ret = 0; + struct iwch_mm_entry *mm; + struct iwch_ucontext *ucontext; + + PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, + pgaddr, len); + + if (vma->vm_start & (PAGE_SIZE-1)) { + return -EINVAL; + } + + rdev_p = &(to_iwch_dev(context->device)->rdev); + ucontext = to_iwch_ucontext(context); + + mm = remove_mmap(ucontext, pgaddr, len); + if (!mm) + return -EINVAL; + kfree(mm); + + if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && + (pgaddr < (rdev_p->rnic_info.udbell_physbase + + rdev_p->rnic_info.udbell_len))) { + + /* + * Map T3 DB register. + */ + if (vma->vm_flags & VM_READ) { + return -EPERM; + } + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; + vma->vm_flags &= ~VM_MAYREAD; + ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } else { + + /* + * Map WQ or CQ contig dma memory... 
+ */ + ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } + + return ret; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + + php = to_iwch_pd(pd); + rhp = php->rhp; + PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid); + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + u32 mmid; + + PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr); + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mmid = mhp->attr.stag >> 8; + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, + mhp->attr.pbl_addr); + remove_handle(rhp, &rhp->mmidr, mmid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); + kfree(mhp); + return 0; +} + +static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, + u64 *iova_start) +{ + __be64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; 
+ struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* NOTE: TPT perms are backwards from BIND WR perms! */ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + __be64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd); + + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + 
+ memcpy(&mh, mhp, sizeof *mhp); + + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + __be64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + struct iwch_reg_user_mr_resp uresp; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, &region->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, &region->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= 
(acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + + if (udata && t3b_device(rhp)) { + uresp.pbl_addr = (mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3; + PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, + uresp.pbl_addr); + + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_dereg_mr(&mhp->ibmr); + err = -EFAULT; + goto err; + } + } + + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + + /* + * T3 only supports 32 bits of size. + */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + u32 mmid; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + mmid = (mw->rkey) >> 8; + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + remove_handle(rhp, &rhp->mmidr, mmid); + 
kfree(mhp); + PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + struct iwch_ucontext *ucontext; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) + : NULL; + cxio_destroy_qp(&rhp->rdev, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx); + + PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, + ib_qp, qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + struct iwch_ucontext *ucontext; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize > T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * NOTE: The SQ and 
total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. + */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = ilog2(wqsize); + qhp->wq.rq_size_log2 = ilog2(rqsize); + qhp->wq.sq_size_log2 = ilog2(sqsize); + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, + ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - These don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... 
I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid); + + if (udata) { + + struct iwch_mm_entry *mm1, *mm2; + + mm1 = kmalloc(sizeof *mm1, GFP_KERNEL); + if (!mm1) { + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + mm2 = kmalloc(sizeof *mm2, GFP_KERNEL); + if (!mm2) { + kfree(mm1); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + uresp.qpid = qhp->wq.qpid; + uresp.size_log2 = qhp->wq.size_log2; + uresp.sq_size_log2 = qhp->wq.sq_size_log2; + uresp.rq_size_log2 = qhp->wq.rq_size_log2; + uresp.physaddr = virt_to_phys(qhp->wq.queue); + uresp.doorbell = qhp->wq.udb; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm1); + kfree(mm2); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + mm1->addr = uresp.physaddr; + mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr)); + insert_mmap(ucontext, mm1); + mm2->addr = uresp.doorbell & PAGE_MASK; + mm2->len = PAGE_SIZE; + insert_mmap(ucontext, mm2); + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("%s sq_num_entries %d, rq_num_entries %d " + "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n", + __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries, + qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* 
Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? + (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn); + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s ibdev %p, port %d, index %d, gid %p\n", + __FUNCTION__, ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + + dev 
= to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + 
lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << 
IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + 
dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + return 0; +bail2: + ib_unregister_device(&dev->ibdev); +bail1: + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..f339427 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,363 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 +}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u32 scq; + u32 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. 
*/ + u8 enable_bind; + u8 enable_mmid0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If the current state is specified, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u32 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +struct iwch_ucontext { + struct ib_ucontext ibucontext; + struct cxio_ucontext uctx; + spinlock_t mmap_lock; + struct list_head mmaps; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +struct iwch_mm_entry { + struct list_head entry; + u64 addr; + unsigned len; +}; + +static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext, + u64 addr, unsigned len) +{ + struct list_head *pos, *nxt; + struct iwch_mm_entry *mm; + + spin_lock_irq(&ucontext->mmap_lock); + list_for_each_safe(pos, nxt, &ucontext->mmaps) { + + mm = list_entry(pos, struct iwch_mm_entry, entry); + if (mm->addr == addr && mm->len == len) { + list_del_init(&mm->entry); + spin_unlock_irq(&ucontext->mmap_lock); + PDBG("%s addr 0x%llx len 
%d\n", __FUNCTION__, mm->addr, + mm->len); + return mm; + } + } + spin_unlock_irq(&ucontext->mmap_lock); + return NULL; +} + +static inline void insert_mmap(struct iwch_ucontext *ucontext, + struct iwch_mm_entry *mm) +{ + spin_lock_irq(&ucontext->mmap_lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len); + list_add_tail(&mm->entry, &ucontext->mmaps); + spin_unlock_irq(&ucontext->mmap_lock); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + 
IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_mmid_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg); +int iwch_register_device(struct iwch_dev *dev); +void iwch_unregister_device(struct iwch_dev *dev); +int iwch_quiesce_qps(struct iwch_cq *chp); +int iwch_resume_qps(struct iwch_cq *chp); +void stop_read_rep_timer(struct iwch_qp *qhp); +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 
*page_list); +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages); +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list); + + +#define IWCH_NODE_DESC "cxgb3 Chelsio Communications" + +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..4e4b9c9 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u64 physaddr; + __u32 cqid; + __u32 size_log2; +}; + +struct iwch_create_qp_resp { + __u64 physaddr; + __u64 doorbell; + __u32 qpid; + __u32 size_log2; + __u32 sq_size_log2; + __u32 rq_size_log2; +}; + +struct iwch_reg_user_mr_resp { + __u32 pbl_addr; +}; + +struct iwch_req_notify_cq { + __u32 rptr; +}; +#endif

From swise at opengridcomputing.com Wed Dec 20 11:19:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:19:55 -0600
Subject: [openib-general] [PATCH v5 04/13] iw_cxgb3 Connection Manager
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220191955.19316.52717.stgit@dell3.ogc.int>

This code implements the iWARP CM provider methods for the Chelsio driver. The Chelsio ULLD is used to set up and tear down TCP connections, and the T3 RDMA Core is used to move the connections in and out of RDMA mode.
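For readers skimming the patch, one small piece of the connection setup is the effective MSS that set_emss() in iwch_cm.c derives once the TCP options are known. The sketch below is a user-space restatement of that arithmetic only, not the driver code: the name compute_emss and the constant names are hypothetical, and reading the 40 as IP plus TCP header bytes and the 12 as the timestamp option is my interpretation of the numbers in the patch.

```c
/*
 * Hypothetical user-space sketch (not part of the patch). It mirrors the
 * arithmetic in set_emss(): start from the MTU selected for the connection,
 * subtract 40 bytes, subtract 12 more when TCP timestamps were negotiated,
 * and never go below 128.
 */
enum {
	HDR_BYTES = 40,		/* assumed: IPv4 + TCP headers without options */
	TSTAMP_BYTES = 12,	/* assumed: TCP timestamp option, padded */
	EMSS_FLOOR = 128	/* lower bound applied by set_emss() */
};

/* Return the effective MSS for an MTU and a "timestamps in use" flag. */
int compute_emss(int mtu, int tstamp_negotiated)
{
	int emss = mtu - HDR_BYTES;

	if (tstamp_negotiated)
		emss -= TSTAMP_BYTES;
	if (emss < EMSS_FLOOR)
		emss = EMSS_FLOOR;
	return emss;
}
```

With a 1500-byte MTU this gives 1460 without timestamps and 1448 with them. In the driver the MTU itself comes from the adapter's MTU table, indexed by the MSS option field, which this sketch does not model.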
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2077 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 223 ++++ drivers/infiniband/hw/cxgb3/tcb.h | 603 ++++++++++ 3 files changed, 2903 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..69fcb59 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2077 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "tcb.h" +#include "cxgb3_offload.h" +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. (default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static int snd_win = 512 * 1024; +module_param(snd_win, int, 0444); +MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)"); + +static unsigned int nocong = 1; +module_param(nocong, uint, 0444); +MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)"); + +static void process_work(struct work_struct *work); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work); + +static struct sk_buff_head rxq; +static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, 
ep); + if (timer_pending(&ep->timer)) { + PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + } else + get_ep(&ep->com); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + put_ep(&ep->com); +} + +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) +{ + struct cpl_tid_release *req; + + skb = get_skb(skb, sizeof *req, GFP_KERNEL); + if (!skb) + return; + req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); + skb->priority = CPL_PRIORITY_SETUP; + tdev->send(tdev, skb); + return; +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) + return -ENOMEM; + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, 
ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) + ep->emss -= 12; + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static inline void __state_set(struct iwch_ep_common *epc, + enum iwch_ep_state new) +{ + epc->state = new; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], states[new]); + __state_set(epc, new); + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + kref_init(&epc->kref); + spin_lock_init(&epc->lock); + init_waitqueue_head(&epc->waitq); + } + PDBG("%s alloc ep %p\n", __FUNCTION__, epc); + return (void *) epc; +} + +void __free_ep(struct kref *kref) +{ + struct iwch_ep_common *epc; + epc = container_of(kref, struct iwch_ep_common, kref); + PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]); + kfree(epc); +} + +static void release_ep_resources(struct iwch_ep *ep) +{ + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + if 
(ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, ep->hwtid, NULL); + put_ep(&ep->com); +} + +static void process_work(struct work_struct *work) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + put_ep((struct iwch_ep_common *)ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip, + __be32 peer_ip, __be16 local_port, + __be16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + if (ip_route_output_flow(&rt, &fl, NULL, 0)) + return NULL; + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + req->cmd = CPL_ABORT_NO_RST; + cxgb3_ofld_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + 
return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) + memcpy(mpa->private_data, pdata, plen); + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + cxgb3_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, gfp); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); +
memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p tid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, + ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + if (state_read(&ep->parent_ep->com) != DEAD) 
+ ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + put_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state has + * changed and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) != MPA_REQ_SENT) + return; + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * copy the new data into our accumulation buffer. 
+ */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * if we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + state_set(&ep->com, FPDU_MODE); + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 
1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) + goto out; +err: + abort_connection(ep, skb, GFP_KERNEL); +out: + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state has + * changed and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) != MPA_REQ_WAIT) + return; + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb, GFP_KERNEL); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) + return; + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA Header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb, GFP_KERNEL); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb, GFP_KERNEL); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb, GFP_KERNEL); + return; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb, GFP_KERNEL); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) + return; + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s Unexpected streaming data." 
+ " ep %p state %d tid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* + * The ep will timeout and inform the ULP of the failure. + * See ep_timeout(). + */ + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us it's just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + + if (credits == 0) + return CPL_RET_BUF_DONE; + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (!ep->com.rpl_err) { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + close_complete_upcall(ep); + state_set(&ep->com, DEAD); + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev,
struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status, + status2errno(rpl->status)); + connect_reply_upcall(ep, status2errno(rpl->status)); + state_set(&ep->com, DEAD); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, GET_TID(rpl), NULL); + cxgb3_free_atid(ep->com.tdev, ep->atid); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, + rpl->status, status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, 
ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, 
__be32 peer_ip, + struct sk_buff *skb) +{ + PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, + peer_ip); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(struct cpl_tid_release)); + skb_get(skb); + + if (tdev->type == T3B) + release_tid(tdev, hwtid, skb); + else { + struct cpl_pass_accept_rpl *rpl; + + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, + hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); + } +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + get_ep(&parent_ep->com); + child_ep->parent_ep = parent_ep; + child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void 
*ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + unsigned long flags; + int disconnect = 1; + int release = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + dst_confirm(ep->dst); + + spin_lock_irqsave(&ep->com.lock, flags); + switch (ep->com.state) { + case MPA_REQ_WAIT: + __state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + __state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. + */ + __state_set(&ep->com, CLOSING); + get_ep(&ep->com); + break; + case MPA_REP_SENT: + __state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + __state_set(&ep->com, CLOSING); + attrs.next_state = IWCH_QP_STATE_CLOSING; + iwch_modify_qp(ep->com.qp->rhp, ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, &attrs, 1); + peer_close_upcall(ep); + break; + case ABORTING: + disconnect = 0; + break; + case CLOSING: + start_ep_timer(ep); + __state_set(&ep->com, MORIBUND); + disconnect = 0; + break; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, &attrs, 1); + } + close_complete_upcall(ep); + __state_set(&ep->com, DEAD); + release = 1; + disconnect = 0; + break; + case DEAD: + disconnect = 0; + break; + default: + BUG_ON(1); + } + 
spin_unlock_irqrestore(&ep->com.lock, flags); + if (disconnect) + iwch_ep_disconnect(ep, 0, GFP_KERNEL); + if (release) + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +/* + * Returns whether an ABORT_REQ_RSS message is a negative advice. + */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + get_ep(&ep->com); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + put_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) { + state_set(&ep->com, DEAD); + release_ep_resources(ep); + } + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + unsigned long flags; + int release = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + spin_lock_irqsave(&ep->com.lock, flags); + switch (ep->com.state) { + case CLOSING: + start_ep_timer(ep); + __state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } 
+ close_complete_upcall(ep); + __state_set(&ep->com, DEAD); + release = 1; + break; + case DEAD: + default: + BUG_ON(1); + break; + } + spin_unlock_irqrestore(&ep->com.lock, flags); + if (release) + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumer consumption. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_rdma_ec_status *rep = cplhdr(skb); + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, + rep->status); + if (rep->status) { + struct iwch_qp_attributes attrs; + + printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n", + __FUNCTION__, ep->hwtid); + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + abort_connection(ep, NULL, GFP_KERNEL); + } + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + unsigned long flags; + + spin_lock_irqsave(&ep->com.lock, flags); + PDBG("%s 
ep %p tid %u state %d\n", __FUNCTION__, ep, ep->hwtid, + ep->com.state); + switch (ep->com.state) { + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ETIMEDOUT); + break; + case MPA_REQ_WAIT: + break; + case MORIBUND: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + break; + default: + BUG(); + } + __state_set(&ep->com, CLOSING); + spin_unlock_irqrestore(&ep->com.lock, flags); + abort_connection(ep, NULL, GFP_ATOMIC); + put_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) + abort_connection(ep, NULL, GFP_KERNEL); + else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_qp *qp = get_qhp(h, conn_param->qpn); + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->com) == DEAD) { + put_ep(&ep->com); + return -ECONNRESET; + } + + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + BUG_ON(!qp); + + if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) || + (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) { + abort_connection(ep, NULL, GFP_KERNEL); + return -EINVAL; + } + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = qp; + + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird 
%d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord); + get_ep(&ep->com); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL, GFP_KERNEL); + put_ep(&ep->com); + return err; + } + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL, GFP_KERNEL); + } else { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + put_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, + ep->com.qp, cm_id); + + /* + * Allocate an active TID to initiate a TCP connection. 
+ */ + ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, ep->dst->neighbour, + ep->dst->neighbour->dev); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) + goto out; + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + cxgb3_free_atid(ep->com.tdev, ep->atid); +fail2: + put_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + + might_sleep(); + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) + goto fail3; + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + cxgb3_free_stid(ep->com.tdev, ep->stid); +fail2: + put_ep(&ep->com); +fail1: +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + cxgb3_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + put_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret=0; + unsigned long flags; + int close = 0; + + spin_lock_irqsave(&ep->com.lock, flags); + + PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, + states[ep->com.state], abrupt); + + if (ep->com.state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + goto out; + } + + if (abrupt) { + if (ep->com.state != ABORTING) { + ep->com.state = ABORTING; + close = 1; + } + goto out; + } + + switch (ep->com.state) { + case MPA_REQ_WAIT: + case MPA_REQ_SENT: + case MPA_REQ_RCVD: + case MPA_REP_SENT: + case FPDU_MODE: + ep->com.state = CLOSING; + close = 1; + break; + case CLOSING: + start_ep_timer(ep); + ep->com.state = MORIBUND; + close = 1; + break; + case MORIBUND: + break; + default: + BUG(); + break; + } +out: + spin_unlock_irqrestore(&ep->com.lock, flags); + if (close) { + if (abrupt) + ret = send_abort(ep, NULL, gfp); + else + ret = send_halfclose(ep, gfp); + } + return ret; +} + 
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, + l2t); + dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + ep->dst = new; + return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + get_ep(epc); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..893f9d0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,223 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include +#include + +#include +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! 
*/ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define put_ep(ep) { \ + PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_put(&((ep)->kref), __free_ep); \ +} + +#define get_ep(ep) { \ + PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \ + ep, atomic_read(&((ep)->kref.refcount))); \ + kref_get(&((ep)->kref)); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + __be16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + __be16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_layers_types { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct 
iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + struct kref kref; + spinlock_t lock; + struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535< References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192025.19316.13831.stgit@dell3.ogc.int> Code to manipulate the QP. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++ 1 files changed, 1007 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..ad044bd --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,1007 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + u32 plen; + + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved[0] = 0; + wqe->send.reserved[1] = 0; + wqe->send.reserved[2] = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = __constant_cpu_to_be32(0); + wqe->send.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 5; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + wqe->send.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int i; + u32 plen; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->write.rdmaop = T3_RDMA_WRITE; + wqe->write.reserved[0] = 0; + wqe->write.reserved[1] = 0; + wqe->write.reserved[2] = 0; + 
wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey); + wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr); + + if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + plen = 4; + wqe->write.sgl[0].stag = wr->imm_data; + wqe->write.sgl[0].len = __constant_cpu_to_be32(0); + wqe->write.num_sgle = __constant_cpu_to_be32(0); + *flit_cnt = 6; + } else { + plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((plen + wr->sg_list[i].length) < plen) { + return -EMSGSIZE; + } + plen += wr->sg_list[i].length; + wqe->write.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->write.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->write.sgl[i].to = + cpu_to_be64(wr->sg_list[i].addr); + } + wqe->write.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 5 + ((wr->num_sge) << 1); + } + wqe->write.plen = cpu_to_be32(plen); + return 0; +} + +static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr, + u8 *flit_cnt) +{ + if (wr->num_sge > 1) + return -EINVAL; + wqe->read.rdmaop = T3_READ_REQ; + wqe->read.reserved[0] = 0; + wqe->read.reserved[1] = 0; + wqe->read.reserved[2] = 0; + wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr); + wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey); + wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length); + wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr); + *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3; + return 0; +} + +/* + * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. 
+ */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (!mhp->attr.state) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (mhp->attr.zbva) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + + if (sg_list[i].addr < mhp->attr.va_fbo) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = ((mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3) + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = 
cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapters address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + sqp = qhp->wq.sq + + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* T3 reads are always signaled */ + err = iwch_build_rdma_read(wqe, wr, 
&t3_wr_flit_cnt); + if (err) + break; + sqp->read_len = wqe->read.local_len; + if (!qhp->wq.oldest_read) + qhp->wq.oldest_read = sqp; + break; + default: + PDBG("%s post of type=%d TBD!\n", __FUNCTION__, + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp->wr_id = wr->wr_id; + sqp->opcode = wr2opcode(t3_wr_opcode); + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED); + + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", + __FUNCTION__, wr->wr_id, idx, + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), + sqp->opcode); + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + unsigned long flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); 
+ PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rw_rptr 0x%x " + "wqe %p \n", __FUNCTION__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + unsigned long flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp = 
qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + sqp->wr_id = mw_bind->wr_id; + sqp->opcode = T3_BIND_MW; + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED); + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + 
*ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + union t3_wr *wqe; + struct terminate_message *term; + int status; + int tagged = 0; + struct sk_buff *skb; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + skb = alloc_skb(40, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (union t3_wr *)skb_put(skb, 40); + memset(wqe, 0, 40); + wqe->send.rdmaop = T3_TERMINATE; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + if (rsp_msg) { + status = CQE_STATUS(rsp_msg->cqe); + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + } else { + status = TPT_ERR_INTERNAL_ERR; + } + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, + qhp->ep->hwtid, 5); + skb->priority = CPL_PRIORITY_DATA; + return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb)); +} + +/* + * Assumes qhp lock is held. + */ +static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + struct iwch_cq *rchp, *schp; + int count; + + rchp = get_chp(qhp->rhp, qhp->attr.rcq); + schp = get_chp(qhp->rhp, qhp->attr.scq); + + PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp); + /* take a ref on the qhp since we must release the lock */ + atomic_inc(&qhp->refcnt); + spin_unlock_irqrestore(&qhp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. */ + spin_lock_irqsave(&rchp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&rchp->cq); + cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); + cxio_flush_rq(&qhp->wq, &rchp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&rchp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. 
*/ + spin_lock_irqsave(&schp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&schp->cq); + cxio_count_scqes(&schp->cq, &qhp->wq, &count); + cxio_flush_sq(&qhp->wq, &schp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&schp->lock, *flag); + + /* deref */ + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); + + spin_lock_irqsave(&qhp->lock, *flag); +} + +static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag) +{ + if (t3b_device(qhp->rhp)) + cxio_set_wq_in_error(&qhp->wq); + else + __flush_qp(qhp, flag); +} + + +/* + * Return non zero if at least one RECV was pre-posted. + */ +static inline int rqes_posted(struct iwch_qp *qhp) +{ + return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV); +} + +static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs) +{ + struct t3_rdma_init_attr init_attr; + int ret; + + init_attr.tid = qhp->ep->hwtid; + init_attr.qpid = qhp->wq.qpid; + init_attr.pdid = qhp->attr.pd; + init_attr.scqid = qhp->attr.scq; + init_attr.rcqid = qhp->attr.rcq; + init_attr.rq_addr = qhp->wq.rq_addr; + init_attr.rq_size = 1 << qhp->wq.rq_size_log2; + init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | + qhp->attr.mpa_attr.recv_marker_enabled | + (qhp->attr.mpa_attr.xmit_marker_enabled << 1) | + (qhp->attr.mpa_attr.crc_enabled << 2); + + /* + * XXX - The IWCM doesn't quite handle getting these + * attrs set before going into RTS. For now, just turn + * them on always... 
+ */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0; + PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d " + "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, + init_attr.rq_addr, init_attr.rq_size, + init_attr.flags, init_attr.qpcaps); + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + unsigned long flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, + qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? 
attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. 
+ */ + get_ep(&qhp->ep->com); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + put_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + __FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); 
+ + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); + flush_qp(qhp, &flag); +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + put_ep(&ep->com); + + PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = get_qhp(chp->rhp, i); + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) { + resume_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} From swise at opengridcomputing.com Wed Dec 20 11:20:55 2006 From: swise at opengridcomputing.com (Steve 
Wise) Date: Wed, 20 Dec 2006 13:20:55 -0600 Subject: [openib-general] [PATCH v5 06/13] iw_cxgb3 Completion Queues In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192055.19316.62329.stgit@dell3.ogc.int> Functions to manipulate CQs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cq.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..ff09509 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. + * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (t3a_device(chp->rhp) && credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + wc->vendor_err = CQE_STATUS(cqe); + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x " + "lo 0x%x cookie 0x%llx\n", __FUNCTION__, + CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case 
T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + printk(KERN_ERR MOD "Unexpected opcode %d " + "in the CQE received for QPID=0x%0x\n", + CQE_OPCODE(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + goto out; + } + } + + if (cqe_flushed) + wc->status = IB_WC_WR_FLUSH_ERR; + else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + case TPT_ERR_SWFLUSH: + wc->status = IB_WC_WR_FLUSH_ERR; + break; + default: + printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for " + "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * 
Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. + */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} From swise at opengridcomputing.com Wed Dec 20 11:21:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:21:25 -0600 Subject: [openib-general] [PATCH v5 07/13] iw_cxgb3 Async Event Handler In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192125.19316.92319.stgit@dell3.ogc.int> Code to handle async events coming from the T3 RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..646f612 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe), + CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + if ((qhp->attr.state == IWCH_QP_STATE_ERROR) || + (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) { + PDBG("%s AE received after RTS - " + "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, + qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + (*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_TERMINATE; + iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (send_term) + iwch_post_terminate(qhp, rsp_msg); + } + + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t 
*) skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + u32 cqid = RSPQ_CQID(rsp_msg); + + rnicp = (struct iwch_dev *) rdev_p->ulp; + spin_lock(&rnicp->lock); + chp = get_chp(rnicp, cqid); + qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe)); + if (!chp || !qhp) { + printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d " + "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", + cqid, CQE_QPID(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), + CQE_WRID_LOW(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + /* + * 1) completion of our sending a TERMINATE. + * 2) incoming TERMINATE message. + */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s QPID 0x%x ep %p disconnecting\n", + __FUNCTION__, qhp->wq.qpid, qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, + qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", + CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Wed Dec 20 11:21:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:21:55 -0600 Subject: [openib-general] [PATCH v5 08/13] iw_cxgb3 Memory 
Registration In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192155.19316.73702.stgit@dell3.ogc.int> Functions to register memory regions. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 170 ++++++++++++++++++++++++++++++++ 1 files changed, 170 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..5909ec5 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list) +{ + u32 stag; + u32 mmid; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + __be64 *page_list, + int npages) +{ + u32 stag; + u32 mmid; + + + /* We could support this... 
*/ + if (npages > mhp->attr.pbl_size) + return -ENOMEM; + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) + return -ENOMEM; + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + insert_handle(rhp, &rhp->mmidr, mhp, mmid); + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + __be64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ + for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) + return -EINVAL; + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) + return -ENOMEM; + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << 
*shift)); + + PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + + return 0; + +} From swise at opengridcomputing.com Wed Dec 20 11:22:25 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:22:25 -0600 Subject: [openib-general] [PATCH v5 09/13] iw_cxgb3 Core WQE/CQE Types In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192225.19316.33284.stgit@dell3.ogc.int> T3 WQE and CQE structures, defines, etc... Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 685 ++++++++++++++++++++++++++++ 1 files changed, 685 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..234a084 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,685 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + __be32 stag; + __be32 len; + __be64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be32 plen; /* 3 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 stag; /* 2 */ + __be32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh wrh; 
/* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 stag_sink; + __be64 to_sink; /* 3 */ + __be32 plen; /* 4 */ + __be32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 rdmaop; /* 2 */ + u8 reserved[3]; + __be32 rem_stag; + __be64 rem_to; /* 3 */ + __be32 local_stag; /* 4 */ + __be32 local_len; + __be64 local_to; /* 5 */ +}; + +enum t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u16 reserved; /* 2 */ + u8 type; + u8 perms; + __be32 mr_stag; + __be32 mw_stag; /* 3 */ + __be32 mw_len; + __be64 mw_va; /* 4 */ + __be32 mr_pbl_addr; /* 5 */ + u8 reserved2[3]; + u8 mr_pagesz; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + __be32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + __be32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 flags; /* 2 */ + __be32 quiesce; /* 2 */ + __be32 max_ird; /* 3 */ + __be32 max_ord; /* 3 */ + __be64 sge_cmd; /* 4 */ + __be64 ctx1; /* 5 */ + __be64 ctx0; /* 6 */ +}; + +enum t3_modify_qp_flags { + MODQP_QUIESCE = 0x01, + MODQP_MAX_IRD = 0x02, + MODQP_MAX_ORD = 0x04, + MODQP_WRITE_EC = 0x08, + MODQP_READ_EC = 0x10, +}; + + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + uP_RI_QP_RDMA_READ_ENABLE = 0x01, 
+ uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u32 flags; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + __be32 qpid; /* 2 */ + __be32 pdid; + __be32 scqid; /* 3 */ + __be32 rcqid; + __be32 rq_addr; /* 4 */ + __be32 rq_size; + u8 mpaattrs; /* 5 */ + u8 qpcaps; + __be16 ulpdu_size; + __be32 flags; /* bits 31-1 - reserved */ + /* bit 0 - set if RECV posted */ + __be32 ord; /* 6 */ + __be32 ird; + __be64 qp_dma_addr; /* 7 */ + __be32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +struct t3_genbit { + u64 flit[15]; + __be64 genbit; +}; + +enum rdma_init_wr_flags { + RECVS_POSTED = 1, +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + struct t3_genbit genbit; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe) +{ + return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags)); +} + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit...
*/ + ((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + __be32 valid_stag_pdid; + __be32 flags_pagesize_qpid; + + __be32 rsvd_pbl_addr; + __be32 len; + __be32 va_hi; + __be32 va_low_or_fbo; + + __be32 rsvd_bind_cnt_or_pstag; + __be32 rsvd_pbl_size; +}; + +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 +#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << 
S_TPT_MW_BIND_ENABLE) +#define F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + __be32 header; + __be32 len; + union { + struct { + __be32 stag; + __be32 msn; + } rcqe; + struct { + u32 wrid_hi; + u32 wrid_low; + } scqe; + } u; +}; + +#define S_CQE_OOO 31 +#define M_CQE_OOO 0x1 +#define G_CQE_OOO(x) ((((x) >> S_CQE_OOO)) & M_CQE_OOO) +#define V_CEQ_OOO(x) ((x)<> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<queue->flit[13] = 1; +} + +static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if 
(!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + return NULL; +} + +static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +#endif From swise at opengridcomputing.com Wed Dec 20 11:22:55 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:22:55 -0600 Subject: [openib-general] [PATCH v5 10/13] iw_cxgb3 Core HAL In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192255.19316.19320.stgit@dell3.ogc.int> The RDMA Core interfaces with the T3 HW and ULLD providing a low level RDMA interface. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_hal.h | 201 ++++ 2 files changed, 1503 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c new file mode 100644 index 0000000..5e31816 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c @@ -0,0 +1,1302 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include + +#include +#include +#include +#include + +#include "cxio_resource.h" +#include "cxio_hal.h" +#include "cxgb3_offload.h" +#include "sge_defs.h" + +static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC]; +static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL; + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (!strcmp(rdev_tbl[i]->dev_name, dev_name)) + return rdev_tbl[i]; + return NULL; +} + +static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev + *tdev) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i]) + if (rdev_tbl[i]->t3cdev_p == tdev) + return rdev_tbl[i]; + return NULL; +} + +static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (!rdev_tbl[i]) { + rdev_tbl[i] = rdev_p; + break; + } + return (i == T3_MAX_NUM_RNIC); +} + +static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p) +{ + int i; + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + if (rdev_tbl[i] == rdev_p) { + rdev_tbl[i] = NULL; + break; + } +} + +int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit) +{ + int ret; + struct t3_cqe *cqe; + u32 rptr; + + struct rdma_cq_op setup; + setup.id = cq->cqid; + setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0; + setup.op = op; + ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup); + + if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) + return ret; + + /* + * If the rearm returned an index other than our current index, + * then there might be CQE's in flight (being DMA'd). We must wait + * here for them to complete or the consumer can miss a notification. + */ + if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) { + int i=0; + + rptr = cq->rptr; + + /* + * Keep the generation correct by bumping rptr until it + * matches the index returned by the rearm - 1. 
+ */ + while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret) + rptr++; + + /* + * Now rptr is the index for the (last) cqe that was + * in-flight at the time the HW rearmed the CQ. We + * spin until that CQE is valid. + */ + cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2); + while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) { + udelay(1); + if (i++ > 1000000) { + BUG_ON(1); + printk(KERN_ERR "%s: stalled rnic\n", + rdev_p->dev_name); + return -EIO; + } + } + } + return 0; +} + +static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid) +{ + struct rdma_cq_setup setup; + setup.id = cqid; + setup.base_addr = 0; /* NULL address */ + setup.size = 0; /* disable the CQ */ + setup.credits = 0; + setup.credit_thres = 0; + setup.ovfl_mode = 0; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid) +{ + u64 sge_cmd; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = qpid << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe); + + cq->cqid = cxio_hal_get_cqid(rdev_p->rscp); + if (!cq->cqid) + return -ENOMEM; + cq->sw_queue = kzalloc(size, GFP_KERNEL); + if (!cq->sw_queue) + return -ENOMEM; + cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) * + sizeof(struct t3_cqe), + &(cq->dma_addr), GFP_KERNEL); + if (!cq->queue) { + kfree(cq->sw_queue); + return
-ENOMEM; + } + pci_unmap_addr_set(cq, mapping, cq->dma_addr); + memset(cq->queue, 0, size); + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = 65535; + setup.credit_thres = 1; + if (rdev_p->t3cdev_p->type == T3B) + setup.ovfl_mode = 0; + else + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + struct rdma_cq_setup setup; + setup.id = cq->cqid; + setup.base_addr = (u64) (cq->dma_addr); + setup.size = 1UL << cq->size_log2; + setup.credits = setup.size; + setup.credit_thres = setup.size; /* TBD: overflow recovery */ + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + u32 qpid; + int i; + + mutex_lock(&uctx->lock); + if (!list_empty(&uctx->qpids)) { + entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, + entry); + list_del(&entry->entry); + qpid = entry->qpid; + kfree(entry); + } else { + qpid = cxio_hal_get_qpid(rdev_p->rscp); + if (!qpid) + goto out; + for (i = qpid+1; i & rdev_p->qpmask; i++) { + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + break; + entry->qpid = i; + list_add_tail(&entry->entry, &uctx->qpids); + } + } +out: + mutex_unlock(&uctx->lock); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, + struct cxio_ucontext *uctx) +{ + struct cxio_qpid_list *entry; + + entry = kmalloc(sizeof *entry, GFP_KERNEL); + if (!entry) + return; + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + entry->qpid = qpid; + mutex_lock(&uctx->lock); + list_add_tail(&entry->entry, &uctx->qpids); + mutex_unlock(&uctx->lock); +} + +void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + struct list_head *pos, *nxt; + 
struct cxio_qpid_list *entry; + + mutex_lock(&uctx->lock); + list_for_each_safe(pos, nxt, &uctx->qpids) { + entry = list_entry(pos, struct cxio_qpid_list, entry); + list_del_init(&entry->entry); + if (!(entry->qpid & rdev_p->qpmask)) + cxio_hal_put_qpid(rdev_p->rscp, entry->qpid); + kfree(entry); + } + mutex_unlock(&uctx->lock); +} + +void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx) +{ + INIT_LIST_HEAD(&uctx->qpids); + mutex_init(&uctx->lock); +} + +int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain, + struct t3_wq *wq, struct cxio_ucontext *uctx) +{ + int depth = 1UL << wq->size_log2; + int rqsize = 1UL << wq->rq_size_log2; + + wq->qpid = get_qpid(rdev_p, uctx); + if (!wq->qpid) + return -ENOMEM; + + wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL); + if (!wq->rq) + goto err1; + + wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize); + if (!wq->rq_addr) + goto err2; + + wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL); + if (!wq->sq) + goto err3; + + wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev), + depth * sizeof(union t3_wr), + &(wq->dma_addr), GFP_KERNEL); + if (!wq->queue) + goto err4; + + memset(wq->queue, 0, depth * sizeof(union t3_wr)); + pci_unmap_addr_set(wq, mapping, wq->dma_addr); + wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + if (!kernel_domain) + wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + + (wq->qpid << rdev_p->qpshift); + PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, + wq->qpid, wq->doorbell, wq->udb); + return 0; +err4: + kfree(wq->sq); +err3: + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize); +err2: + kfree(wq->rq); +err1: + put_qpid(rdev_p, wq->qpid, uctx); + return -ENOMEM; +} + +int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq) +{ + int err; + err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid); + kfree(cq->sw_queue); + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (cq->size_log2)) + * sizeof(struct t3_cqe), 
cq->queue, + pci_unmap_addr(cq, mapping)); + cxio_hal_put_cqid(rdev_p->rscp, cq->cqid); + return err; +} + +int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, + struct cxio_ucontext *uctx) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << (wq->size_log2)) + * sizeof(union t3_wr), wq->queue, + pci_unmap_addr(wq, mapping)); + kfree(wq->sq); + cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2)); + kfree(wq->rq); + put_qpid(rdev_p, wq->qpid, uctx); + return 0; +} + +static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(T3_SEND) | + V_CQE_TYPE(0) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count) +{ + u32 ptr; + + PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq); + + /* flush RQ */ + PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, + wq->rq_rptr, wq->rq_wptr, count); + ptr = wq->rq_rptr + count; + while (ptr++ != wq->rq_wptr) + insert_recv_cqe(wq, cq); +} + +static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, + struct t3_swsq *sqp) +{ + struct t3_cqe cqe; + + PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, + wq, cq, cq->sw_rptr, cq->sw_wptr); + memset(&cqe, 0, sizeof(cqe)); + cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | + V_CQE_OPCODE(sqp->opcode) | + V_CQE_TYPE(1) | + V_CQE_SWCQE(1) | + V_CQE_QPID(wq->qpid) | + V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, + cq->size_log2))); + cqe.u.scqe.wrid_hi = sqp->sq_wptr; + + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe; + cq->sw_wptr++; +} + +void cxio_flush_sq(struct t3_wq *wq, 
struct t3_cq *cq, int count) +{ + __u32 ptr; + struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2); + + ptr = wq->sq_rptr + count; + sqp += count; + while (ptr != wq->sq_wptr) { + insert_sq_cqe(wq, cq, sqp); + sqp++; + ptr++; + } +} + +/* + * Move all CQEs from the HWCQ into the SWCQ. + */ +void cxio_flush_hw_cq(struct t3_cq *cq) +{ + struct t3_cqe *cqe, *swcqe; + + PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid); + cqe = cxio_next_hw_cqe(cq); + while (cqe) { + PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", + __FUNCTION__, cq->rptr, cq->sw_wptr); + swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2); + *swcqe = *cqe; + swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1)); + cq->sw_wptr++; + cq->rptr++; + cqe = cxio_next_hw_cqe(cq); + } +} + +static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq) +{ + if (CQE_OPCODE(*cqe) == T3_TERMINATE) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe)) + return 0; + + if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) + return 0; + + return 1; +} + +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && + (CQE_QPID(*cqe) == wq->qpid)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count) +{ + struct t3_cqe *cqe; + u32 ptr; + + *count = 0; + PDBG("%s count zero %d\n", __FUNCTION__, *count); + ptr = cq->sw_rptr; + while (!Q_EMPTY(ptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2)); + if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && + (CQE_QPID(*cqe) == wq->qpid) && 
cqe_completes_wr(cqe, wq)) + (*count)++; + ptr++; + } + PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count); +} + +static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p) +{ + struct rdma_cq_setup setup; + setup.id = 0; + setup.base_addr = 0; /* NULL address */ + setup.size = 1; /* enable the CQ */ + setup.credits = 0; + + /* force SGE to redirect to RspQ and interrupt */ + setup.credit_thres = 0; + setup.ovfl_mode = 1; + return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup)); +} + +static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p) +{ + int err; + u64 sge_cmd, ctx0, ctx1; + u64 base_addr; + struct t3_modify_qp_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL); + + + if (!skb) { + PDBG("%s alloc_skb failed\n", __FUNCTION__); + return -ENOMEM; + } + err = cxio_hal_init_ctrl_cq(rdev_p); + if (err) { + PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err); + return err; + } + rdev_p->ctrl_qp.workq = dma_alloc_coherent( + &(rdev_p->rnic_info.pdev->dev), + (1 << T3_CTRL_QP_SIZE_LOG2) * + sizeof(union t3_wr), + &(rdev_p->ctrl_qp.dma_addr), + GFP_KERNEL); + if (!rdev_p->ctrl_qp.workq) { + PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__); + return -ENOMEM; + } + pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, + rdev_p->ctrl_qp.dma_addr); + rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr; + memset(rdev_p->ctrl_qp.workq, 0, + (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr)); + + init_MUTEX(&rdev_p->ctrl_qp.sem); + init_waitqueue_head(&rdev_p->ctrl_qp.waitq); + + /* update HW Ctrl QP context */ + base_addr = rdev_p->ctrl_qp.dma_addr; + base_addr >>= 12; + ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) | + V_EC_BASE_LO((u32) base_addr & 0xffff)); + ctx0 <<= 32; + ctx0 |= V_EC_CREDITS(FW_WR_NUM); + base_addr >>= 16; + ctx1 = (u32) base_addr; + base_addr >>= 32; + ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) | + V_EC_TYPE(0) | V_EC_GEN(1) | + V_EC_UP_TOKEN(T3_CTL_QP_TID) | 
F_EC_VALID)) << 32; + wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe)); + memset(wqe, 0, sizeof(*wqe)); + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1, + T3_CTL_QP_TID, 7); + wqe->flags = cpu_to_be32(MODQP_WRITE_EC); + sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3; + wqe->sge_cmd = cpu_to_be64(sge_cmd); + wqe->ctx1 = cpu_to_be64(ctx1); + wqe->ctx0 = cpu_to_be64(ctx0); + PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n", + (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq, + 1 << T3_CTRL_QP_SIZE_LOG2); + skb->priority = CPL_PRIORITY_CONTROL; + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p) +{ + dma_free_coherent(&(rdev_p->rnic_info.pdev->dev), + (1UL << T3_CTRL_QP_SIZE_LOG2) + * sizeof(union t3_wr), rdev_p->ctrl_qp.workq, + pci_unmap_addr(&rdev_p->ctrl_qp, mapping)); + return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID); +} + +/* write len bytes of data into addr (32B aligned address) + * If data is NULL, clear len bytes of memory to zero. + * caller acquires the sem before the call + */ +static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr, + u32 len, void *data, int completion) +{ + u32 i, nr_wqe, copy_len; + u8 *copy_data; + u8 wr_len, utx_len; /* length in 8 byte flits */ + enum t3_wr_flags flag; + __be64 *wqe; + u64 utx_cmd; + addr &= 0x7FFFFFF; + nr_wqe = len % 96 ?
len / 96 + 1 : len / 96; /* 96B max per WQE */ + PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n", + __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, + nr_wqe, data, addr); + utx_len = 3; /* in 32B unit */ + for (i = 0; i < nr_wqe; i++) { + if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2)) { + PDBG("%s ctrl_qp full wtpr 0x%0x rptr 0x%0x, " + "wait for more space i %d\n", __FUNCTION__, + rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i); + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + !Q_FULL(rdev_p->ctrl_qp.rptr, + rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2))) { + PDBG("%s ctrl_qp workq interrupted\n", + __FUNCTION__); + return -ERESTARTSYS; + } + PDBG("%s ctrl_qp wakeup, continue posting work request " + "i %d\n", __FUNCTION__, i); + } + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + flag = 0; + if (i == (nr_wqe - 1)) { + /* last WQE */ + flag = completion ? T3_COMPLETION_FLAG : 0; + if (len % 32) + utx_len = len / 32 + 1; + else + utx_len = len / 32; + } + + /* + * Force a CQE to return the credit to the workq in case + * we posted more than half the max QP size of WRs + */ + if ((i != 0) && + (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) { + flag = T3_COMPLETION_FLAG; + PDBG("%s force completion at i %d\n", __FUNCTION__, i); + } + + /* build the utx mem command */ + wqe += (sizeof(struct t3_bypass_wr) >> 3); + utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3); + utx_cmd <<= 32; + utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1); + *wqe = cpu_to_be64(utx_cmd); + wqe++; + copy_data = (u8 *) data + i * 96; + copy_len = len > 96 ? 
96 : len; + + /* clear memory content if data is NULL */ + if (data) + memcpy(wqe, copy_data, copy_len); + else + memset(wqe, 0, copy_len); + if (copy_len % 32) + memset(((u8 *) wqe) + copy_len, 0, + 32 - (copy_len % 32)); + wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + + (utx_len << 2); + wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr % + (1 << T3_CTRL_QP_SIZE_LOG2))); + + /* wptr in the WRID[31:0] */ + ((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr; + + /* + * This must be the last write with a memory barrier + * for the genbit + */ + build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag, + Q_GENBIT(rdev_p->ctrl_qp.wptr, + T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID, + wr_len); + if (flag == T3_COMPLETION_FLAG) + ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID); + len -= 96; + rdev_p->ctrl_qp.wptr++; + } + return 0; +} + +/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size + * OUT: stag index, actual pbl_size, pbl_addr allocated. + * TBD: shared memory region support + */ +static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry, + u32 *stag, u8 stag_state, u32 pdid, + enum tpt_mem_type type, enum tpt_mem_perm perm, + u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl, + u32 *pbl_size, u32 *pbl_addr) +{ + int err; + struct tpt_entry tpt; + u32 stag_idx; + u32 wptr; + int rereg = (*stag != T3_STAG_UNSET); + + stag_state = stag_state > 0; + stag_idx = (*stag) >> 8; + + if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) { + stag_idx = cxio_hal_get_stag(rdev_p->rscp); + if (!stag_idx) + return -ENOMEM; + *stag = (stag_idx << 8) | ((*stag) & 0xFF); + } + PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", + __FUNCTION__, stag_state, type, pdid, stag_idx); + + if (reset_tpt_entry) + cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3); + else if (!rereg) { + *pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3); + if (!*pbl_addr) { + return -ENOMEM; + } + } + + 
down_interruptible(&rdev_p->ctrl_qp.sem); + + /* write PBL first if any - update pbl only if pbl list exists */ + if (pbl) { + + PDBG("%s *pdb_addr 0x%x, pbl_base 0x%x, pbl_size %d\n", + __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, + *pbl_size); + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + (*pbl_addr >> 5), + (*pbl_size << 3), pbl, 0); + if (err) + goto ret; + } + + /* write TPT entry */ + if (reset_tpt_entry) + memset(&tpt, 0, sizeof(tpt)); + else { + tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID | + V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) | + V_TPT_STAG_STATE(stag_state) | + V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid)); + BUG_ON(page_size >= 28); + tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | + F_TPT_MW_BIND_ENABLE | + V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) | + V_TPT_PAGE_SIZE(page_size)); + tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3)); + tpt.len = cpu_to_be32(len); + tpt.va_hi = cpu_to_be32((u32) (to >> 32)); + tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL)); + tpt.rsvd_bind_cnt_or_pstag = 0; + tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : + cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2)); + } + err = cxio_hal_ctrl_qp_write_mem(rdev_p, + stag_idx + + (rdev_p->rnic_info.tpt_base >> 5), + sizeof(tpt), &tpt, 1); + + /* release the stag index to free pool */ + if (reset_tpt_entry) + cxio_hal_put_stag(rdev_p->rscp, stag_idx); +ret: + wptr = rdev_p->ctrl_qp.wptr; + up(&rdev_p->ctrl_qp.sem); + if (!err) + if (wait_event_interruptible(rdev_p->ctrl_qp.waitq, + SEQ32_GE(rdev_p->ctrl_qp.rptr, + wptr))) + return -ERESTARTSYS; + return err; +} + +/* IN : stag key, pdid, pbl_size + * Out: stag index, actual pbl_size, and pbl_addr allocated.
+ */ +int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr) +{ + *stag = T3_STAG_UNSET; + return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, + perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr)); +} + +int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm, + zbva, to, len, page_size, pbl, pbl_size, pbl_addr); +} + +int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, + u32 pbl_addr) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + &pbl_size, &pbl_addr); +} + +int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid) +{ + u32 pbl_size = 0; + *stag = T3_STAG_UNSET; + return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0, + NULL, &pbl_size, NULL); +} + +int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag) +{ + return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL, + NULL, NULL); +} + +int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr) +{ + struct t3_rdma_init_wr *wqe; + struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC); + if (!skb) + return -ENOMEM; + PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p); + wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe)); + wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT)); + wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) | + 
V_FW_RIWR_LEN(sizeof(*wqe) >> 3)); + wqe->wrid.id1 = 0; + wqe->qpid = cpu_to_be32(attr->qpid); + wqe->pdid = cpu_to_be32(attr->pdid); + wqe->scqid = cpu_to_be32(attr->scqid); + wqe->rcqid = cpu_to_be32(attr->rcqid); + wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base); + wqe->rq_size = cpu_to_be32(attr->rq_size); + wqe->mpaattrs = attr->mpaattrs; + wqe->qpcaps = attr->qpcaps; + wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss); + wqe->flags = cpu_to_be32(attr->flags); + wqe->ord = cpu_to_be32(attr->ord); + wqe->ird = cpu_to_be32(attr->ird); + wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr); + wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size); + wqe->rsvd = 0; + skb->priority = 0; /* 0=>ToeQ; 1=>CtrlQ */ + return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb)); +} + +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = ev_cb; +} + +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb) +{ + cxio_ev_cb = NULL; +} + +static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb) +{ + static int cnt; + struct cxio_rdev *rdev_p = NULL; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x" + " se %0x notify %0x cqbranch %0x creditth %0x\n", + cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg), + RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg), + RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg), + RSPQ_CREDIT_THRESH(rsp_msg)); + PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d " + "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + rdev_p = (struct cxio_rdev *)t3cdev_p->ulp; + if (!rdev_p) { + PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__, + t3cdev_p); + 
return 0; + } + if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) { + rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1; + wake_up_interruptible(&rdev_p->ctrl_qp.waitq); + dev_kfree_skb_irq(skb); + } else if (CQE_QPID(rsp_msg->cqe) == 0xfff8) + dev_kfree_skb_irq(skb); + else if (cxio_ev_cb) + (*cxio_ev_cb) (rdev_p, skb); + else + dev_kfree_skb_irq(skb); + cnt++; + return 0; +} + +/* Caller takes care of locking if needed */ +int cxio_rdev_open(struct cxio_rdev *rdev_p) +{ + struct net_device *netdev_p = NULL; + int err = 0; + if (strlen(rdev_p->dev_name)) { + if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) { + return -EBUSY; + } + netdev_p = dev_get_by_name(rdev_p->dev_name); + if (!netdev_p) { + return -EINVAL; + } + dev_put(netdev_p); + } else if (rdev_p->t3cdev_p) { + if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) { + return -EBUSY; + } + netdev_p = rdev_p->t3cdev_p->lldev; + strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name, + T3_MAX_DEV_NAME_LEN); + } else { + PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__); + return -EINVAL; + } + + if (cxio_hal_add_rdev(rdev_p)) + return -ENOMEM; + + PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name); + memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp)); + if (!rdev_p->t3cdev_p) + rdev_p->t3cdev_p = T3CDEV(netdev_p); + rdev_p->t3cdev_p->ulp = (void *) rdev_p; + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS, + &(rdev_p->rnic_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS, + &(rdev_p->port_info)); + if (err) { + printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n", + __FUNCTION__, rdev_p->t3cdev_p, err); + goto err1; + } + + /* + * qpshift is the number of bits to shift the qpid left in order + * to get the correct address of the doorbell for that qp. 
+ */ + cxio_init_ucontext(rdev_p, &rdev_p->uctx); + rdev_p->qpshift = PAGE_SHIFT - + ilog2(65536 >> + ilog2(rdev_p->rnic_info.udbell_len >> + PAGE_SHIFT)); + rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT; + rdev_p->qpmask = (65536 >> ilog2(rdev_p->qpnr)) - 1; + PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d " + "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", + __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, + rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), + rdev_p->rnic_info.pbl_base, + rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base, + rdev_p->rnic_info.rqt_top); + PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu " + "qpnr %d qpmask 0x%x\n", + rdev_p->rnic_info.udbell_len, + rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr, + rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask); + + err = cxio_hal_init_ctrl_qp(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", + __FUNCTION__, err); + goto err1; + } + err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0, + 0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ, + T3_MAX_NUM_PD); + if (err) { + printk(KERN_ERR "%s error %d initializing hal resources.\n", + __FUNCTION__, err); + goto err2; + } + err = cxio_hal_pblpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing pbl mem pool.\n", + __FUNCTION__, err); + goto err3; + } + err = cxio_hal_rqtpool_create(rdev_p); + if (err) { + printk(KERN_ERR "%s error %d initializing rqt mem pool.\n", + __FUNCTION__, err); + goto err4; + } + return 0; +err4: + cxio_hal_pblpool_destroy(rdev_p); +err3: + cxio_hal_destroy_resource(rdev_p->rscp); +err2: + cxio_hal_destroy_ctrl_qp(rdev_p); +err1: + cxio_hal_delete_rdev(rdev_p); + return err; +} + +void cxio_rdev_close(struct cxio_rdev *rdev_p) +{ + if (rdev_p) { + cxio_hal_pblpool_destroy(rdev_p); + cxio_hal_rqtpool_destroy(rdev_p); + cxio_hal_delete_rdev(rdev_p); + rdev_p->t3cdev_p->ulp = NULL; + 
cxio_hal_destroy_ctrl_qp(rdev_p); + cxio_hal_destroy_resource(rdev_p->rscp); + } +} + +int __init cxio_hal_init(void) +{ + if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI)) + return -ENOMEM; + memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *)); + t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler); + return 0; +} + +void __exit cxio_hal_exit(void) +{ + int i; + t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL); + for (i = 0; i < T3_MAX_NUM_RNIC; i++) + cxio_rdev_close(rdev_tbl[i]); + cxio_hal_destroy_rhdl_resource(); +} + +static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq) +{ + struct t3_swsq *sqp; + __u32 ptr = wq->sq_rptr; + int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr); + + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + while (count--) + if (!sqp->signaled) { + ptr++; + sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2); + } else if (sqp->complete) { + + /* + * Insert this completed cqe into the swcq. + */ + PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n", + __FUNCTION__, Q_PTR2IDX(ptr, wq->sq_size_log2), + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)); + sqp->cqe.header |= htonl(V_CQE_SWCQE(1)); + *(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) + = sqp->cqe; + cq->sw_wptr++; + sqp->signaled = 0; + break; + } else + break; +} + +static inline void create_read_req_cqe(struct t3_wq *wq, + struct t3_cqe *hw_cqe, + struct t3_cqe *read_cqe) +{ + read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr; + read_cqe->len = wq->oldest_read->read_len; + read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) | + V_CQE_SWCQE(SW_CQE(*hw_cqe)) | + V_CQE_OPCODE(T3_READ_REQ) | + V_CQE_TYPE(1)); +} + +/* + * Return a ptr to the next read wr in the SWSQ or NULL. 
+ */ +static inline void advance_oldest_read(struct t3_wq *wq) +{ + + u32 rptr = wq->oldest_read - wq->sq + 1; + u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2); + + while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) { + wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2); + + if (wq->oldest_read->opcode == T3_READ_REQ) + return; + rptr++; + } + wq->oldest_read = NULL; +} + +/* + * cxio_poll_cq + * + * Caller must: + * check the validity of the first CQE, + * supply the wq associated with the qpid. + * + * credit: cq credit to return to sge. + * cqe_flushed: 1 iff the CQE is flushed. + * cqe: copy of the polled CQE. + * + * return value: + * 0 CQE returned, + * -1 CQE skipped, try again. + */ +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit) +{ + int ret = 0; + struct t3_cqe *hw_cqe, read_cqe; + + *cqe_flushed = 0; + *credit = 0; + hw_cqe = cxio_next_cqe(cq); + + PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x" + " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", + __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), + CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), + CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), + CQE_WRID_LOW(*hw_cqe)); + + /* + * skip cqe's not affiliated with a QP. + */ + if (wq == NULL) { + ret = -1; + goto skip_cqe; + } + + /* + * Gotta tweak READ completions: + * 1) the cqe doesn't contain the sq_wptr from the wr. + * 2) opcode not reflected from the wr. + * 3) read_len not reflected from the wr. + * 4) cq_type is RQ_TYPE not SQ_TYPE. + */ + if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) { + + /* + * Don't write to the HWCQ, so create a new read req CQE + * in local memory. + */ + create_read_req_cqe(wq, hw_cqe, &read_cqe); + hw_cqe = &read_cqe; + advance_oldest_read(wq); + } + + /* + * T3A: Discard TERMINATE CQEs.
+ */ + if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) { + ret = -1; + wq->error = 1; + goto skip_cqe; + } + + if (CQE_STATUS(*hw_cqe) || wq->error) { + *cqe_flushed = wq->error; + wq->error = 1; + + /* + * T3A inserts errors into the CQE. We cannot return + * these as work completions. + */ + /* incoming write failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) + && RQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + /* incoming read request failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) { + ret = -1; + goto skip_cqe; + } + + /* incoming SEND with no receive posted failures */ + if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) && + Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) { + ret = -1; + goto skip_cqe; + } + goto proc_cqe; + } + + /* + * RECV completion. + */ + if (RQ_TYPE(*hw_cqe)) { + + /* + * HW only validates 4 bits of MSN. So we must validate that + * the MSN in the SEND is the next expected MSN. If it's not, + * then we complete this with TPT_ERR_MSN and mark the wq in + * error. + */ + if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) { + wq->error = 1; + hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN)); + goto proc_cqe; + } + goto proc_cqe; + } + + /* + * If we get here it's a send completion. + * + * Handle out of order completion. These get stuffed + * in the SW SQ. Then the SW SQ is walked to move any + * now in-order completions into the SW CQ. This handles + * 2 cases: + * 1) reaping unsignaled WRs when the first subsequent + * signaled WR is completed. + * 2) out of order read completions.
+ */ + if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) { + struct t3_swsq *sqp; + + PDBG("%s out of order completion going in swsq at idx %ld\n", + __FUNCTION__, + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2)); + sqp = wq->sq + + Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2); + sqp->cqe = *hw_cqe; + sqp->complete = 1; + ret = -1; + goto flush_wq; + } + +proc_cqe: + *cqe = *hw_cqe; + + /* + * Reap the associated WR(s) that are freed up with this + * completion. + */ + if (SQ_TYPE(*hw_cqe)) { + wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe); + PDBG("%s completing sq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2)); + *cookie = (wq->sq + + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id; + wq->sq_rptr++; + } else { + PDBG("%s completing rq idx %ld\n", __FUNCTION__, + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + *cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2)); + wq->rq_rptr++; + } + +flush_wq: + /* + * Flush any completed cqes that are now in-order. + */ + flush_completed_wrs(wq, cq); + +skip_cqe: + if (SW_CQE(*hw_cqe)) { + PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->sw_rptr); + ++cq->sw_rptr; + } else { + PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", + __FUNCTION__, cq, cq->cqid, cq->rptr); + ++cq->rptr; + + /* + * T3A: compute credits. + */ + if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1))) + || ((cq->rptr - cq->wptr) >= 128)) { + *credit = cq->rptr - cq->wptr; + cq->wptr = cq->rptr; + } + } + return ret; +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h new file mode 100644 index 0000000..e5e702d --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h @@ -0,0 +1,201 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_HAL_H__ +#define __CXIO_HAL_H__ + +#include +#include + +#include "t3_cpl.h" +#include "t3cdev.h" +#include "cxgb3_ctl_defs.h" +#include "cxio_wr.h" + +#define T3_CTRL_QP_ID FW_RI_SGEEC_START +#define T3_CTL_QP_TID FW_RI_TID_START +#define T3_CTRL_QP_SIZE_LOG2 8 +#define T3_CTRL_CQ_ID 0 + +/* TBD */ +#define T3_MAX_NUM_RNIC 8 +#define T3_MAX_NUM_RI (1<<15) +#define T3_MAX_NUM_QP (1<<15) +#define T3_MAX_NUM_CQ (1<<15) +#define T3_MAX_NUM_PD (1<<15) +#define T3_MAX_PBL_SIZE 256 +#define T3_MAX_RQ_SIZE 1024 +#define T3_MAX_NUM_STAG (1<<15) + +#define T3_STAG_UNSET 0xffffffff + +#define T3_MAX_DEV_NAME_LEN 32 + +struct cxio_hal_ctrl_qp { + u32 wptr; + u32 rptr; + struct semaphore sem; /* for the wptr, can sleep */ + wait_queue_head_t waitq; /* wait for RspQ/CQE msg */ + union t3_wr *workq; /* the work request queue */ + dma_addr_t dma_addr; /* pci bus address of the workq */ + DECLARE_PCI_UNMAP_ADDR(mapping) + void __iomem *doorbell; +}; + +struct cxio_hal_resource { + struct kfifo *tpt_fifo; + spinlock_t tpt_fifo_lock; + struct kfifo *qpid_fifo; + spinlock_t qpid_fifo_lock; + struct kfifo *cqid_fifo; + spinlock_t cqid_fifo_lock; + struct kfifo *pdid_fifo; + spinlock_t pdid_fifo_lock; +}; + +struct cxio_qpid_list { + struct list_head entry; + u32 qpid; +}; + +struct cxio_ucontext { + struct list_head qpids; + struct mutex lock; +}; + +struct cxio_rdev { + char dev_name[T3_MAX_DEV_NAME_LEN]; + struct t3cdev *t3cdev_p; + struct rdma_info rnic_info; + struct adap_ports port_info; + struct cxio_hal_resource *rscp; + struct cxio_hal_ctrl_qp ctrl_qp; + void *ulp; + unsigned long qpshift; + u32 qpnr; + u32 qpmask; + struct cxio_ucontext uctx; + struct gen_pool *pbl_pool; + struct gen_pool *rqt_pool; +}; + +static inline int cxio_num_stags(struct cxio_rdev *rdev_p) +{ + return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5)); +} + +typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p, +
struct sk_buff * skb); + +#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff) +#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff) +#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1) +#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1) +#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1) +#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1) +#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1) +#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1) +#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1) + +struct respQ_msg_t { + __be32 flags; /* flit 0 */ + __be32 cq_ptrid; + __be64 rsvd; /* flit 1 */ + struct t3_cqe cqe; /* flits 2-3 */ +}; + +enum t3_cq_opcode { + CQ_ARM_AN = 0x2, + CQ_ARM_SE = 0x6, + CQ_FORCE_AN = 0x3, + CQ_CREDIT_UPDATE = 0x7 +}; + +int cxio_rdev_open(struct cxio_rdev *rdev); +void cxio_rdev_close(struct cxio_rdev *rdev); +int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, + enum t3_cq_opcode op, u32 credit); +int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid); +int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq); +void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx); +int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, + struct cxio_ucontext *uctx); +int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode); +int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr); +int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 
page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid, + enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len, + u8 page_size, __be64 *pbl, u32 *pbl_size, + u32 *pbl_addr); +int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, + u32 pbl_addr); +int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid); +int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag); +int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr); +void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb); +u32 cxio_hal_get_rhdl(void); +void cxio_hal_put_rhdl(u32 rhdl); +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp); +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid); +int __init cxio_hal_init(void); +void __exit cxio_hal_exit(void); +void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count); +void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count); +void cxio_flush_hw_cq(struct t3_cq *cq); +int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, + u8 *cqe_flushed, u64 *cookie, u32 *credit); + +#define MOD "iw_cxgb3: " +#define PDBG(fmt, args...) 
pr_debug(MOD fmt, ## args) + +#ifdef DEBUG +void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag); +void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift); +void cxio_dump_wqe(union t3_wr *wqe); +void cxio_dump_wce(struct t3_cqe *wce); +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents); +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid); +#endif + +#endif From swise at opengridcomputing.com Wed Dec 20 11:23:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:23:26 -0600 Subject: [openib-general] [PATCH v5 11/13] iw_cxgb3 Core Resource Allocation In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192326.19316.22402.stgit@dell3.ogc.int> Core functions to carve up adapter memory, stag, qp, and cq IDs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 331 ++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_resource.h | 70 +++++ 2 files changed, 401 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..d1d8722 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,331 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_resource.h" +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + random_bytes = random32(); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = random32(); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p) +{ + u32 i; + + spin_lock_init(&rdev_p->rscp->qpid_fifo_lock); + +
rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), + GFP_KERNEL, + &rdev_p->rscp->qpid_fifo_lock); + if (IS_ERR(rdev_p->rscp->qpid_fifo)) + return -ENOMEM; + + for (i = 16; i < T3_MAX_NUM_QP; i++) + if (!(i & rdev_p->qpmask)) + __kfifo_put(rdev_p->rscp->qpid_fifo, + (unsigned char *) &i, sizeof(u32)); + return 0; +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be power of 2 */ +int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) + return -ENOMEM; + rdev_p->rscp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_qpid_fifo(rdev_p); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo emptry */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void cxio_hal_put_rhdl(u32 rhdl) +{ + 
cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} + +/* + * PBL Memory Manager. Uses Linux generic allocator. 
+ */ + +#define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ +#define PBL_CHUNK 2*1024*1024 + +u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size); + return (u32)addr; +} + +void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size); + gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size); +} + +int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); + if (rdev_p->pbl_pool) + for (i = rdev_p->rnic_info.pbl_base; + i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; + i += PBL_CHUNK) + gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); + return rdev_p->pbl_pool ? 0 : -ENOMEM; +} + +void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->pbl_pool); +} + +/* + * RQT Memory Manager. Uses Linux generic allocator. + */ + +#define MIN_RQT_SHIFT 10 /* 1KB == min RQT size (16 entries) */ +#define RQT_CHUNK 2*1024*1024 + +u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6); + return (u32)addr; +} + +void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6); + gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6); +} + +int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1); + if (rdev_p->rqt_pool) + for (i = rdev_p->rnic_info.rqt_base; + i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; + i += RQT_CHUNK) + gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1); + return rdev_p->rqt_pool ?
0 : -ENOMEM; +} + +void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->rqt_pool); +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h new file mode 100644 index 0000000..a6bbe83 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_RESOURCE_H__ +#define __CXIO_RESOURCE_H__ + +#include +#include +#include +#include +#include +#include +#include +#include "cxio_hal.h" + +extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl); +extern void cxio_hal_destroy_rhdl_resource(void); +extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, + u32 nr_pdid); +extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag); +extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid); +extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid); +extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp); + +#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base ) +extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); + +#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base ) +extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); +#endif From swise at opengridcomputing.com Wed Dec 20 11:23:56 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:23:56 -0600 Subject: [openib-general] [PATCH v5 12/13] iw_cxgb3 Core Debug functions In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: 
<20061220192356.19316.82880.stgit@dell3.ogc.int> Debug code to dump various data structs, some of which are in adapter memory. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 205 +++++++++++++++++++++++++++ 1 files changed, 205 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..dfaa704 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,205 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = pbl_addr; + m->len = size; + PDBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + __be64 *data = (__be64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + if (size == 0) + size = 8; + while (size > 0) { + PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) 
+{ + __be64 *data = (__be64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +#endif From swise at opengridcomputing.com Wed Dec 20 11:24:26 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 20 Dec 2006 13:24:26 -0600 Subject: [openib-general] [PATCH v5 13/13] iw_cxgb3 Kconfig/Makefile In-Reply-To: 
<20061220191754.19316.4914.stgit@dell3.ogc.int> References: <20061220191754.19316.4914.stgit@dell3.ogc.int> Message-ID: <20061220192426.19316.34290.stgit@dell3.ogc.int> Signed-off-by: Steve Wise --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/cxgb3/Kconfig | 27 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/Makefile | 12 ++++++++++++ 4 files changed, 41 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 59b3932..06453ab 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" +source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index 570b30a..69bdd55 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mt obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ +obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig new file mode 100644 index 0000000..d3db264 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Kconfig @@ -0,0 +1,27 @@ +config INFINIBAND_CXGB3 + tristate "Chelsio RDMA Driver" + depends on CHELSIO_T3 && INFINIBAND + select GENERIC_ALLOCATOR + ---help--- + This is an iWARP/RDMA driver for the Chelsio T3 1GbE and + 10GbE adapters. + + For general information about Chelsio and our products, visit + our website at . 
+ + For customer support, please visit our customer support page at + . + + Please send feedback to . + + To compile this driver as a module, choose M here: the module + will be called iw_cxgb3. + +config INFINIBAND_CXGB3_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_CXGB3 + default n + ---help--- + This option causes the Chelsio RDMA driver to produce copious + amounts of debug messages. Select this if you are developing + the driver or trying to diagnose a problem. diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile new file mode 100644 index 0000000..7a89f6d --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Makefile @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core + +obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o + +iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \ + iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o + +ifdef CONFIG_INFINIBAND_CXGB3_DEBUG +EXTRA_CFLAGS += -DDEBUG -g +iw_cxgb3-y += core/cxio_dbg.o +endif From vuhuong at mellanox.com Wed Dec 20 13:23:25 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 20 Dec 2006 13:23:25 -0800 Subject: [openib-general] opensm In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com> References: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com> Message-ID: <4589A9CD.7070502@mellanox.com> Hi Ashish, > Hi, > Please see the information below > > This is what I did: > /etc/init.d/openibd start > /etc/init.d/opensmd start > modprobe ib_srp > > Issued the command /usr/local/ofed/sbin/ibsrpdm -c to get the > information about target and used them in > By default without -d option, ibsrpdm will use /dev/infiniband/umad0 -- with corresponding to port 1 of mthca0 > echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, > > dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b8114 > 6a1 > 
/sys/class/infiniband_srp/srp-mthca0-1/add_target Using srp-mthca0-1 here is correct; however, in your previous email, in which you reported *I am seeing the error "Got failed path rec status -110" on Linux console*, I found this: echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target There you used port 2 of mthca0, i.e. srp-mthca0-2; therefore, you got the path record failure. Please retry: 0. Make sure you connect port 1 of the host HCA to the target (since you connect them directly. Port 2 works as well, but then you have to use umad1 and srp-mthca0-2 for steps 1 and 2 below) 1. ibsrpdm -c -d /dev/infiniband/umad0 2. echo whatever target ibsrpdm discovers to srp-mthca0-1 -vu > > Yes, earlier I had a SilverStorm switch which was running the SM, but now I > have taken that out and am directly connecting the target and host. > > I have only one port connected between the host and the target. > The reason the link is not stable is that I am restarting and > stopping again and again, as this does not seem to be working, and I did > not know the issue until I looked at the console log, which was > indicating "Got failed path rec status -110"; after seeing that I > searched on Google and found > "https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html"; > it seems to be a bug with 64-bit machines. > BTW, my Linux server is 64-bit. > When I hooked up a 32-bit server running OFED-1.1, I saw my target > discovered with the same procedure. > > So, the whole question is: what is the fix for the issue "Got failed path > rec status -110" on a 64-bit machine? 
> > Thanks > Ashish > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 19, 2006 10:35 PM > To: Batwara, Ashish > Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org > Subject: RE: [openib-general] opensm > > On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote: >> Hi, >> Please look towards the end of the attached file. > > What options are you starting opensm with ? What is the command line ? > > Also, it looks like (at least at one point) you have another SM on the > subnet. What is the make (vendor) for your switch ? > > I see many SM port is DOWN. What is going on with this port ? Why is the > physical link not LinkUp and stable ? That is the main issue and is > likely why the SubnGet of NodeInfo is not being responded to. > > -- Hal > >> Thanks >> Ashish >> >> -----Original Message----- >> From: Hal Rosenstock [mailto:halr at voltaire.com] >> Sent: Tuesday, December 19, 2006 5:06 PM >> To: Batwara, Ashish >> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org >> Subject: Re: [openib-general] opensm >> >> Ashish, >> >> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote: >>> Hi, >>> >>> Here is the info that you have asked. I am seeing the Subnet manager >>> is up now having the port active. But server is not able to discover >>> the target. I am seeing the error "Got failed path rec status -110" > on >>> Linux console. >> That means the request for an SA PathRecord from the initiator to the >> target failed (-110 is ETIMEDOUT). Are you sure the target is up >> (ACTIVE) on the subnet ? If it is, can you send the opensm log ? >> >> -- Hal >> >>> Below are the output of different commands. 
I am using following to >>> discover the target: >>> >>> >>> >>> /etc/init.d/opensmd start >>> >>> /etc/init.d/openibd start >>> >>> modprobe ib_srp >>> >>> echo >>> > id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000 >> 002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > >> /sys/class/infiniband_srp/srp-mthca0-2/add_target >>> >>> >>> >>> >>> [root at p49 ~]# ibv_devinfo >>> >>> hca_id: mthca0 >>> >>> fw_ver: 5.1.400 >>> >>> node_guid: 0002:c902:0022:cce0 >>> >>> sys_image_guid: 0002:c902:0022:cce3 >>> >>> vendor_id: 0x02c9 >>> >>> vendor_part_id: 25218 >>> >>> hw_ver: 0xA0 >>> >>> board_id: MT_0370130002 >>> >>> phys_port_cnt: 2 >>> >>> port: 1 >>> >>> state: PORT_DOWN (1) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 512 (2) >>> >>> sm_lid: 0 >>> >>> port_lid: 0 >>> >>> port_lmc: 0x00 >>> >>> >>> >>> port: 2 >>> >>> state: PORT_ACTIVE (4) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 2048 (4) >>> >>> sm_lid: 1 >>> >>> port_lid: 1 >>> >>> port_lmc: 0x00 >>> hca_id: mthca1 >>> >>> fw_ver: 5.1.400 >>> >>> node_guid: 0002:c902:0022:cd2c >>> >>> sys_image_guid: 0002:c902:0022:cd2f >>> >>> vendor_id: 0x02c9 >>> >>> vendor_part_id: 25218 >>> >>> hw_ver: 0xA0 >>> >>> board_id: MT_0370130002 >>> >>> phys_port_cnt: 2 >>> >>> port: 1 >>> >>> state: PORT_DOWN (1) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 512 (2) >>> >>> sm_lid: 0 >>> >>> port_lid: 0 >>> >>> port_lmc: 0x00 >>> >>> >>> >>> port: 2 >>> >>> state: PORT_DOWN (1) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 512 (2) >>> >>> sm_lid: 0 >>> >>> port_lid: 0 >>> >>> port_lmc: 0x00 >>> >>> >>> >>> >>> >>> [root at p49 ~]# uname -a >>> >>> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 >>> EDT 2006 x86_64 x86_64 x86_64 GNU/Linux >>> >>> >>> >>> [root at p49 ~]# cat /etc/infiniband/info >>> >>> #!/bin/bash >>> >>> >>> >>> echo prefix=/usr/local/ofed >>> >>> echo Kernel=2.6.9-42.0.3.ELsmp >>> >>> echo >>> >>> echo "Configure options: --with-dapl --with-ipoibtools > 
--with-libibcm >>> --with-libibcommon --with-libibmad --with-libibumad > --with-libibverbs >>> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm >>> --with-libsdp --with-openib-diags --with-srptools --with-mstflint >>> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod >>> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod >>> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod" >>> >>> echo >>> >>> >>> >>> OFED Version: OFED-1.1 >> >> >>> Thanks >>> >>> Ashish >>> >>> -----Original Message----- >>> From: Eitan Zahavi [mailto:eitan at mellanox.co.il] >>> Sent: Tuesday, December 19, 2006 5:18 AM >>> To: Batwara, Ashish >>> Cc: ishai at mellanox.co.il; openib-general at openib.org >>> Subject: Re: [openib-general] opensm >>> >>> >>> >>> Hi Ashish, >>> >>> >>> >>> SRP people say they have no such error message. >>> >>> OpenSM does. So I take it back. >>> >>> >>> >>> Ashish, >>> >>> Please provide more into: >>> >>> >>> >>> 1. ibv_devinfo >>> >>> 2. Version of code you are using >>> >>> 3. Command line you use for starting opensm >>> >>> 4. /var/log/osm.log >>> >>> >>> >>> Thanks and sorry for the confusion. >>> >>> >>> >>> EZ >>> >>> >>> >>> Eitan Zahavi wrote: >>> >>>> This is not an OpenSM issue. >>>> Forwarded to the SRP people. >>>> EZ >>>> Batwara, Ashish wrote: >>>> >>>>> Hi, >>>>> I am trying to run opensm on Linux server. It has two HCAs >>> (4-ports) and >>> >>>>> connected to IB Switch. ibnodes command displays the information >>> about >>> >>>>> the Switch ports and HCA ports. >>>>> When I start opensm, I see in /var/log/messages "Starting >>> srp_daemon" >>> >>>>> for all the 4 ports and immediately after I see "failed > srp_daemon" >>> for >>> >>>>> all the ports and the displays "SM Port is down". >>>>> I tried several times and even rebooted the server few times but > no >>>>> luck. >>>>> Does anybody know what this problem is? 
>>>>> Thanks >>>>> Ashish >>>>> _______________________________________________ >>>>> openib-general mailing list >>>>> openib-general at openib.org >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general From eeb at bartonsoftware.com Wed Dec 20 14:22:13 2006 From: eeb at bartonsoftware.com (Eric Barton) Date: Wed, 20 Dec 2006 22:22:13 GMT Subject: [openib-general] IB_CM_REJ_INVALID_SERVICE_ID Message-ID: <200612202222.kBKMMDeY020463@robert.bartonsoftware.com> Can an rdma_connect be rejected with IB_CM_REJ_INVALID_SERVICE_ID for any other reason than the peer isn't listening with the correct service number? 
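As a side note, here is a minimal sketch of how the numeric reject reason in a log line like the one quoted below could be pulled out and named. It is an illustration only, not code from Lustre or OFED: the helper names `parse_rej_reason` and `rej_reason_name` and the local enum `sketch_rej_reason` are assumptions made up for this example, and only the value 8 is taken from the report in this thread.

```c
#include <stdio.h>
#include <string.h>

/*
 * Local stand-in for the OFED enum quoted in the bug report. Only the
 * value 8 comes from this thread; this is NOT the real ib_cm.h header.
 */
enum sketch_rej_reason {
	SKETCH_REJ_INVALID_SERVICE_ID = 8,
};

/*
 * Hypothetical helper: find "reason <n>" in a log line and return n,
 * or -1 if the line does not contain such a field.
 */
static int parse_rej_reason(const char *line)
{
	const char *p = strstr(line, "reason ");
	int reason;

	if (!p || sscanf(p, "reason %d", &reason) != 1)
		return -1;
	return reason;
}

/* Hypothetical helper: map the numeric reason to a readable name. */
static const char *rej_reason_name(int reason)
{
	switch (reason) {
	case SKETCH_REJ_INVALID_SERVICE_ID:
		return "IB_CM_REJ_INVALID_SERVICE_ID";
	default:
		return "other (not decoded in this sketch)";
	}
}
```

Calling `rej_reason_name(parse_rej_reason("rejected: reason 8, size 148"))` would give the service-ID name; any other reason falls through to the default branch, since the remaining enum values are not stated in this thread.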
I've had the following bug report... > We are testing 1.6b5 for an InfiniBand cluster with RHEL 4. We use the > binaries provided by CFS and use OFED 1.1 as the IB stack. > > Several times some of the clients hang during fs mount or when an OST > is added (see log). > Error: > LustreError: 1776:0:(o2iblnd_cb.c:2314:kiblnd_rejected()) 10.0.90.8 at o2ib > rejected: reason 8, size 148 > > from OFED: > enum ib_cm_rej_reason { > IB_CM_REJ_INVALID_SERVICE_ID = 8, > > Once an IPoIB ping is started to the corresponding OST the client > continues. Afterwards it is quite stable. ...which seems to be saying that just doing an IPoIB ping to the server was enough to make rdma_connect() work OK. -- Cheers, Eric From halr at voltaire.com Wed Dec 20 14:59:52 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 20 Dec 2006 17:59:52 -0500 Subject: [openib-general] [query]requirement of 'process_mad' in the HCA driver In-Reply-To: <309a667c0612180017g44d9be7dn9cb00dffaa081dd3@mail.gmail.com> References: <2875.47466.qm@web8317.mail.in.yahoo.com> <1166104604.28709.126501.camel@hal.voltaire.com> <309a667c0612180017g44d9be7dn9cb00dffaa081dd3@mail.gmail.com> Message-ID: <1166655590.4519.70241.camel@hal.voltaire.com> On Mon, 2006-12-18 at 03:17, Devesh Sharma wrote: > Along similar lines, I am confused about the MAD agent creation: > there is a function in mad.c, ib_agent_port_open(), which creates > _send_only_ SMAs for GSI and SMI per port. > > There is a function in mthca_mad.c, mthca_create_agents(), which is > _again_ creating two send-only MAD agents for SMI and GSI. > > Why is this driver-specific agent creation required? Those agents handle the locally generated traps for the mthca (to be sent up to the SM). 
-- Hal > On 14 Dec 2006 08:57:11 -0500, Hal Rosenstock wrote: > > On Wed, 2006-12-13 at 22:49, keshetti mahesh wrote: > > > Thanks for your reply, > > > > > > >The driver is needed to obtain the information for the IB node to fill > > > >in the MADs for response to the SMA query. It may also issue some traps. > > > >Similarly for PMA as well. > > > > > > Do you mean to say that the HCA driver is needed to pass the HCA-related > > > information (like GID, GUID, port_info, etc.) to the SMA so that it > > > can reply to query (or GET) MADs? > > > > Yes. > > > > > Isn't the SMA capable of doing the same by using the "query_(gid, pkey, > > > port)" verbs? > > > > One reason I can think of is that not all the needed information is > > available via verbs. I think there are some others as well. > > > > > And a final question: if it is really required to implement > > > 'process_mad' in the HCA driver, then why is it not specified in the IB > > > specifications? > > > > IB spec is architecture not implementation. > > > > > Whose duty is this (replying to query MADs) according to the IB > > > specs? (It is the SMA's duty, right?) > > > > Depends on the MAD but if you are referring to the SMA queries, then yes > > it is the SMA's responsibility. > > > > > I have observed that process_mad is not implemented in IBM's eHCA > > > driver. What is the case with it? > > > > With eHCA, QP0 is not exposed to the host (at least currently) and the > > SMA is totally implemented in firmware. > > > > > PS: I am considering only the SMA in the host s/w here. > > > > This is a design choice. > > > > -- Hal > > > > > regards, > > > K.Mahesh. > > > > > > Hal Rosenstock wrote: > > > On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote: > > > > Hello all, > > > > > > > > I want to know from you people whether it is necessary to implement > > > > process_mad for an HCA. 
> > > > > > > > After looking into the implementations of process_mad in the ipath and > > > > mthca drivers I have found that they are used to reply to MADs with > > > > port_info, gid_info, sm_info, etc. > > > > > > > > But isn't that handled by the SMA in the host? > > > > > > The SMA can either be in the host or in firmware (as is typical with the > > > Mellanox silicon). > > > > > > > I am a little bit confused now. > > > > Please just say whether it is required to implement process_mad (suppose) > > > > for a new HCA driver. > > > > > > It is. For an example of a host (software SMA), see > > > drivers/infiniband/hw/ipath/ipath_mad.c > > > > > > > If it is required, why? > > > > > > The driver is needed to obtain the information for the IB node to fill > > > in the MADs for response to the SMA query. It may also issue some traps. > > > Similarly for PMA as well. > > > > > > -- Hal > > > > > > > Please CC your replies to me. > > > > > > > > regards, > > > > K.Mahesh. From dotanb at dev.mellanox.co.il Wed Dec 20 22:30:57 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 21 Dec 2006 08:30:57 +0200 Subject: [openib-general] RDMA to shared memory causing corruption In-Reply-To: <2cfcf21e0612200846t41231b45qec26d6f9f9a01a8@mail.gmail.com> References: <2cfcf21e0612200846t41231b45qec26d6f9f9a01a8@mail.gmail.com> Message-ID: <458A2A21.10709@dev.mellanox.co.il> Hi Steven. Steven Wooding wrote: > Hi, > > I need some advice on a problem I've got RDMAing some data into a > shared memory segment. > > Everything works great until I try to transfer a message of 294Kbytes > or larger in size. There is some management info in the top end of the > shared memory segment (we're using Boost shm library). This management > area gets corrupted after the RDMA transfer has occurred. > > I've tried various things to try and debug this. Allocating more > memory than I need from the shared memory segment for the landing > buffer. Making the whole shared memory segment larger, and making the > management area smaller. But always I'm hit by this 294K limit. I > don't know whether it's a problem with Boost shmem or with RDMA > writing to memory areas that it shouldn't. What is the problem that you are facing? Failure in memory registration? Completion with error? Which driver are you using? 
thanks Dotan From aviram at dev.mellanox.co.il Thu Dec 21 04:36:58 2006 From: aviram at dev.mellanox.co.il (Aviram Gutman) Date: Thu, 21 Dec 2006 14:36:58 +0200 Subject: [openib-general] iSER target In-Reply-To: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com> References: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com> Message-ID: <458A7FEA.7070707@dev.mellanox.co.il> Are you planning to have the iSER target over verbs or kDAPL? Isn't the kDAPL development halted? Aviram Dan Bar Dov wrote: > The iser target code in the gen2 branch is functional > over kdapl. It requires an iscsi target code above it, > however such an iscsi code is not open. > > It was opened as a precursor for an open-source iscsi/iser-target > project. That project is still in its early stages, and the plan is > to add iser-target support, loosly based on the open-iser-target > code, to the stgt project. > > Due to the above, there is no readme/installation guide. > > Dan > > >> -----Original Message----- >> From: openib-general-bounces at openib.org >> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal >> Sent: Wednesday, December 20, 2006 4:03 AM >> To: openib-general at openib.org >> Subject: [openib-general] iSER target >> >> Hi, >> >> I would like to confirm if the iSER target code in the gen2 branch >> is functional. If yes, is there a readme/installation guide >> available... >> >> Thanks a lot! 
>> >> Vishal From halr at voltaire.com Thu Dec 21 06:08:11 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 09:08:11 -0500 Subject: [openib-general] OpenSM/osm_ucast_mgr.c: In osm_ucast_mgr_set_fwd_table, always reset port state change when set Message-ID: <1166710089.4519.112824.camel@hal.voltaire.com> OpenSM/osm_ucast_mgr.c: In osm_ucast_mgr_set_fwd_table, always reset port state change when set Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index f663d2d..f546c5f 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -922,7 +922,7 @@ osm_ucast_mgr_set_fwd_table( else life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8; - if (life_state != si.life_state) + if ( (life_state != si.life_state) || ib_switch_info_get_state_change( &si ) ) { set_swinfo_require = TRUE; si.life_state = life_state; From monis at voltaire.com Thu Dec 21 06:43:10 2006 From: monis at voltaire.com (Moni Shoua) Date: Thu, 21 Dec 2006 16:43:10 +0200 Subject: [openib-general] [PATCH v3] IB_mthca HCA profile module parameters In-Reply-To: References: <457BF221.8080701@voltaire.com> Message-ID: <458A9D7E.9080801@voltaire.com> Roland Dreier wrote: > OK, the patch below is what I ended up committing. 
I am really not > pleased with the patch you sent and expected me to include -- there > are really obvious simple-to-fix things that it's just ridiculous for > you to be sending, eg: > > > +MODULE_PARM_DESC(num_mpt, > > trailing whitespace -- please check that your patch applies with 'git > apply --check --whitespace=error-all' > > > + "maximum number of memory protection pable entries per HCA"); > > umm, 'pable'?? > > and plenty of other things... > > For some reason I felt guilty about letting this patch hang for so > long, and so I fixed it up, but after doing it this time, I'm not > going to spend my time like that again. I have plenty of work to do > without cleaning up other people's messes... > > IB/mthca: Add HCA profile module parameters > > Add module parameters that enable settting some of the HCA > profile values, such as the number of QPs, CQs, etc. > > Signed-off-by: Leonid Arsh > Signed-off-by: Moni Shoua > Signed-off-by: Roland Dreier > > diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c > index 0491ec7..711c1b8 100644 > --- a/drivers/infiniband/hw/mthca/mthca_main.c > +++ b/drivers/infiniband/hw/mthca/mthca_main.c > @@ -82,22 +82,59 @@ MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if n > > struct mutex mthca_device_mutex; > > +#define MTHCA_DEFAULT_NUM_QP (1 << 16) > +#define MTHCA_DEFAULT_RDB_PER_QP (1 << 2) > +#define MTHCA_DEFAULT_NUM_CQ (1 << 16) > +#define MTHCA_DEFAULT_NUM_MCG (1 << 13) > +#define MTHCA_DEFAULT_NUM_MPT (1 << 17) > +#define MTHCA_DEFAULT_NUM_MTT (1 << 20) > +#define MTHCA_DEFAULT_NUM_UDAV (1 << 15) > +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18) > +#define MTHCA_DEFAULT_NUM_UARC_SIZE (1 << 18) > + > +static struct mthca_profile hca_profile = { > + .num_qp = MTHCA_DEFAULT_NUM_QP, > + .rdb_per_qp = MTHCA_DEFAULT_RDB_PER_QP, > + .num_cq = MTHCA_DEFAULT_NUM_CQ, > + .num_mcg = MTHCA_DEFAULT_NUM_MCG, > + .num_mpt = MTHCA_DEFAULT_NUM_MPT, > + .num_mtt = 
MTHCA_DEFAULT_NUM_MTT, > + .num_udav = MTHCA_DEFAULT_NUM_UDAV, /* Tavor only */ > + .fmr_reserved_mtts = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */ > + .uarc_size = MTHCA_DEFAULT_NUM_UARC_SIZE, /* Arbel only */ > +}; > + > +module_param_named(num_qp, hca_profile.num_qp, int, 0444); > +MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA"); > + > +module_param_named(rdb_per_qp, hca_profile.rdb_per_qp, int, 0444); > +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); > + > +module_param_named(num_cq, hca_profile.num_cq, int, 0444); > +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); > + > +module_param_named(num_mcg, hca_profile.num_mcg, int, 0444); > +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); > + > +module_param_named(num_mpt, hca_profile.num_mpt, int, 0444); > +MODULE_PARM_DESC(num_mpt, > + "maximum number of memory protection table entries per HCA"); > + > +module_param_named(num_mtt, hca_profile.num_mtt, int, 0444); > +MODULE_PARM_DESC(num_mtt, > + "maximum number of memory translation table segments per HCA"); > + > +module_param_named(num_udav, hca_profile.num_udav, int, 0444); > +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA"); > + > +module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444); > +MODULE_PARM_DESC(fmr_reserved_mtts, > + "number of memory translation table segments reserved for FMR"); > + > static const char mthca_version[] __devinitdata = > DRV_NAME ": Mellanox InfiniBand HCA driver v" > DRV_VERSION " (" DRV_RELDATE ")\n"; > > -static struct mthca_profile default_profile = { > - .num_qp = 1 << 16, > - .rdb_per_qp = 4, > - .num_cq = 1 << 16, > - .num_mcg = 1 << 13, > - .num_mpt = 1 << 17, > - .num_mtt = 1 << 20, > - .num_udav = 1 << 15, /* Tavor only */ > - .fmr_reserved_mtts = 1 << 18, /* Tavor only */ > - .uarc_size = 1 << 18, /* Arbel only */ > -}; > - > static int mthca_tune_pci(struct mthca_dev *mdev) > { > int cap; > @@ -303,7 
+340,7 @@ static int mthca_init_tavor(struct mthca_dev *mdev) > goto err_disable; > } > > - profile = default_profile; > + profile = hca_profile; > profile.num_uar = dev_lim.uar_size / PAGE_SIZE; > profile.uarc_size = 0; > if (mdev->mthca_flags & MTHCA_FLAG_SRQ) > @@ -621,7 +658,7 @@ static int mthca_init_arbel(struct mthca_dev *mdev) > goto err_stop_fw; > } > > - profile = default_profile; > + profile = hca_profile; > profile.num_uar = dev_lim.uar_size / PAGE_SIZE; > profile.num_udav = 0; > if (mdev->mthca_flags & MTHCA_FLAG_SRQ) > @@ -1278,11 +1315,57 @@ static struct pci_driver mthca_driver = { > .remove = __devexit_p(mthca_remove_one) > }; > > +static void __init __mthca_check_profile_val(const char *name, int *pval, > + int pval_default) > +{ > + /* value must be positive and power of 2 */ > + int old_pval = *pval; > + > + if (old_pval <= 0) > + *pval = pval_default; > + else > + *pval = roundup_pow_of_two(old_pval); > + > + if (old_pval != *pval) { > + printk(KERN_WARNING PFX "Invalid value %d for %s in module parameter.\n", > + old_pval, name); > + printk(KERN_WARNING PFX "Corrected %s to %d.\n", name, *pval); > + } > +} > + > +#define mthca_check_profile_val(name, default) \ > + __mthca_check_profile_val(#name, &hca_profile.name, default) > + > +static void __init mthca_validate_profile(void) > +{ > + mthca_check_profile_val(num_qp, MTHCA_DEFAULT_NUM_QP); > + mthca_check_profile_val(rdb_per_qp, MTHCA_DEFAULT_RDB_PER_QP); > + mthca_check_profile_val(num_cq, MTHCA_DEFAULT_NUM_CQ); > + mthca_check_profile_val(num_mcg, MTHCA_DEFAULT_NUM_MCG); > + mthca_check_profile_val(num_mpt, MTHCA_DEFAULT_NUM_MPT); > + mthca_check_profile_val(num_mtt, MTHCA_DEFAULT_NUM_MTT); > + mthca_check_profile_val(num_udav, MTHCA_DEFAULT_NUM_UDAV); > + mthca_check_profile_val(fmr_reserved_mtts, MTHCA_DEFAULT_NUM_RESERVED_MTTS); > + > + if (hca_profile.fmr_reserved_mtts >= hca_profile.num_mtt) { > + printk(KERN_WARNING PFX "Invalid fmr_reserved_mtts module parameter %d.\n", > + 
hca_profile.fmr_reserved_mtts);
> +        printk(KERN_WARNING PFX "(Must be smaller than num_mtt %d)\n",
> +               hca_profile.num_mtt);
> +        hca_profile.fmr_reserved_mtts = hca_profile.num_mtt / 2;
> +        printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n",
> +               hca_profile.fmr_reserved_mtts);
> +    }
> +}
> +
>  static int __init mthca_init(void)
>  {
>      int ret;
>
>      mutex_init(&mthca_device_mutex);
> +
> +    mthca_validate_profile();
> +
>      ret = mthca_catas_init();
>      if (ret)
>          return ret;
>

OK. Roland,
Thanks for your help. I accept the criticism and I hope to submit
better patches next time.

From eitan at mellanox.co.il Thu Dec 21 06:59:45 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 16:59:45 +0200
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <20061220165624.GL31149@sgi.com>
References: <20061220165624.GL31149@sgi.com>
Message-ID: <458AA161.5090708@mellanox.co.il>

Hi Chris,

Sorry for my late response on this:

The simulator is a standalone "server"; clients connect to it
through a TCP/IP socket.

An OpenSM that is not built with the "sim" vendor (using --with-osmv=sim
--with-sim=) will not try to connect to the simulator but will go to the
real IB network instead.

So you need a second "simulator" install of OpenSM.
You can simply clone the GIT tree and
./autogen.sh
./configure --with-osmv=sim --with-sim= --prefix=
make
make install

RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o /bin/opensm

Actually OsmTest is a test that currently fails (due to the last changes in
InformInfo), but any other *.check.tcl/*.sim.tcl pair should work.

Eitan

Chris Elmquist wrote:
> Folks,
>
> I am trying to build and run IBMgtsim so that I can explore some different
> topologies and system sizes. But I am having a lot of trouble getting
> OpenSM to work with the simulator.
>
> I pulled down Eitan's ibutils git tree (to get the simulator) and
> am otherwise using the OFED 1.1 tarball for the rest of the stuff.
> I suspect I have a problem with OpenSM not being built correctly to use > the simulator. > > Does anyone have a recipe on how to build and install all of these pieces > (ie, openib, openSM and ibmgtsim) so that they will work together? > > I have been just trying to run one of the tests provided with the > simulator like this: > > % cd ~/ibutils/ibmgtsim/tests > % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm > > but we get this sort of output: > > -I- Using random seed:43204 > -I- Simulation directory is: /tmp/ibmgtsim.29716 > -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top > o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log > -I- Simulator Ready > -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726 > 5 > -I- Connected to the simulator control server > -I- Defined 51 guids > -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a > 2} > -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009 ... > -I- Waiting for OpenSM subnet up ... 
> -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: > ERR 5422: Unable to find requested CA guid 0x2c90000000009 > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5 > 424: Unable to Open Port 0x2c90000000009 > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: > ERR 3118: Vendor specific bind failed > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10: > SM MAD Controller bind failed (IB_ERROR) > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind > : ERR 1A11: No previous bind > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > Thank you. > > Chris > SGI Network Engineering > From erezz at voltaire.com Thu Dec 21 07:07:58 2006 From: erezz at voltaire.com (Erez Zilber) Date: Thu, 21 Dec 2006 17:07:58 +0200 Subject: [openib-general] iSER target In-Reply-To: <458A7FEA.7070707@dev.mellanox.co.il> References: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com> <458A7FEA.7070707@dev.mellanox.co.il> Message-ID: <458AA34E.60206@voltaire.com> No. We plan to run the iSER target over gen2 verbs. -- ____________________________________________________________ Erez Zilber | 972-9-971-7689 Software Engineer, Storage Team Voltaire – _The Grid Backbone_ __ www.voltaire.com Aviram Gutman wrote: > Are you planning to have the iSER target over verbs or kDAPL? Isn't the > kDAPL development halted? > > Aviram > > Dan Bar Dov wrote: > >> The iser target code in the gen2 branch is functional >> over kdapl. It requires an iscsi target code above it, >> however such an iscsi code is not open. >> >> It was opened as a precursor for an open-source iscsi/iser-target >> project. 
That project is still in its early stages, and the plan is
>> to add iser-target support, loosely based on the open-iser-target
>> code, to the stgt project.
>>
>> Due to the above, there is no readme/installation guide.
>>
>> Dan
>>
>>> -----Original Message-----
>>> From: openib-general-bounces at openib.org
>>> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal
>>> Sent: Wednesday, December 20, 2006 4:03 AM
>>> To: openib-general at openib.org
>>> Subject: [openib-general] iSER target
>>>
>>> Hi,
>>>
>>> I would like to confirm if the iSER target code in the gen2 branch
>>> is functional. If yes, is there a readme/installation guide
>>> available...
>>>
>>> Thanks a lot!
>>>
>>> Vishal
>>>
>>> _______________________________________________
>>> openib-general mailing list
>>> openib-general at openib.org
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>

From halr at voltaire.com Thu Dec 21 07:11:20 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 10:11:20 -0500
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <458AA161.5090708@mellanox.co.il> References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il> Message-ID: <1166713879.4519.115782.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-12-21 at 09:59, Eitan Zahavi wrote: > Hi Chris, > > Sorry for my late response on this: > > The simulator is a standalone "server" where clients connect to it > through a TCP/IP socket. > > OpenSM which is not built with "sim" vendor (using --with-osmv=sim > --with-sim=) > will not try to connect to the simulator but will go to the real IB > network instead. > > So you need a second "simulator" install of OpenSM. > You can simply clone the GIT tree and > ./autogen.sh > ./configure --with-osmv=sim --with-sim= install> --prefix= > make > make install > > RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o > /bin/opensm You might want to put this info up on the wiki. -- Hal > Actually OsmTest is a test that currently fail (due to last changes in > InformInfo), > but any other *.check.tcl/*.sim.tcl pair should work. > > Eitan > > > Chris Elmquist wrote: > > Folks, > > > > I am trying to build and run IBMgtsim so that I can explore some different > > topologies and system sizes. But I am having a lot of trouble getting > > OpenSM to work with the simulator. > > > > I pulled down Eitan's ibutils git tree (to get the simulator) and > > am otherwise using the OFED 1.1 tarball for the rest of the stuff. > > I suspect I have a problem with OpenSM not being built correctly to use > > the simulator. > > > > Does anyone have a recipe on how to build and install all of these pieces > > (ie, openib, openSM and ibmgtsim) so that they will work together? 
> > > > I have been just trying to run one of the tests provided with the > > simulator like this: > > > > % cd ~/ibutils/ibmgtsim/tests > > % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm > > > > but we get this sort of output: > > > > -I- Using random seed:43204 > > -I- Simulation directory is: /tmp/ibmgtsim.29716 > > -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top > > o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log > > -I- Simulator Ready > > -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726 > > 5 > > -I- Connected to the simulator control server > > -I- Defined 51 guids > > -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a > > 2} > > -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009 ... > > -I- Waiting for OpenSM subnet up ... > > -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: > > ERR 5422: Unable to find requested CA guid 0x2c90000000009 > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5 > > 424: Unable to Open Port 0x2c90000000009 > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: > > ERR 3118: Vendor specific bind failed > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10: > > SM MAD Controller bind failed (IB_ERROR) > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind > > : ERR 1A11: No previous bind > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > > > Thank you. 
> > > > Chris > > SGI Network Engineering > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Thu Dec 21 07:29:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 10:29:13 -0500 Subject: [openib-general] building and running IBMgtsim? In-Reply-To: <458AA161.5090708@mellanox.co.il> References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il> Message-ID: <1166714952.4519.116610.camel@hal.voltaire.com> On Thu, 2006-12-21 at 09:59, Eitan Zahavi wrote: > Hi Chris, > > Sorry for my late response on this: > > The simulator is a standalone "server" where clients connect to it > through a TCP/IP socket. > > OpenSM which is not built with "sim" vendor (using --with-osmv=sim > --with-sim=) > will not try to connect to the simulator but will go to the real IB > network instead. > > So you need a second "simulator" install of OpenSM. > You can simply clone the GIT tree and > ./autogen.sh > ./configure --with-osmv=sim --with-sim= install> --prefix= > make > make install > > RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o > /bin/opensm > > Actually OsmTest is a test that currently fail (due to last changes in > InformInfo), This could easily be worked around by commenting out those tests in osmtest.c. -- Hal > but any other *.check.tcl/*.sim.tcl pair should work. > > Eitan > > > Chris Elmquist wrote: > > Folks, > > > > I am trying to build and run IBMgtsim so that I can explore some different > > topologies and system sizes. But I am having a lot of trouble getting > > OpenSM to work with the simulator. > > > > I pulled down Eitan's ibutils git tree (to get the simulator) and > > am otherwise using the OFED 1.1 tarball for the rest of the stuff. 
> > I suspect I have a problem with OpenSM not being built correctly to use > > the simulator. > > > > Does anyone have a recipe on how to build and install all of these pieces > > (ie, openib, openSM and ibmgtsim) so that they will work together? > > > > I have been just trying to run one of the tests provided with the > > simulator like this: > > > > % cd ~/ibutils/ibmgtsim/tests > > % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm > > > > but we get this sort of output: > > > > -I- Using random seed:43204 > > -I- Simulation directory is: /tmp/ibmgtsim.29716 > > -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top > > o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log > > -I- Simulator Ready > > -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726 > > 5 > > -I- Connected to the simulator control server > > -I- Defined 51 guids > > -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a > > 2} > > -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009 ... > > -I- Waiting for OpenSM subnet up ... 
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: > > ERR 5422: Unable to find requested CA guid 0x2c90000000009 > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5 > > 424: Unable to Open Port 0x2c90000000009 > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: > > ERR 3118: Vendor specific bind failed > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10: > > SM MAD Controller bind failed (IB_ERROR) > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind > > : ERR 1A11: No previous bind > > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > > > Thank you. > > > > Chris > > SGI Network Engineering > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From chrise at sgi.com Thu Dec 21 09:02:19 2006 From: chrise at sgi.com (Chris Elmquist) Date: Thu, 21 Dec 2006 11:02:19 -0600 Subject: [openib-general] building and running IBMgtsim? In-Reply-To: <458AA161.5090708@mellanox.co.il> References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il> Message-ID: <20061221170219.GH19625@sgi.com> Hi Guys... Thank you very much for the recipe. We actually had a success getting it to go just after posting to the list but these instructions will now confirm whether we did it the right way or not. Are there any guidelines for how big of a network the simulator can deal with? Maybe something that relates it to available memory on the platform it is running or other resource issues? 
We threw one model at it already which tipped it over, but we are
certainly not sure we are using it the right way yet.

Thanks again. We hope to be active participants in this space going
forward and as soon as we know what we are doing, we'll feed it back to
the group.

Chris

On Thursday (12/21/2006 at 04:59PM +0200), Eitan Zahavi wrote:
> Hi Chris,
>
> Sorry for my late response on this:
>
> The simulator is a standalone "server" where clients connect to it
> through a TCP/IP socket.
>
> OpenSM which is not built with "sim" vendor (using --with-osmv=sim
> --with-sim=)
> will not try to connect to the simulator but will go to the real IB
> network instead.
>
> So you need a second "simulator" install of OpenSM.
> You can simply clone the GIT tree and
> ./autogen.sh
> ./configure --with-osmv=sim --with-sim= install> --prefix=
> make
> make install
>
> RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o
> /bin/opensm
>
> Actually OsmTest is a test that currently fail (due to last changes in
> InformInfo),
> but any other *.check.tcl/*.sim.tcl pair should work.
>
> Eitan
>
>
> Chris Elmquist wrote:
> >Folks,
> >
> >I am trying to build and run IBMgtsim so that I can explore some different
> >topologies and system sizes. But I am having a lot of trouble getting
> >OpenSM to work with the simulator.
> >
> >I pulled down Eitan's ibutils git tree (to get the simulator) and
> >am otherwise using the OFED 1.1 tarball for the rest of the stuff.
> >I suspect I have a problem with OpenSM not being built correctly to use
> >the simulator.
> >
> >Does anyone have a recipe on how to build and install all of these pieces
> >(ie, openib, openSM and ibmgtsim) so that they will work together?
> > > >I have been just trying to run one of the tests provided with the > >simulator like this: > > > >% cd ~/ibutils/ibmgtsim/tests > >% RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o > >/usr/local/bin/opensm > > > >but we get this sort of output: > > > >-I- Using random seed:43204 > >-I- Simulation directory is: /tmp/ibmgtsim.29716 > >-I- Calling IBMgtSim -s 43204 -V 0xA3 -t > >/root/ibutils/ibmgtsim/tests/IS1-16.top > >o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l > >/tmp/ibmgtsim.29716/sim.log > >-I- Simulator Ready > >-I- Connecting to the simulator control server:pcplod.americas.sgi.com > >port:3726 > >5 > >-I- Connected to the simulator control server > >-I- Defined 51 guids > >-I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} > >{0x0002c9000000000a > > 2} > >-I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009 ... > >-I- Waiting for OpenSM subnet up ... > >-I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> > >osm_vendor_open_port: ERR 5422: Unable to find requested CA guid > >0x2c90000000009 > >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log > >-I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: > >ERR 5 > >424: Unable to Open Port 0x2c90000000009 > >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log > >-I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> > >osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed > >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log > >-I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR > >2E10: > > SM MAD Controller bind failed (IB_ERROR) > >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log > >-I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> > >osm_sa_mad_ctrl_unbind > >: ERR 1A11: No previous bind > >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log > > > >Thank you. > > > >Chris > >SGI Network Engineering > > -- Chris Elmquist mailto:chrise at sgi.com (651)683-3093 Silicon Graphics, Inc. 
Eagan, MN From eitan at mellanox.co.il Thu Dec 21 11:09:24 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 21 Dec 2006 21:09:24 +0200 Subject: [openib-general] building and running IBMgtsim? In-Reply-To: <20061221170219.GH19625@sgi.com> References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il> <20061221170219.GH19625@sgi.com> Message-ID: <458ADBE4.708@mellanox.co.il> Chris Elmquist wrote: > Hi Guys... > > Thank you very much for the recipe. We actually had a success getting > it to go just after posting to the list but these instructions will now > confirm whether we did it the right way or not. > > Are there any guidelines for how big of a network the simulator can > deal with? Maybe something that relates it to available memory on the > platform it is running or other resource issues? We threw one model > at it already which tipped it over but we are certainly not sure we are > using it the right way yet. > I was able to simulate 10K nodes in the past. What I did to get there was to use two machines: one for the simulator and one for the SM. I also used 64bit (x86_64) machines to avoid the ~3GB data limit. > Thanks again. We hope to be activate participants in this space going > forward and as soon as we know what we are doing, we'll feed it back to > the group. > > Chris > > On Thursday (12/21/2006 at 04:59PM +0200), Eitan Zahavi wrote: > >> Hi Chris, >> >> Sorry for my late response on this: >> >> The simulator is a standalone "server" where clients connect to it >> through a TCP/IP socket. >> >> OpenSM which is not built with "sim" vendor (using --with-osmv=sim >> --with-sim=) >> will not try to connect to the simulator but will go to the real IB >> network instead. >> >> So you need a second "simulator" install of OpenSM. 
>> You can simply clone the GIT tree and >> ./autogen.sh >> ./configure --with-osmv=sim --with-sim=> install> --prefix= >> make >> make install >> >> RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o >> /bin/opensm >> >> Actually OsmTest is a test that currently fail (due to last changes in >> InformInfo), >> but any other *.check.tcl/*.sim.tcl pair should work. >> >> Eitan >> >> >> Chris Elmquist wrote: >> >>> Folks, >>> >>> I am trying to build and run IBMgtsim so that I can explore some different >>> topologies and system sizes. But I am having a lot of trouble getting >>> OpenSM to work with the simulator. >>> >>> I pulled down Eitan's ibutils git tree (to get the simulator) and >>> am otherwise using the OFED 1.1 tarball for the rest of the stuff. >>> I suspect I have a problem with OpenSM not being built correctly to use >>> the simulator. >>> >>> Does anyone have a recipe on how to build and install all of these pieces >>> (ie, openib, openSM and ibmgtsim) so that they will work together? >>> >>> I have been just trying to run one of the tests provided with the >>> simulator like this: >>> >>> % cd ~/ibutils/ibmgtsim/tests >>> % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o >>> /usr/local/bin/opensm >>> >>> but we get this sort of output: >>> >>> -I- Using random seed:43204 >>> -I- Simulation directory is: /tmp/ibmgtsim.29716 >>> -I- Calling IBMgtSim -s 43204 -V 0xA3 -t >>> /root/ibutils/ibmgtsim/tests/IS1-16.top >>> o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l >>> /tmp/ibmgtsim.29716/sim.log >>> -I- Simulator Ready >>> -I- Connecting to the simulator control server:pcplod.americas.sgi.com >>> port:3726 >>> 5 >>> -I- Connected to the simulator control server >>> -I- Defined 51 guids >>> -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} >>> {0x0002c9000000000a >>> 2} >>> -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009 ... >>> -I- Waiting for OpenSM subnet up ... 
>>> -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> >>> osm_vendor_open_port: ERR 5422: Unable to find requested CA guid >>> 0x2c90000000009 >>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log >>> -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: >>> ERR 5 >>> 424: Unable to Open Port 0x2c90000000009 >>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log >>> -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> >>> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed >>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log >>> -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR >>> 2E10: >>> SM MAD Controller bind failed (IB_ERROR) >>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log >>> -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> >>> osm_sa_mad_ctrl_unbind >>> : ERR 1A11: No previous bind >>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log >>> >>> Thank you. >>> >>> Chris >>> SGI Network Engineering >>> >>> > > From eitan at mellanox.co.il Thu Dec 21 11:10:32 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Thu, 21 Dec 2006 21:10:32 +0200 Subject: [openib-general] OpenSM/osm_ucast_mgr.c: In osm_ucast_mgr_set_fwd_table, always reset port state change when set In-Reply-To: <1166710089.4519.112824.camel@hal.voltaire.com> References: <1166710089.4519.112824.camel@hal.voltaire.com> Message-ID: <458ADC28.80305@mellanox.co.il> Good catch. 

Hal Rosenstock wrote:
> OpenSM/osm_ucast_mgr.c: In osm_ucast_mgr_set_fwd_table, always reset
> port state change when set
>
> Signed-off-by: Hal Rosenstock
>
> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index f663d2d..f546c5f 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c
> @@ -922,7 +922,7 @@ osm_ucast_mgr_set_fwd_table(
>    else
>      life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
>
> -  if (life_state != si.life_state)
> +  if ( (life_state != si.life_state) || ib_switch_info_get_state_change( &si ) )
>    {
>      set_swinfo_require = TRUE;
>      si.life_state = life_state;
>

From eitan at mellanox.co.il Thu Dec 21 11:14:11 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 21:14:11 +0200
Subject: [openib-general] [PATCH] osm: fix simulator vendor not initializing complete mad address
Message-ID: <458ADD03.4020909@mellanox.co.il>

Hi Hal,

This fix resolves the issue I have seen in the osmtest InformInfo flow.
I am still not sure it is correct to compare the sender address in the SA
InformInfo receiver by simply comparing the entire osm_mad_addr structure.
But anyway, at least the simulator now behaves like the rest of the stacks.

The fix makes sure we init the complete mad address structure before
copying the relevant data.
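
A minimal sketch of why zeroing first matters when a receiver compares
whole structures. It assumes a hypothetical address type (demo_addr_t and
all function names below are invented for illustration and do not reflect
the real osm_mad_addr_t layout): if a converter assigns only some fields,
the remaining bytes keep whatever was in memory, so a memcmp of two
logically equal addresses can report a mismatch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a MAD address; NOT the real osm_mad_addr_t. */
typedef struct demo_addr {
    uint16_t dest_lid;
    uint8_t  path_bits;
    uint8_t  static_rate;
    uint32_t remote_qp;   /* a field the converter never assigns */
} demo_addr_t;

/* Assign only the fields the converter knows about. */
static void fill_partial(demo_addr_t *p, uint16_t lid)
{
    p->dest_lid = lid;
    p->path_bits = 0;
    p->static_rate = 0;
}

/* Same assignments, but zero the whole structure first. */
static void fill_zeroed(demo_addr_t *p, uint16_t lid)
{
    memset(p, 0, sizeof(*p));
    fill_partial(p, lid);
}

/* Whole-structure comparison, as a receiver might do. */
static int same_addr(const demo_addr_t *a, const demo_addr_t *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

/* Returns 0 when the partial fill mismatches and the zeroed fill matches. */
int demo_compare(void)
{
    demo_addr_t a, b;
    int partial_equal, zeroed_equal;

    /* Simulate stale memory left over from two different earlier uses. */
    memset(&a, 0xAA, sizeof(a));
    memset(&b, 0x55, sizeof(b));
    fill_partial(&a, 7);
    fill_partial(&b, 7);
    partial_equal = same_addr(&a, &b);   /* stale remote_qp bytes differ */

    memset(&a, 0xAA, sizeof(a));
    memset(&b, 0x55, sizeof(b));
    fill_zeroed(&a, 7);
    fill_zeroed(&b, 7);
    zeroed_equal = same_addr(&a, &b);    /* every byte is now defined */

    return (partial_equal == 0 && zeroed_equal == 1) ? 0 : 1;
}
```

Zeroing makes a byte-wise comparison deterministic; whether comparing the
entire structure is the right test for sender identity is a separate
question, as noted above.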

Signed-off-by: Eitan Zahavi
---
 osm/libvendor/osm_vendor_mlx_sim.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/osm/libvendor/osm_vendor_mlx_sim.c b/osm/libvendor/osm_vendor_mlx_sim.c
index 4692df0..d3e6eeb 100644
--- a/osm/libvendor/osm_vendor_mlx_sim.c
+++ b/osm/libvendor/osm_vendor_mlx_sim.c
@@ -381,6 +381,7 @@ __osmv_ibms_mad_addr_to_osm_addr(
   IN uint8_t is_smi,
   OUT osm_mad_addr_t *p_osm_addr)
 {
+  memset(p_osm_addr, 0, sizeof(osm_mad_addr_t));
   p_osm_addr->dest_lid = cl_hton16(p_ibms_addr->slid);
   p_osm_addr->static_rate = 0;
   p_osm_addr->path_bits = 0;
-- 
1.4.4.1.GIT

From eitan at mellanox.co.il Thu Dec 21 11:16:59 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 21:16:59 +0200
Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to return error when expected error does not happen
Message-ID: <458ADDAB.80301@mellanox.co.il>

Hi Hal,

I have found that on BAD InformInfo transactions, where osmtest expects
an error from the SM, it fails to return an error to the calling procedure
when the request unexpectedly succeeds, so osmtest passes the test.
EZ Signed-off-by: Eitan Zahavi --- osm/osmtest/osmtest.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 48 insertions(+), 2 deletions(-) diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index b1df333..e1c64ef 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* InformInfoRecord tests */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a BAD - Set Unsubscribe request\n"); memset( &inform_info_opt, 0, sizeof( inform_info_opt ) ); memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, - IB_MAD_METHOD_SET, &inform_info_rec_opt, + IB_MAD_METHOD_SET, &inform_info_rec_opt, &context ); if ( status == IB_SUCCESS ) + { + status = IB_ERROR; goto Exit; + } else { osm_log( &p_osmt->log, OSM_LOG_ERROR, @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_ "IS EXPECTED ERROR ^^^^\n"); } + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - Empty GetTable request\n"); memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, - IB_MAD_METHOD_GETTABLE, + IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context ); if ( status != IB_SUCCESS ) goto Exit; /* InformInfo tests */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a BAD - Empty Get request " + "(should fail with NO_RECORDS)\n"); memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, IB_MAD_METHOD_GET, &inform_info_opt, &context ); if ( status == IB_SUCCESS ) + { + status = IB_ERROR; goto Exit; + } else { osm_log( &p_osmt->log, OSM_LOG_ERROR, @@ -5849,12 +5865,18 @@ 
osmtest_validate_against_db( IN osmtest_ "IS EXPECTED ERROR ^^^^\n"); } + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a BAD - Set Unsubscribe request\n"); memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, IB_MAD_METHOD_SET, &inform_info_opt, &context ); if ( status == IB_SUCCESS ) + { + status = IB_ERROR; goto Exit; + } else { osm_log( &p_osmt->log, OSM_LOG_ERROR, @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_ } /* Now subscribe */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - Set Subscribe request\n"); inform_info_opt.subscribe = TRUE; memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* Now unsubscribe (QPN needs to be 1 to work) */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - Set Unsubscribe request\n"); inform_info_opt.subscribe = FALSE; inform_info_opt.qpn = 1; memset( &context, 0, sizeof( context ) ); @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* Now subscribe again */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - Set Subscribe request\n"); inform_info_opt.subscribe = TRUE; inform_info_opt.qpn = 1; memset( &context, 0, sizeof( context ) ); @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* Subscribe over existing subscription */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - Set Subscribe (again) request\n"); inform_info_opt.qpn = 0; memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, @@ -5902,6 +5936,9 @@ 
osmtest_validate_against_db( IN osmtest_ /* More InformInfoRecord tests */ /* RID lookup (with currently invalid enum) */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - GetTable by GID\n"); ib_gid_set_default( &inform_info_rec_opt.subscriber_gid, p_osmt->local_port.port_guid ); inform_info_rec_opt.subscriber_enum = 1; @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* Enum lookup */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - GetTable (subsriber_enum == 0) request\n"); inform_info_rec_opt.subscriber_enum = 0; memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* Get all InformInfoRecords */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - GetTable (ALL records) request\n"); memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_ goto Exit; /* Cleanup subscriptions before further testing */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - Set (cleanup all of them) request\n"); inform_info_opt.subscribe = FALSE; inform_info_opt.qpn = 1; memset( &context, 0, sizeof( context ) ); -- 1.4.4.1.GIT From chrise at sgi.com Thu Dec 21 11:26:58 2006 From: chrise at sgi.com (Chris Elmquist) Date: Thu, 21 Dec 2006 13:26:58 -0600 Subject: [openib-general] building and running IBMgtsim? 
In-Reply-To: <458ADBE4.708@mellanox.co.il> References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il> <20061221170219.GH19625@sgi.com> <458ADBE4.708@mellanox.co.il> Message-ID: <20061221192658.GJ19625@sgi.com> On Thursday (12/21/2006 at 09:09PM +0200), Eitan Zahavi wrote: > I was able to simulate 10K nodes in the past. > What I did to get there was to use two machines: one for the simulator > and one for the SM. OK. Those are good datapoints. > I also used 64bit (x86_64) machines to avoid the ~3GB data limit. We've got that covered... Thanks. Chris -- Chris Elmquist mailto:chrise at sgi.com (651)683-3093 Silicon Graphics, Inc. Eagan, MN From halr at voltaire.com Thu Dec 21 11:39:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 14:39:41 -0500 Subject: [openib-general] [PATCH] osm: fix simulator vendor not initializing complete mad address In-Reply-To: <458ADD03.4020909@mellanox.co.il> References: <458ADD03.4020909@mellanox.co.il> Message-ID: <1166729980.4519.128300.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-12-21 at 14:14, Eitan Zahavi wrote: > Hi Hal, > > This fix resolves the issue I have seen on osmtest InformInfo flow. > I am still not sure it is correct to compare sender address in the SA > InformInfo receiver by simply comparing the entire osm_mad_addr structure. I'm not sure either. I will look more into this. > But anyway, at least the simulator now behaves like the rest of the stacks. > > The fix makes sure we init the complete mad address structure before > copying the relevant data. > > Signed-off-by: Eitan Zahavi Thanks. Applied. 
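As an aside on why initializing the complete structure matters when the receiver compares whole address structures: the sketch below is an assumption-laden illustration, not OpenSM code. demo_addr_t, fill_addr and same_sender are hypothetical names, and the differing stale memory is simulated with fixed byte patterns so that the behaviour is deterministic.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an address structure; NOT the real osm_mad_addr_t. */
typedef struct demo_addr {
    uint16_t dest_lid;
    uint8_t  path_bits;
    uint8_t  static_rate;
    uint32_t extra;         /* a field the conversion routine never sets */
} demo_addr_t;

/* Fills only some fields, optionally zeroing the whole structure first. */
void fill_addr(demo_addr_t *p, uint16_t lid, int zero_first)
{
    if (zero_first)
        memset(p, 0, sizeof(*p));
    p->dest_lid = lid;
    p->path_bits = 0;
    p->static_rate = 0;
}

/*
 * Builds two addresses for the same LID in buffers that start with
 * different stale contents, then compares them bytewise.
 * Returns 1 if they compare equal, 0 otherwise.
 */
int same_sender(int zero_first)
{
    demo_addr_t a, b;

    memset(&a, 0xAA, sizeof(a));    /* simulated stale contents */
    memset(&b, 0x55, sizeof(b));
    fill_addr(&a, 7, zero_first);
    fill_addr(&b, 7, zero_first);
    return memcmp(&a, &b, sizeof(a)) == 0;
}
```

Here same_sender(0) returns 0, because the untouched extra field keeps its stale bytes, and same_sender(1) returns 1. Comparing the relevant fields one by one would avoid the dependency on zeroing, which is presumably the open question about the receiver-side comparison.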
-- Hal From halr at voltaire.com Thu Dec 21 11:40:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 14:40:57 -0500 Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to return error when expected error does not happen In-Reply-To: <458ADDAB.80301@mellanox.co.il> References: <458ADDAB.80301@mellanox.co.il> Message-ID: <1166730056.4519.128359.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-12-21 at 14:16, Eitan Zahavi wrote: > Hi Hal, > > I have found that on BAD InformInfo transactions when the osmtest > expects an error from the SM > it misses returning an error to the calling procedure which will make > osmtest pass the test. > > EZ > Signed-off-by: Eitan Zahavi > > --- > osm/osmtest/osmtest.c | 50 > +++++++++++++++++++++++++++++++++++++++++++++++- > 1 files changed, 48 insertions(+), 2 deletions(-) > > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c > index b1df333..e1c64ef 100644 > --- a/osm/osmtest/osmtest.c > +++ b/osm/osmtest/osmtest.c > @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* InformInfoRecord tests */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a BAD - Set Unsubscribe request\n"); > memset( &inform_info_opt, 0, sizeof( inform_info_opt ) ); > memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, This patch is line wrapped here (and maybe other places as well) :-( -- Hal > - IB_MAD_METHOD_SET, &inform_info_rec_opt, > + IB_MAD_METHOD_SET, > &inform_info_rec_opt, > &context ); > if ( status == IB_SUCCESS ) > + { > + status = IB_ERROR; > goto Exit; > + } > else > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_ > "IS EXPECTED ERROR ^^^^\n"); > } > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + 
"osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Empty GetTable request\n"); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > - IB_MAD_METHOD_GETTABLE, > + IB_MAD_METHOD_GETTABLE, > &inform_info_rec_opt, &context ); > if ( status != IB_SUCCESS ) > goto Exit; > > /* InformInfo tests */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a BAD - Empty Get request " > + "(should fail with NO_RECORDS)\n"); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > IB_MAD_METHOD_GET, &inform_info_opt, > &context ); > if ( status == IB_SUCCESS ) > + { > + status = IB_ERROR; > goto Exit; > + } > else > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > @@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_ > "IS EXPECTED ERROR ^^^^\n"); > } > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a BAD - Set Unsubscribe request\n"); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > IB_MAD_METHOD_SET, &inform_info_opt, > &context ); > if ( status == IB_SUCCESS ) > + { > + status = IB_ERROR; > goto Exit; > + } > else > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_ > } > > /* Now subscribe */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Set Subscribe request\n"); > inform_info_opt.subscribe = TRUE; > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Now unsubscribe (QPN needs to be 1 to work) */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + 
"osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Set Unsubscribe request\n"); > inform_info_opt.subscribe = FALSE; > inform_info_opt.qpn = 1; > memset( &context, 0, sizeof( context ) ); > @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Now subscribe again */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Set Subscribe request\n"); > inform_info_opt.subscribe = TRUE; > inform_info_opt.qpn = 1; > memset( &context, 0, sizeof( context ) ); > @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Subscribe over existing subscription */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Set Subscribe (again) request\n"); > inform_info_opt.qpn = 0; > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > @@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_ > > /* More InformInfoRecord tests */ > /* RID lookup (with currently invalid enum) */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - GetTable by GID\n"); > ib_gid_set_default( &inform_info_rec_opt.subscriber_gid, > p_osmt->local_port.port_guid ); > inform_info_rec_opt.subscriber_enum = 1; > @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Enum lookup */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - GetTable (subsriber_enum == 0) request\n"); > inform_info_rec_opt.subscriber_enum = 0; > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Get all InformInfoRecords */ > + osm_log( &p_osmt->log, 
OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - GetTable (ALL records) request\n"); > memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Cleanup subscriptions before further testing */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Set (cleanup all of them) request\n"); > inform_info_opt.subscribe = FALSE; > inform_info_opt.qpn = 1; > memset( &context, 0, sizeof( context ) ); From danb at voltaire.com Thu Dec 21 11:53:03 2006 From: danb at voltaire.com (Dan Bar Dov) Date: Thu, 21 Dec 2006 21:53:03 +0200 Subject: [openib-general] iSER target References: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com> <458A7FEA.7070707@dev.mellanox.co.il> Message-ID: <3857BB049D83424D9DB82753D37CEA5509F469@taurus.voltaire.com> Verbs. RIP kdapl. Dan -----Original Message----- From: Aviram Gutman [mailto:aviram at dev.mellanox.co.il] Sent: Thu 12/21/2006 2:36 PM To: Dan Bar Dov Cc: vishal; openib-general at openib.org Subject: Re: [openib-general] iSER target Are you planning to have the iSER target over verbs or kDAPL? Isn't the kDAPL development halted? Aviram Dan Bar Dov wrote: > The iser target code in the gen2 branch is functional > over kdapl. It requires an iscsi target code above it, > however such an iscsi code is not open. > > It was opened as a precursor for an open-source iscsi/iser-target > project. That project is still in its early stages, and the plan is > to add iser-target support, loosely based on the open-iser-target > code, to the stgt project. > > Due to the above, there is no readme/installation guide.
> > Dan > > >> -----Original Message----- >> From: openib-general-bounces at openib.org >> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal >> Sent: Wednesday, December 20, 2006 4:03 AM >> To: openib-general at openib.org >> Subject: [openib-general] iSER target >> >> Hi, >> >> I would like to confirm if the iSER target code in the gen2 branch >> is functional. If yes, is there a readme/installation guide >> available... >> >> Thanks a lot! >> >> Vishal From halr at voltaire.com Thu Dec 21 12:42:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 15:42:51 -0500 Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to return error when expected error does not happen In-Reply-To: <1166730056.4519.128359.camel@hal.voltaire.com> References: <458ADDAB.80301@mellanox.co.il> <1166730056.4519.128359.camel@hal.voltaire.com> Message-ID: <1166733770.4519.131252.camel@hal.voltaire.com> Hi again Eitan, On Thu, 2006-12-21 at 14:40, Hal Rosenstock wrote: > Hi Eitan, > > On Thu, 2006-12-21 at 14:16, Eitan Zahavi wrote: > > Hi Hal, > > > > I have found that on BAD InformInfo transactions when the osmtest > > expects an error from the SM > > it misses returning an error to the calling procedure which will make > > osmtest pass the test.
> > > > EZ > > Signed-off-by: Eitan Zahavi > > > > --- > > osm/osmtest/osmtest.c | 50 > > +++++++++++++++++++++++++++++++++++++++++++++++- > > 1 files changed, 48 insertions(+), 2 deletions(-) > > > > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c > > index b1df333..e1c64ef 100644 > > --- a/osm/osmtest/osmtest.c > > +++ b/osm/osmtest/osmtest.c > > @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* InformInfoRecord tests */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a BAD - Set Unsubscribe request\n"); > > memset( &inform_info_opt, 0, sizeof( inform_info_opt ) ); > > memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, > > IB_MAD_ATTR_INFORM_INFO_RECORD, > > This patch is line wrapped here (and maybe other places as well) :-( Never mind. I nursed it through. Other comments to follow... 
-- Hal > -- Hal > > > - IB_MAD_METHOD_SET, &inform_info_rec_opt, > > + IB_MAD_METHOD_SET, > > &inform_info_rec_opt, > > &context ); > > if ( status == IB_SUCCESS ) > > + { > > + status = IB_ERROR; > > goto Exit; > > + } > > else > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_ > > "IS EXPECTED ERROR ^^^^\n"); > > } > > > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - Empty GetTable request\n"); > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, > > IB_MAD_ATTR_INFORM_INFO_RECORD, > > - IB_MAD_METHOD_GETTABLE, > > + IB_MAD_METHOD_GETTABLE, > > &inform_info_rec_opt, &context ); > > if ( status != IB_SUCCESS ) > > goto Exit; > > > > /* InformInfo tests */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a BAD - Empty Get request " > > + "(should fail with NO_RECORDS)\n"); > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > > IB_MAD_METHOD_GET, &inform_info_opt, > > &context ); > > if ( status == IB_SUCCESS ) > > + { > > + status = IB_ERROR; > > goto Exit; > > + } > > else > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > @@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_ > > "IS EXPECTED ERROR ^^^^\n"); > > } > > > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a BAD - Set Unsubscribe request\n"); > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > > IB_MAD_METHOD_SET, &inform_info_opt, > > &context ); > > if ( status == IB_SUCCESS ) > > + { > > + status = IB_ERROR; > > goto Exit; > > + } > > else > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN 
osmtest_ > > } > > > > /* Now subscribe */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - Set Subscribe request\n"); > > inform_info_opt.subscribe = TRUE; > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > > @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* Now unsubscribe (QPN needs to be 1 to work) */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - Set Unsubscribe request\n"); > > inform_info_opt.subscribe = FALSE; > > inform_info_opt.qpn = 1; > > memset( &context, 0, sizeof( context ) ); > > @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* Now subscribe again */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - Set Subscribe request\n"); > > inform_info_opt.subscribe = TRUE; > > inform_info_opt.qpn = 1; > > memset( &context, 0, sizeof( context ) ); > > @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* Subscribe over existing subscription */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - Set Subscribe (again) request\n"); > > inform_info_opt.qpn = 0; > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > > @@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_ > > > > /* More InformInfoRecord tests */ > > /* RID lookup (with currently invalid enum) */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - GetTable by GID\n"); > > ib_gid_set_default( &inform_info_rec_opt.subscriber_gid, > > p_osmt->local_port.port_guid ); > > 
inform_info_rec_opt.subscriber_enum = 1; > > @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* Enum lookup */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - GetTable (subsriber_enum == 0) request\n"); > > inform_info_rec_opt.subscriber_enum = 0; > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, > > IB_MAD_ATTR_INFORM_INFO_RECORD, > > @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* Get all InformInfoRecords */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - GetTable (ALL records) request\n"); > > memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); > > memset( &context, 0, sizeof( context ) ); > > status = osmtest_informinfo_request( p_osmt, > > IB_MAD_ATTR_INFORM_INFO_RECORD, > > @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_ > > goto Exit; > > > > /* Cleanup subscriptions before further testing */ > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > > + "osmtest_informinfo_request: InformInfoRecord " > > + "Sending a Good - Set (cleanup all of them) request\n"); > > inform_info_opt.subscribe = FALSE; > > inform_info_opt.qpn = 1; > > memset( &context, 0, sizeof( context ) ); From Ashish.Batwara at lsi.com Thu Dec 21 13:39:00 2006 From: Ashish.Batwara at lsi.com (Batwara, Ashish) Date: Thu, 21 Dec 2006 14:39:00 -0700 Subject: [openib-general] opensm Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A0115A12F@NAMAIL2.ad.lsil.com> Thanks Vu, This seems to be working.
Thanks Ashish -----Original Message----- From: Vu Pham [mailto:vuhuong at mellanox.com] Sent: Wednesday, December 20, 2006 3:23 PM To: Batwara, Ashish Cc: Hal Rosenstock; ishai at mellanox.co.il; openib-general at openib.org Subject: Re: [openib-general] opensm Hi Ashish, > Hi, > Please see the information below > > This is what I did: > /etc/init.d/openibd start > /etc/init.d/opensmd start > modprobe ib_srp > > Issued the command /usr/local/ofed/sbin/ibsrpdm -c to get the > information about target and used them in > By default, without the -d option, ibsrpdm uses /dev/infiniband/umad0, which corresponds to port 1 of mthca0 > echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target This is correct, since it uses srp-mthca0-1; however, in your previous email, where you reported *I am seeing the error "Got failed path rec status -110" on Linux console*, you had: echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target You used port 2 of mthca0 here, i.e. srp-mthca0-2; therefore, you got the path record failure. Please retry: 0. Make sure you connect port 1 of the host HCA to the target (since you connect them directly; port 2 works as well, but then you have to use umad1 and srp-mthca0-2 for steps 1 and 2 below) 1. ibsrpdm -c -d /dev/infiniband/umad0 2. echo whatever target is discovered to srp-mthca0-1 -vu > > Yes, earlier I had silverstorm switch which was running SM but now I > have taken that out and directly connecting the target and host. > > I have only one port connected between the host and the target.
> The reason the link is not stable is that I am restarting and > stopping again and again, as this does not seem to be working and I did > not know the issue until I looked at the console log which was > indicating "Got failed path rec status -110" and after seeing that I > searched on Google and found that > "https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html" > it seems to be a bug with 64-bit machines. > BTW, my linux server is 64-bit. > When I hooked up a 32-bit server running OFED-1.1, I see my target > discovered with the same procedure. > > So the question is: what is the fix for the issue "Got failed path > rec status -110" on a 64-bit machine? > > Thanks > Ashish > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Tuesday, December 19, 2006 10:35 PM > To: Batwara, Ashish > Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org > Subject: RE: [openib-general] opensm > > On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote: >> Hi, >> Please look towards the end of the attached file. > > What options are you starting opensm with ? What is the command line ? > > Also, it looks like (at least at one point) you have another SM on the > subnet. What is the make (vendor) for your switch ? > > I see many SM port is DOWN. What is going on with this port ? Why is the > physical link not LinkUp and stable ? That is the main issue and is > likely why the SubnGet of NodeInfo is not being responded to. > > -- Hal > >> Thanks >> Ashish >> >> -----Original Message----- >> From: Hal Rosenstock [mailto:halr at voltaire.com] >> Sent: Tuesday, December 19, 2006 5:06 PM >> To: Batwara, Ashish >> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org >> Subject: Re: [openib-general] opensm >> >> Ashish, >> >> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote: >>> Hi, >>> >>> Here is the info that you have asked. I am seeing the Subnet manager >>> is up now having the port active.
But server is not able to discover >>> the target. I am seeing the error "Got failed path rec status -110" > on >>> Linux console. >> That means the request for an SA PathRecord from the initiator to the >> target failed (-110 is ETIMEDOUT). Are you sure the target is up >> (ACTIVE) on the subnet ? If it is, can you send the opensm log ? >> >> -- Hal >> >>> Below are the output of different commands. I am using following to >>> discover the target: >>> >>> >>> >>> /etc/init.d/opensmd start >>> >>> /etc/init.d/openibd start >>> >>> modprobe ib_srp >>> >>> echo >>> > id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000 >> 002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > >> /sys/class/infiniband_srp/srp-mthca0-2/add_target >>> >>> >>> >>> >>> [root at p49 ~]# ibv_devinfo >>> >>> hca_id: mthca0 >>> >>> fw_ver: 5.1.400 >>> >>> node_guid: 0002:c902:0022:cce0 >>> >>> sys_image_guid: 0002:c902:0022:cce3 >>> >>> vendor_id: 0x02c9 >>> >>> vendor_part_id: 25218 >>> >>> hw_ver: 0xA0 >>> >>> board_id: MT_0370130002 >>> >>> phys_port_cnt: 2 >>> >>> port: 1 >>> >>> state: PORT_DOWN (1) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 512 (2) >>> >>> sm_lid: 0 >>> >>> port_lid: 0 >>> >>> port_lmc: 0x00 >>> >>> >>> >>> port: 2 >>> >>> state: PORT_ACTIVE (4) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 2048 (4) >>> >>> sm_lid: 1 >>> >>> port_lid: 1 >>> >>> port_lmc: 0x00 >>> hca_id: mthca1 >>> >>> fw_ver: 5.1.400 >>> >>> node_guid: 0002:c902:0022:cd2c >>> >>> sys_image_guid: 0002:c902:0022:cd2f >>> >>> vendor_id: 0x02c9 >>> >>> vendor_part_id: 25218 >>> >>> hw_ver: 0xA0 >>> >>> board_id: MT_0370130002 >>> >>> phys_port_cnt: 2 >>> >>> port: 1 >>> >>> state: PORT_DOWN (1) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 512 (2) >>> >>> sm_lid: 0 >>> >>> port_lid: 0 >>> >>> port_lmc: 0x00 >>> >>> >>> >>> port: 2 >>> >>> state: PORT_DOWN (1) >>> >>> max_mtu: 2048 (4) >>> >>> active_mtu: 512 (2) >>> >>> sm_lid: 0 >>> >>> port_lid: 0 >>> >>> port_lmc: 0x00 >>> >>> 
>>> >>> >>> >>> [root at p49 ~]# uname -a >>> >>> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 >>> EDT 2006 x86_64 x86_64 x86_64 GNU/Linux >>> >>> >>> >>> [root at p49 ~]# cat /etc/infiniband/info >>> >>> #!/bin/bash >>> >>> >>> >>> echo prefix=/usr/local/ofed >>> >>> echo Kernel=2.6.9-42.0.3.ELsmp >>> >>> echo >>> >>> echo "Configure options: --with-dapl --with-ipoibtools > --with-libibcm >>> --with-libibcommon --with-libibmad --with-libibumad > --with-libibverbs >>> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm >>> --with-libsdp --with-openib-diags --with-srptools --with-mstflint >>> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod >>> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod >>> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod" >>> >>> echo >>> >>> >>> >>> OFED Version: OFED-1.1 >> >> >>> Thanks >>> >>> Ashish >>> >>> -----Original Message----- >>> From: Eitan Zahavi [mailto:eitan at mellanox.co.il] >>> Sent: Tuesday, December 19, 2006 5:18 AM >>> To: Batwara, Ashish >>> Cc: ishai at mellanox.co.il; openib-general at openib.org >>> Subject: Re: [openib-general] opensm >>> >>> >>> >>> Hi Ashish, >>> >>> >>> >>> SRP people say they have no such error message. >>> >>> OpenSM does. So I take it back. >>> >>> >>> >>> Ashish, >>> >>> Please provide more into: >>> >>> >>> >>> 1. ibv_devinfo >>> >>> 2. Version of code you are using >>> >>> 3. Command line you use for starting opensm >>> >>> 4. /var/log/osm.log >>> >>> >>> >>> Thanks and sorry for the confusion. >>> >>> >>> >>> EZ >>> >>> >>> >>> Eitan Zahavi wrote: >>> >>>> This is not an OpenSM issue. >>>> Forwarded to the SRP people. >>>> EZ >>>> Batwara, Ashish wrote: >>>> >>>>> Hi, >>>>> I am trying to run opensm on Linux server. It has two HCAs >>> (4-ports) and >>> >>>>> connected to IB Switch. ibnodes command displays the information >>> about >>> >>>>> the Switch ports and HCA ports. 
>>>>> When I start opensm, I see in /var/log/messages "Starting >>> srp_daemon" >>> >>>>> for all the 4 ports and immediately after I see "failed > srp_daemon" >>> for >>> >>>>> all the ports and the displays "SM Port is down". >>>>> I tried several times and even rebooted the server few times but > no >>>>> luck. >>>>> Does anybody know what this problem is? >>>>> Thanks >>>>> Ashish From halr at voltaire.com Thu Dec 21 12:55:12 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 15:55:12 -0500 Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to return error when expected error does not happen In-Reply-To: <458ADDAB.80301@mellanox.co.il> References: <458ADDAB.80301@mellanox.co.il> Message-ID: <1166734511.4519.131808.camel@hal.voltaire.com> Hi Eitan, On Thu, 2006-12-21 at 14:16, Eitan
Zahavi wrote: > Hi Hal, > > I have found that on BAD InformInfo transactions when the osmtest > expects an error from the SM > it misses returning an error to the calling procedure which will make > osmtest pass the test. > > EZ > Signed-off-by: Eitan Zahavi > > --- > osm/osmtest/osmtest.c | 50 Thanks. Applied (after fixing up the whitespace and adapting to the latest osmtest/osmtest.c). Some additional comments below: > +++++++++++++++++++++++++++++++++++++++++++++++- > 1 files changed, 48 insertions(+), 2 deletions(-) > > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c > index b1df333..e1c64ef 100644 > --- a/osm/osmtest/osmtest.c > +++ b/osm/osmtest/osmtest.c > @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* InformInfoRecord tests */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a BAD - Set Unsubscribe request\n"); > memset( &inform_info_opt, 0, sizeof( inform_info_opt ) ); > memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > - IB_MAD_METHOD_SET, &inform_info_rec_opt, > + IB_MAD_METHOD_SET, > &inform_info_rec_opt, > &context ); > if ( status == IB_SUCCESS ) > + { > + status = IB_ERROR; > goto Exit; > + } Dang; missed that again... Yevgeny spanked me on this before... 
> else > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_ > "IS EXPECTED ERROR ^^^^\n"); > } > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - Empty GetTable request\n"); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > - IB_MAD_METHOD_GETTABLE, > + IB_MAD_METHOD_GETTABLE, > &inform_info_rec_opt, &context ); > if ( status != IB_SUCCESS ) > goto Exit; > > /* InformInfo tests */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a BAD - Empty Get request " > + "(should fail with NO_RECORDS)\n"); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > IB_MAD_METHOD_GET, &inform_info_opt, > &context ); > if ( status == IB_SUCCESS ) > + { > + status = IB_ERROR; > goto Exit; > + } > else > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > @@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_ > "IS EXPECTED ERROR ^^^^\n"); > } > > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a BAD - Set Unsubscribe request\n"); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > IB_MAD_METHOD_SET, &inform_info_opt, > &context ); > if ( status == IB_SUCCESS ) > + { > + status = IB_ERROR; > goto Exit; > + } > else > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_ > } > > /* Now subscribe */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a Good - Set Subscribe request\n"); > inform_info_opt.subscribe = TRUE; > memset( &context, 0, sizeof( context ) ); > 
status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Now unsubscribe (QPN needs to be 1 to work) */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a Good - Set Unsubscribe request\n"); > inform_info_opt.subscribe = FALSE; > inform_info_opt.qpn = 1; > memset( &context, 0, sizeof( context ) ); > @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Now subscribe again */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a Good - Set Subscribe request\n"); > inform_info_opt.subscribe = TRUE; > inform_info_opt.qpn = 1; > memset( &context, 0, sizeof( context ) ); > @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Subscribe over existing subscription */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a Good - Set Subscribe (again) request\n"); > inform_info_opt.qpn = 0; > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, > @@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_ > > /* More InformInfoRecord tests */ > /* RID lookup (with currently invalid enum) */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - GetTable by GID\n"); > ib_gid_set_default( &inform_info_rec_opt.subscriber_gid, > p_osmt->local_port.port_guid ); > inform_info_rec_opt.subscriber_enum = 1; > @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Enum lookup */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - GetTable (subsriber_enum == 0) request\n"); 
subscriber_enum > inform_info_rec_opt.subscriber_enum = 0; > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Get all InformInfoRecords */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " > + "Sending a Good - GetTable (ALL records) request\n"); > memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); > memset( &context, 0, sizeof( context ) ); > status = osmtest_informinfo_request( p_osmt, > IB_MAD_ATTR_INFORM_INFO_RECORD, > @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_ > goto Exit; > > /* Cleanup subscriptions before further testing */ > + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, > + "osmtest_informinfo_request: InformInfoRecord " ^^^^^^^^^^ InformInfo > + "Sending a Good - Set (cleanup all of them) request\n"); > inform_info_opt.subscribe = FALSE; > inform_info_opt.qpn = 1; > memset( &context, 0, sizeof( context ) ); -- Hal From halr at voltaire.com Thu Dec 21 12:59:42 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 21 Dec 2006 15:59:42 -0500 Subject: [openib-general] [PATCH]osmtest/osmtest.c: Add more InformInfo/InformInfoRecord tests Message-ID: <1166734781.4519.132032.camel@hal.voltaire.com> osmtest/osmtest.c: Add more InformInfo/InformInfoRecord tests Signed-off-by: Hal Rosenstock diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 355a6f9..6afa899 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -73,6 +73,7 @@ typedef struct _osmtest_inform_info { boolean_t subscribe; ib_net32_t qpn; + ib_net16_t trap; } osmtest_inform_info_t; typedef struct _osmtest_inform_info_rec @@ -4890,6 +4891,11 @@ osmtest_informinfo_request( rec.g_or_v.generic.qpn_resp_time_val = cl_hton32(p_inform_info_opt->qpn) >> 8; user.comp_mask |= IB_IIR_COMPMASK_QPN; } + if (p_inform_info_opt->trap) + { + 
rec.g_or_v.generic.trap_num = cl_hton16(p_inform_info_opt->trap); + user.comp_mask |= IB_IIR_COMPMASK_TRAPNUMB; + } user.p_attr = &rec; } user.method = method; @@ -5973,12 +5979,63 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* Another subscription */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfo " + "Sending another Good - Set Subscribe (again) request\n"); + inform_info_opt.qpn = 0; + inform_info_opt.trap = 0x1234; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, &inform_info_opt, + &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Get all InformInfoRecords again */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - GetTable (ALL records) request\n"); + memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, + &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + /* Cleanup subscriptions before further testing */ + /* Does order of deletion matter ? Test this !!! 
*/ osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_informinfo_request: InformInfo " - "Sending a Good - Set (cleanup all of them) request\n"); + "Sending a Good - Set (cleanup) request\n"); + inform_info_opt.subscribe = FALSE; + inform_info_opt.qpn = 1; + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, + IB_MAD_METHOD_SET, + &inform_info_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + /* Get all InformInfoRecords again */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - GetTable (ALL records) request\n"); + memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, + &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfo " + "Sending a Good - Set (cleanup) request\n"); inform_info_opt.subscribe = FALSE; inform_info_opt.qpn = 1; + inform_info_opt.trap = 0; memset( &context, 0, sizeof( context ) ); status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO, IB_MAD_METHOD_SET, @@ -5986,6 +6043,18 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* Get all InformInfoRecords a final time */ + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_informinfo_request: InformInfoRecord " + "Sending a Good - GetTable (ALL records) request\n"); + memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD, + IB_MAD_METHOD_GETTABLE, + &inform_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + if (lmc != 0) { test_lid = cl_ntoh16( p_osmt->local_port.lid + 1 ); From eitan at
sw053.yok.mtl.com Thu Dec 21 21:10:01 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Fri, 22 Dec 2006 07:10:01 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-22:normal completion Message-ID: <200612220510.kBM5A1pj018761@sw053.yok.mtl.com>
OSM Simulation Regression Summary
OpenSM rev = Thu_Dec_21_14:36:22_2006 c3fcbb MOD_FILES=1
ibutils rev = Mon_Dec_18_16:00:49_2006 11d857
Total=396 Pass=395 Fail=1
Pass:
54 Stability IS1-16.topo
54 Pkey IS1-16.topo
54 OsmStress IS1-16.topo
54 Multicast IS1-16.topo
54 LidMgr IS1-16.topo
18 Stability IS3-loop.topo
18 Stability IS3-128.topo
18 Pkey IS3-128.topo
18 OsmStress IS3-128.topo
18 Multicast IS3-loop.topo
18 Multicast IS3-128.topo
17 LidMgr IS3-128.topo
Failures:
1 LidMgr IS3-128.topo
From halr at voltaire.com Fri Dec 22 06:25:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Dec 2006 09:25:59 -0500 Subject: [openib-general] [PATCH 2/2] osmtest/osmtest.c: More SA SMInfoRecord tests Message-ID: <1166797547.4519.181603.camel@hal.voltaire.com> osmtest/osmtest.c: More SA SMInfoRecord tests Signed-off-by: Hal Rosenstock diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 6afa899..0ccc06c 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -69,6 +69,14 @@ #define POOL_MIN_ITEMS 64 #define GUID_ARRAY_SIZE 64 +typedef struct _osmtest_sm_info_rec +{ + ib_net64_t sm_guid; + ib_net16_t lid; + uint8_t priority; + uint8_t sm_state; +} osmtest_sm_info_rec_t; + typedef struct _osmtest_inform_info { boolean_t subscribe; @@ -4756,9 +4764,11 @@ osmtest_get_lft_rec_by_lid( IN osmtest_t /********************************************************************** **********************************************************************/ -ib_api_status_t +static ib_api_status_t osmtest_sminfo_record_request( IN osmtest_t * const p_osmt, + IN uint8_t method, + IN void *p_options, IN OUT osmtest_req_context_t * const p_context ) { ib_api_status_t status = IB_SUCCESS; @@ -4766,6 +4776,7 @@
osmtest_sminfo_record_request( osmv_query_req_t req; ib_sminfo_record_t record; ib_mad_t *p_mad; + osmtest_sm_info_rec_t *p_sm_info_opt; OSM_LOG_ENTER( &p_osmt->log, osmtest_sminfo_record_request ); @@ -4783,6 +4794,29 @@ osmtest_sminfo_record_request( p_context->p_osmt = p_osmt; user.attr_id = IB_MAD_ATTR_SMINFO_RECORD; user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) ); + p_sm_info_opt = p_options; + if (p_sm_info_opt->sm_guid != 0) + { + record.sm_info.guid = p_sm_info_opt->sm_guid; + user.comp_mask |= IB_SMIR_COMPMASK_GUID; + } + if (p_sm_info_opt->lid != 0) + { + record.lid = p_sm_info_opt->lid; + user.comp_mask |= IB_SMIR_COMPMASK_LID; + } + if (p_sm_info_opt->priority != 0) + { + record.sm_info.pri_state = (p_sm_info_opt->priority & 0x0F)<<4; + user.comp_mask |= IB_SMIR_COMPMASK_PRIORITY; + } + if (p_sm_info_opt->sm_state != 0) + { + record.sm_info.pri_state |= p_sm_info_opt->sm_state & 0x0F; + user.comp_mask |= IB_SMIR_COMPMASK_SMSTATE; + } + + user.method = method; user.p_attr = &record; req.query_type = OSMV_QUERY_USER_DEFINED; @@ -4808,9 +4842,12 @@ osmtest_sminfo_record_request( if( status != IB_SUCCESS ) { - osm_log( &p_osmt->log, OSM_LOG_ERROR, - "osmtest_sminfo_record_request: ERR 008D: " - "ib_query failed (%s)\n", ib_get_err_str( status ) ); + if (status != IB_INVALID_PARAMETER) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_sminfo_record_request: ERR 008D: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + } if( status == IB_REMOTE_ERROR ) { p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw ); @@ -4831,7 +4868,7 @@ osmtest_sminfo_record_request( /********************************************************************** **********************************************************************/ -ib_api_status_t +static ib_api_status_t osmtest_informinfo_request( IN osmtest_t * const p_osmt, IN ib_net16_t attr_id, @@ -5553,6 +5590,7 @@ osmtest_validate_against_db( IN osmtest_ { ib_api_status_t status = 
IB_SUCCESS; ib_gid_t portgid, mgid; + osmtest_sm_info_rec_t sm_info_rec_opt; osmtest_inform_info_t inform_info_opt; osmtest_inform_info_rec_t inform_info_rec_opt; #ifdef VENDOR_RMPP_SUPPORT @@ -5563,6 +5601,7 @@ osmtest_validate_against_db( IN osmtest_ #ifdef DUAL_SIDED_RMPP osmv_multipath_req_t request; #endif + int i; #endif OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_against_db ); @@ -5812,12 +5851,71 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; - /* SMInfoRecord test */ + /* SMInfoRecord tests */ + memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_SET, + &sm_info_rec_opt, &context ); + if ( status == IB_SUCCESS ) + { + status = IB_ERROR; + goto Exit; + } + else + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_sminfo_request: " + "IS EXPECTED ERROR ^^^^\n"); + } + + memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) ); + memset( &context, 0, sizeof( context ) ); + status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE, + &sm_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) ); + sm_info_rec_opt.lid = test_lid; /* local LID */ memset( &context, 0, sizeof( context ) ); - status = osmtest_sminfo_record_request( p_osmt, &context ); + status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE, + &sm_info_rec_opt, &context ); if ( status != IB_SUCCESS ) goto Exit; + if (portguid != 0) + { + memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) ); + sm_info_rec_opt.sm_guid = portguid; /* local GUID */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE, + &sm_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + } + + for (i = 1; i < 16; i++) + { + memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) ); + 
sm_info_rec_opt.priority = i; + memset( &context, 0, sizeof( context ) ); + status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE, + &sm_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + } + + for (i = 1; i < 4; i++) + { + memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) ); + sm_info_rec_opt.sm_state = i; + memset( &context, 0, sizeof( context ) ); + status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE, + &sm_info_rec_opt, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + } + /* InformInfoRecord tests */ osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_informinfo_request: InformInfoRecord " From halr at voltaire.com Fri Dec 22 06:25:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Dec 2006 09:25:46 -0500 Subject: [openib-general] [PATCH 1/2] OpenSM: Better SA SMInfoRecord support Message-ID: <1166797545.4519.181601.camel@hal.voltaire.com> OpenSM: Better SA SMInfoRecord support Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_sa_sminfo_record.h b/osm/include/opensm/osm_sa_sminfo_record.h index 60bfe82..cafc09b 100644 --- a/osm/include/opensm/osm_sa_sminfo_record.h +++ b/osm/include/opensm/osm_sa_sminfo_record.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* @@ -85,7 +85,6 @@ BEGIN_C_DECLS * Ranjit Pandit, Intel * *********/ - /****s* OpenSM: SM Info Receiver/osm_smir_rcv_t * NAME * osm_smir_rcv_t @@ -106,6 +105,7 @@ typedef struct _osm_smir osm_mad_pool_t* p_mad_pool; osm_log_t* p_log; cl_plock_t* p_lock; + cl_qlock_pool_t pool; } osm_smir_rcv_t; /* * FIELDS diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c index 816e10d..440d773 100644 --- a/osm/opensm/osm_sa_class_port_info.c +++ b/osm/opensm/osm_sa_class_port_info.c @@ -197,7 +197,6 @@ __osm_cpi_rcv_respond( SwitchInfoRecord, RandomForwardingTableRecord, MulticastForwardingTableRecord, - SMInfoRecord (partial support), ServiceAssociationRecord other optional records supported "under the table" diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c index 62467c1..7a82b84 100644 --- a/osm/opensm/osm_sa_sminfo_record.c +++ b/osm/opensm/osm_sa_sminfo_record.c @@ -33,7 +33,6 @@ * */ - /* * Abstract: * Implementation of osm_smir_rcv_t. 
@@ -68,6 +67,25 @@ #include #include #include +#include + +#define OSM_SMIR_RCV_POOL_MIN_SIZE 32 +#define OSM_SMIR_RCV_POOL_GROW_SIZE 32 + +typedef struct _osm_smir_item +{ + cl_pool_item_t pool_item; + ib_sminfo_record_t rec; +} osm_smir_item_t; + +typedef struct _osm_smir_search_ctxt +{ + const ib_sminfo_record_t* p_rcvd_rec; + ib_net64_t comp_mask; + cl_qlist_t* p_list; + osm_smir_rcv_t* p_rcv; + const osm_physp_t* p_req_physp; +} osm_smir_search_ctxt_t; /********************************************************************** **********************************************************************/ @@ -76,6 +94,7 @@ osm_smir_rcv_construct( IN osm_smir_rcv_t* const p_rcv ) { memset( p_rcv, 0, sizeof(*p_rcv) ); + cl_qlock_pool_construct( &p_rcv->pool ); } /********************************************************************** @@ -87,7 +106,7 @@ osm_smir_rcv_destroy( CL_ASSERT( p_rcv ); OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy ); - + cl_qlock_pool_destroy( &p_rcv->pool ); OSM_LOG_EXIT( p_rcv->p_log ); } @@ -116,26 +135,155 @@ osm_smir_rcv_init( p_rcv->p_stats = p_stats; p_rcv->p_mad_pool = p_mad_pool; + status = cl_qlock_pool_init( &p_rcv->pool, + OSM_SMIR_RCV_POOL_MIN_SIZE, + 0, + OSM_SMIR_RCV_POOL_GROW_SIZE, + sizeof(osm_smir_item_t), + NULL, NULL, NULL ); + OSM_LOG_EXIT( p_rcv->p_log ); return( status ); } +static ib_api_status_t +__osm_smir_rcv_new_smir( + IN osm_smir_rcv_t* const p_rcv, + IN const osm_port_t* const p_port, + IN cl_qlist_t* const p_list, + IN ib_net64_t const guid, + IN ib_net32_t const act_count, + IN uint8_t const pri_state, + IN const osm_physp_t* const p_req_physp ) +{ + osm_smir_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_smir_rcv_new_smir ); + + p_rec_item = (osm_smir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_smir_rcv_new_smir: ERR 2801: " + "cl_qlock_pool_get failed\n" ); + status = 
IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_smir_rcv_new_smir: " + "New SMInfo: GUID 0x%016" PRIx64 "\n", + cl_ntoh64( guid ) + ); + } + + memset( &p_rec_item->rec, 0, sizeof(ib_sminfo_record_t) ); + + p_rec_item->rec.lid = osm_port_get_base_lid( p_port ); + p_rec_item->rec.sm_info.guid = guid; + p_rec_item->rec.sm_info.act_count = act_count; + p_rec_item->rec.sm_info.pri_state = pri_state; + + cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static void +__osm_sa_smir_by_comp_mask( + IN osm_smir_rcv_t* const p_rcv, + IN const osm_remote_sm_t* const p_rem_sm, + osm_smir_search_ctxt_t* const p_ctxt ) +{ + const ib_sminfo_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec; + const osm_physp_t* const p_req_physp = p_ctxt->p_req_physp; + ib_net64_t const comp_mask = p_ctxt->comp_mask; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_smir_by_comp_mask ); + + if ( comp_mask & IB_SMIR_COMPMASK_GUID ) + { + if ( p_rem_sm->smi.guid != p_rcvd_rec->sm_info.guid ) + goto Exit; + } + + if ( comp_mask & IB_SMIR_COMPMASK_PRIORITY ) + { + if ( ib_sminfo_get_priority( &p_rem_sm->smi ) != + ib_sminfo_get_priority( &p_rcvd_rec->sm_info ) ) + goto Exit; + } + + if ( comp_mask & IB_SMIR_COMPMASK_SMSTATE ) + { + if ( ib_sminfo_get_state( &p_rem_sm->smi ) != + ib_sminfo_get_state( &p_rcvd_rec->sm_info ) ) + goto Exit; + } + + /* Implement any other needed search cases */ + + __osm_smir_rcv_new_smir( p_rcv, p_rem_sm->p_port, p_ctxt->p_list, + p_rem_sm->smi.guid, + p_rem_sm->smi.act_count, + p_rem_sm->smi.pri_state, + p_req_physp ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + 
**********************************************************************/ +static void +__osm_sa_smir_by_comp_mask_cb( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + const osm_remote_sm_t* const p_rem_sm = (osm_remote_sm_t*)p_map_item; + osm_smir_search_ctxt_t* const p_ctxt = (osm_smir_search_ctxt_t *)context; + + __osm_sa_smir_by_comp_mask( p_ctxt->p_rcv, p_rem_sm, p_ctxt ); +} + /********************************************************************** **********************************************************************/ void osm_smir_rcv_process( - IN osm_smir_rcv_t* const p_rcv, - IN const osm_madw_t* const p_madw ) + IN osm_smir_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) { - const ib_sminfo_record_t* p_sminfo_rec; - ib_sminfo_record_t* p_resp_sminfo_rec; - const ib_sa_mad_t* p_sa_mad; - ib_sa_mad_t* p_resp_sa_mad; - osm_madw_t* p_resp_madw; - ib_api_status_t status; - osm_physp_t* p_req_physp; - ib_net64_t local_guid; - osm_port_t* local_port; + const ib_sa_mad_t* p_rcvd_mad; + const ib_sminfo_record_t* p_rcvd_rec; + const cl_qmap_t* p_tbl; + const osm_port_t* p_port = NULL; + const ib_sm_info_t* p_smi; + cl_qlist_t rec_list; + osm_madw_t* p_resp_madw; + ib_sa_mad_t* p_resp_sa_mad; + ib_sminfo_record_t* p_resp_rec; + uint32_t num_rec, pre_trim_num_rec; +#ifndef VENDOR_RMPP_SUPPORT + uint32_t trim_num_rec; +#endif + uint32_t i; + osm_smir_search_ctxt_t context; + osm_smir_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + ib_net64_t comp_mask; + ib_net64_t port_guid; + osm_physp_t* p_req_physp; + osm_port_t* local_port; + osm_remote_sm_t* p_rem_sm; + cl_qmap_t* p_sm_guid_tbl; + uint8_t pri_state; CL_ASSERT( p_rcv ); @@ -143,19 +291,20 @@ osm_smir_rcv_process( CL_ASSERT( p_madw ); - p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw ); - p_sminfo_rec = (ib_sminfo_record_t*)ib_sa_mad_get_payload_ptr( p_sa_mad ); + p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_rcvd_rec = (ib_sminfo_record_t*)ib_sa_mad_get_payload_ptr( 
p_rcvd_mad ); + comp_mask = p_rcvd_mad->comp_mask; - CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD ); + CL_ASSERT( p_rcvd_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD ); /* we only support SubnAdmGet and SubnAdmGetTable methods */ - if ( (p_sa_mad->method != IB_MAD_METHOD_GET) && - (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) ) + if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) && + (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "osm_smir_rcv_process: ERR 2804: " "Unsupported Method (%s)\n", - ib_get_sa_method_str( p_sa_mad->method ) ); + ib_get_sa_method_str( p_rcvd_mad->method ) ); osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR ); goto Exit; } @@ -173,72 +322,251 @@ osm_smir_rcv_process( } if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) - osm_dump_sm_info_record( p_rcv->p_log, p_sminfo_rec, OSM_LOG_DEBUG ); + osm_dump_sm_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); - /* check the matching of pkeys with the local physp the SM is on. 
*/ - local_guid = p_rcv->p_subn->sm_port_guid; - local_port = (osm_port_t*)cl_qmap_get( &p_rcv->p_subn->port_guid_tbl, local_guid ); - if (FALSE == - osm_physp_share_pkey( p_rcv->p_log, p_req_physp, - osm_port_get_default_phys_ptr( local_port ) ) ) + p_tbl = &p_rcv->p_subn->sm_guid_tbl; + p_smi = &p_rcvd_rec->sm_info; + + cl_qlist_init( &rec_list ); + + context.p_rcvd_rec = p_rcvd_rec; + context.p_list = &rec_list; + context.comp_mask = p_rcvd_mad->comp_mask; + context.p_rcv = p_rcv; + context.p_req_physp = p_req_physp; + + cl_plock_acquire( p_rcv->p_lock ); + + /* + If the user specified a LID, it obviously narrows our + work load, since we don't have to search every port + */ + if( comp_mask & IB_SMIR_COMPMASK_LID ) { - osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "osm_smir_rcv_process: ERR 2805: " - "Cannot get SMInfo record due to pkey violation\n" ); + status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port ); + if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) ) + { + status = IB_NOT_FOUND; + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_smir_rcv_process: ERR 2806: " + "No port found with LID 0x%x\n", + cl_ntoh16(p_rcvd_rec->lid) ); + } + } + + if ( status == IB_SUCCESS ) + { + /* Handle our own SM first */ + local_port = osm_get_port_by_guid( p_rcv->p_subn, p_rcv->p_subn->sm_port_guid ); + if ( !local_port ) + { + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_smir_rcv_process: ERR 2809: " + "No port found with GUID 0x%016" PRIx64 "\n", + cl_ntoh64(p_rcv->p_subn->sm_port_guid ) ); + goto Exit; + } + + if ( !p_port || local_port == p_port ) + { + if (FALSE == + osm_physp_share_pkey( p_rcv->p_log, p_req_physp, + osm_port_get_default_phys_ptr( local_port ) ) ) + { + cl_plock_release( p_rcv->p_lock ); + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_smir_rcv_process: ERR 2805: " + "Cannot get SMInfo record due to pkey violation\n" ); + goto Exit; + } + + /* Check that other search components specified match */ + if 
( comp_mask & IB_SMIR_COMPMASK_GUID ) + { + if ( p_rcv->p_subn->sm_port_guid != p_smi->guid ) + goto Remotes; + } + if ( comp_mask & IB_SMIR_COMPMASK_PRIORITY ) + { + if ( p_rcv->p_subn->opt.sm_priority != ib_sminfo_get_priority( p_smi ) ) + goto Remotes; + } + if ( comp_mask & IB_SMIR_COMPMASK_SMSTATE ) + { + if ( p_rcv->p_subn->sm_state != ib_sminfo_get_state( p_smi ) ) + goto Remotes; + } + + /* Now, add local SMInfo to list */ + pri_state = p_rcv->p_subn->sm_state & 0x0F; + pri_state |= (p_rcv->p_subn->opt.sm_priority & 0x0F) << 4; + __osm_smir_rcv_new_smir( p_rcv, local_port, context.p_list, + p_rcv->p_subn->sm_port_guid, + cl_ntoh32( p_rcv->p_stats->qp0_mads_sent ), + pri_state, + p_req_physp ); + } + + Remotes: + if( p_port && p_port != local_port ) + { + /* Find remote SM corresponding to p_port */ + port_guid = osm_port_get_guid( p_port ); + p_sm_guid_tbl = &p_rcv->p_subn->sm_guid_tbl; + p_rem_sm = (osm_remote_sm_t*)cl_qmap_get( p_sm_guid_tbl, port_guid ); + if (p_rem_sm != (osm_remote_sm_t*)cl_qmap_end( p_sm_guid_tbl ) ) + __osm_sa_smir_by_comp_mask( p_rcv, p_rem_sm, &context ); + else + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_smir_rcv_process: ERR 280A: " + "No remote SM for GUID 0x%016" PRIx64 "\n", + cl_ntoh64( port_guid ) ); + } + } + else + { + /* Go over all other known (remote) SMs */ + cl_qmap_apply_func( &p_rcv->p_subn->sm_guid_tbl, + __osm_sa_smir_by_comp_mask_cb, + &context ); + } + } + + cl_plock_release( p_rcv->p_lock ); + + num_rec = cl_qlist_count( &rec_list ); + + /* + * C15-0.1.30: + * If we do a SubnAdmGet and got more than one record it is an error ! 
+ */ + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_smir_rcv_process: ERR 2808: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); + + /* need to set the mem free ... */ + p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_smir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } + } + + pre_trim_num_rec = num_rec; +#ifndef VENDOR_RMPP_SUPPORT + trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_sminfo_record_t); + if (trim_num_rec < num_rec) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "osm_smir_rcv_process: " + "Number of records:%u trimmed to:%u to fit in one MAD\n", + num_rec, trim_num_rec ); + num_rec = trim_num_rec; + } +#endif + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_smir_rcv_process: " + "Returning %u records\n", num_rec ); + + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RECORDS ); goto Exit; } - /* - * Get a MAD to reply. Address of Mad is in the received mad_wrapper + /* + * Get a MAD to reply. 
Address of Mad is in the received mad_wrapper */ - p_resp_madw = osm_mad_pool_get(p_rcv->p_mad_pool, - p_madw->h_bind, - sizeof(ib_sminfo_record_t) + IB_SA_MAD_HDR_SIZE, - &p_madw->mad_addr ); + p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, + p_madw->h_bind, + num_rec * sizeof(ib_sminfo_record_t) + IB_SA_MAD_HDR_SIZE, + &p_madw->mad_addr ); + if( !p_resp_madw ) { osm_log(p_rcv->p_log, OSM_LOG_ERROR, - "osm_smir_rcv_process: ERR 2801: " - "Unable to acquire response MAD\n" ); + "osm_smir_rcv_process: ERR 2807: " + "osm_mad_pool_get failed\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list ); + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RESOURCES ); + goto Exit; } p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw ); - p_resp_sminfo_rec = - (ib_sminfo_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); - p_resp_sminfo_rec->resv0 = 0; - - /* HACK: This handling is incorrect. Records of known SMs - by our SM, and not just the details of our own SM - should be returned. */ - - cl_plock_acquire( p_rcv->p_lock ); + /* + Copy the MAD header back into the response mad. + Set the 'R' bit and the payload length, + Then copy all records from the list into the response payload. 
+ */ - /* get our local sm_base_lid to send in the sminfo */ - p_resp_sminfo_rec->lid = p_rcv->p_subn->sm_base_lid; - p_resp_sminfo_rec->sm_info.guid = p_rcv->p_subn->sm_port_guid; - p_resp_sminfo_rec->sm_info.sm_key = p_rcv->p_subn->opt.sm_key; - p_resp_sminfo_rec->sm_info.act_count = - cl_ntoh32(p_rcv->p_stats->qp0_mads_sent); - p_resp_sminfo_rec->sm_info.pri_state = p_rcv->p_subn->sm_state; + memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE ); + p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK; + /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */ + p_resp_sa_mad->sm_key = 0; + /* Fill in the offset (paylen will be done by the rmpp SAR) */ + p_resp_sa_mad->attr_offset = + ib_get_attr_offset( sizeof(ib_sminfo_record_t) ); - cl_plock_release( p_rcv->p_lock ); + p_resp_rec = (ib_sminfo_record_t*) + ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); - /* Copy the MAD header back into the response mad */ - memcpy( p_resp_sa_mad, p_sa_mad, IB_SA_MAD_HDR_SIZE ); - if( p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE ) +#ifndef VENDOR_RMPP_SUPPORT + /* we support only one packet RMPP - so we will set the first and + last flags for gettable */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) { - p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; - /* Fill in the offset (paylen will be done by the rmpp SAR) */ - p_resp_sa_mad->attr_offset = - ib_get_attr_offset( sizeof(ib_sminfo_record_t) ); + p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA; + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE; } +#else + /* forcefully define the packet as RMPP one */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; +#endif - p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK; + for( i = 0; i < pre_trim_num_rec; i++ ) + { + p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list ); + /* copy only if not trimmed */ + if (i < num_rec) + { + *p_resp_rec = p_rec_item->rec; + 
p_resp_rec->sm_info.sm_key = 0; + } + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_resp_rec++; + } - /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */ - p_resp_sa_mad->sm_key = 0; + CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); if( status != IB_SUCCESS ) From svenar at simula.no Fri Dec 22 09:10:52 2006 From: svenar at simula.no (Sven-Arne Reinemo) Date: Fri, 22 Dec 2006 18:10:52 +0100 Subject: [openib-general] SA redirect Message-ID: <458C119C.6090302@simula.no> Hi, One quick question, is SA redirection supported by OpenSM? I did a check, but could not find any information about this. -- SAR ---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ---- "There are only 10 kinds of people in this world; those who know binary and those who don't." -- Unknown From svenar at simula.no Fri Dec 22 09:07:37 2006 From: svenar at simula.no (Sven-Arne Reinemo) Date: Fri, 22 Dec 2006 18:07:37 +0100 Subject: [openib-general] Bug in IBMgtSim? Message-ID: <458C10D9.2080909@simula.no> Hi, There seemed to be a bug in IBMgtSim where it forwards packets received from the SM back onto the port where the SM is connected. OpenSM just drops the packet so it does not seem very problematic, but I am just checking to see if anyone else see this behaviour. Below are an example from the log files. Packet dropped by OpenSM: Dec 15 15:21:34 992520 [B44E3BB0] -> Duplicate TID 0x6A7B00001234 received (not a response). Dropping the MAD. 
Packets forwarded to the SM (the first one is the one that is dropped): -I- Using random seed:96960 -I- Parsing topology definition:/simulator/scalability/test_top.topo -I- Defined 3/3 systems/nodes -I- Init fabric: fabric:1 -I- Started server: opensm.simula.no port:60493 -I- Ready to serve -I- Connecting: sock9 127.0.1.1 48227 -------------------------------------------------------- sl 0x0 pkey_index 0x0 slid 0x0 dlid 0xffff sqpn 0x0 dqpn 0x0 -------------------------------------------------------- -------------------------------------------------------- base_ver 0x1 mgmt_class 0x81 class_ver 0x1 method 0x1 status 0x0 class_spec 0x100 trans_id 0x00006a7b00001234 attr_id 0x11 attr_mod 0x0 -------------------------------------------------------- -------------------------------------------------------- sl 0x0 pkey_index 0x0 slid 0xffff dlid 0x0 sqpn 0x0 dqpn 0x0 -------------------------------------------------------- -------------------------------------------------------- base_ver 0x1 mgmt_class 0x81 class_ver 0x1 method 0x81 status 0x8000 class_spec 0x0 trans_id 0x00006a7b00001234 attr_id 0x11 attr_mod 0x0 -------------------------------------------------------- -- SAR ---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ---- "There are only 10 kinds of people in this world; those who know binary and those who don't." -- Unknown
From halr at voltaire.com Fri Dec 22 11:49:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 22 Dec 2006 14:49:08 -0500 Subject: [openib-general] SA redirect In-Reply-To: <458C119C.6090302@simula.no> References: <458C119C.6090302@simula.no> Message-ID: <1166816948.4519.197118.camel@hal.voltaire.com> On Fri, 2006-12-22 at 12:10, Sven-Arne Reinemo wrote: > Hi, > > One quick question, is SA redirection supported by OpenSM? I did a > check, but could not find any information about this. It's not currently supported. -- Hal From jsquyres at cisco.com Fri Dec 22 12:31:08 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Fri, 22 Dec 2006 15:31:08 -0500 Subject: [openib-general] DNS changes Message-ID: In order to move on to the next phase of the server transition, we have made some changes to the OFA DNS. Most users should not notice the changes (if you encounter any problems, please let me know ASAP); the gist of it is that we have a few more names that are slowly creeping their way around the world: ssh.openfabrics.org svn.openfabrics.org git.openfabrics.org lists.openfabrics.org wiki.openfabrics.org bugs.openfabrics.org www2.openfabrics.org All of these point to the new server. "www2" is only for testing purposes; it will eventually go away when "www" is switched to point to the new server. Back-end services are not yet hooked up to these names; we'll do that in the not-distant future (probably in the new year at this point). Also note that the name "staging.openfabrics.org" will eventually go away -- at some point after all the new names are in place and the dust has settled. There will be adequate warning before this occurs (so that you can get new git checkouts, etc.), so consider this an early warning. Happy holidays!
-- Jeff Squyres Server Virtualization Business Unit Cisco Systems From eitan at sw053.yok.mtl.com Fri Dec 22 21:28:23 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Sat, 23 Dec 2006 07:28:23 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion Message-ID: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1 ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 Total=396 Pass=393 Fail=3 Pass: 54 Stability IS1-16.topo 54 Pkey IS1-16.topo 54 Multicast IS1-16.topo 54 LidMgr IS1-16.topo 53 OsmStress IS1-16.topo 18 Stability IS3-loop.topo 18 Stability IS3-128.topo 18 OsmStress IS3-128.topo 18 Multicast IS3-loop.topo 18 LidMgr IS3-128.topo 17 Pkey IS3-128.topo 17 Multicast IS3-128.topo Failures: 1 Pkey IS3-128.topo 1 OsmStress IS1-16.topo 1 Multicast IS3-128.topo From eitan at mellanox.co.il Sat Dec 23 00:52:24 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 23 Dec 2006 10:52:24 +0200 Subject: [openib-general] Bug in IBMgtSim? In-Reply-To: <458C10D9.2080909@simula.no> References: <458C10D9.2080909@simula.no> Message-ID: <458CEE48.3020207@mellanox.co.il> Hi Sven, Yes this is the behavior of the simulator. Any MAD leaving from a node and target to the same node also appears back at the sender input. This is also the behavior of the gen1 IB stack. As you pointed out the SM (osm vendor layer) is capable of dropping these duplicated MADs. The "bug" stems from the fact the MADs when injected into the simulator do not carry any information about the client which injected them.
So when they finally arrive to the destination they are being forwarded to all "MAD processors" attached to that node (to the specific "management class"). So if the SM is registered as "MAD Processor" for SMPs on node A and sends SMP to node A the SMP will be received on the SM as well as on any other "MAD Processor" attached to that node (e.g. the SMA on the node A). Eitan Sven-Arne Reinemo wrote: > Hi, > > There seemed to be a bug in IBMgtSim where it forwards packets received > from the SM back onto the port where the SM is connected. OpenSM just > drops the packet so it does not seem very problematic, but I am just > checking to see if anyone else see this behaviour. Below are an example > from the log files. > > Packet dropped by OpenSM: > > Dec 15 15:21:34 992520 [B44E3BB0] -> Duplicate TID 0x6A7B00001234 > received (not a response). Dropping the MAD. > > > Packets forwarded to the SM (the first one is the one that is dropped): > > -I- Using random seed:96960 > -I- Parsing topology definition:/simulator/scalability/test_top.topo > -I- Defined 3/3 systems/nodes > -I- Init fabric: fabric:1 > -I- Started server: opensm.simula.no port:60493 > -I- Ready to serve > -I- Connecting: sock9 127.0.1.1 48227 > -------------------------------------------------------- > sl 0x0 > pkey_index 0x0 > slid 0x0 > dlid 0xffff > sqpn 0x0 > dqpn 0x0 > -------------------------------------------------------- > -------------------------------------------------------- > base_ver 0x1 > mgmt_class 0x81 > class_ver 0x1 > method 0x1 > status 0x0 > class_spec 0x100 > trans_id 0x00006a7b00001234 > attr_id 0x11 > attr_mod 0x0 > -------------------------------------------------------- > -------------------------------------------------------- > sl 0x0 > pkey_index 0x0 > slid 0xffff > dlid 0x0 > sqpn 0x0 > dqpn 0x0 > -------------------------------------------------------- > -------------------------------------------------------- > base_ver 0x1 > mgmt_class 0x81 > class_ver 0x1 > method 0x81 
> status 0x8000 > class_spec 0x0 > trans_id 0x00006a7b00001234 > attr_id 0x11 > attr_mod 0x0 > -------------------------------------------------------- > > > From eitan at mellanox.co.il Sat Dec 23 01:30:09 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 23 Dec 2006 11:30:09 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion In-Reply-To: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com> References: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com> Message-ID: <458CF721.9030909@mellanox.co.il> Analysis of the 3 failures: 1. TEST BUG: osmStress - somehow the simulation caused the local port to be turned down (must be a bug in the random error injection which should avoid that port). So the simulation ends with OpenSM still trying to connect to the network. 2. Multicast - The osm.mcfdbs is empty. Apparently no Joins where received by the SM. This will require further debug. 3. PKey: the test fails on obtaining ALL path records for the 128 node case. OpenSM complain about timeout during the RMPP transaction. I will add some more time to the transaction timeout for the simulation. 
EZ Eitan Zahavi wrote: > OSM Simulation Regression Summary > OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1 > ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 > Total=396 Pass=393 Fail=3 > > Pass: > 54 Stability IS1-16.topo > 54 Pkey IS1-16.topo > 54 Multicast IS1-16.topo > 54 LidMgr IS1-16.topo > 53 OsmStress IS1-16.topo > 18 Stability IS3-loop.topo > 18 Stability IS3-128.topo > 18 OsmStress IS3-128.topo > 18 Multicast IS3-loop.topo > 18 LidMgr IS3-128.topo > 17 Pkey IS3-128.topo > 17 Multicast IS3-128.topo > > Failures: > 1 Pkey IS3-128.topo > 1 OsmStress IS1-16.topo > 1 Multicast IS3-128.topo > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at sw053.yok.mtl.com Sat Dec 23 09:49:28 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Sat, 23 Dec 2006 19:49:28 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion Message-ID: <200612231749.kBNHnSWU029121@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1 ibutils rev = Sat_Dec_23_17:47:24_2006 2057e4 Total=81 Pass=80 Fail=1 Pass: 9 Stability IS1-16.topo 9 Pkey IS1-16.topo 9 OsmTest IS1-16.topo 9 OsmStress IS1-16.topo 9 Multicast IS1-16.topo 9 LidMgr IS1-16.topo 3 Stability IS3-loop.topo 3 Stability IS3-128.topo 3 Pkey IS3-128.topo 3 OsmTest IS3-loop.topo 3 OsmTest IS3-128.topo 3 OsmStress IS3-128.topo 3 Multicast IS3-loop.topo 3 LidMgr IS3-128.topo 2 Multicast IS3-128.topo Failures: 1 Multicast IS3-128.topo From halr at voltaire.com Sat Dec 23 17:31:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 23 Dec 2006 20:31:14 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion In-Reply-To: <458CF721.9030909@mellanox.co.il> References: 
<200612230528.kBN5SNOl018673@sw053.yok.mtl.com> <458CF721.9030909@mellanox.co.il> Message-ID: <1166923872.4519.284465.camel@hal.voltaire.com> On Sat, 2006-12-23 at 04:30, Eitan Zahavi wrote: > Analysis of the 3 failures: > 1. TEST BUG: osmStress - somehow the simulation caused the local port to > be turned down (must be a bug in the random error injection which > should avoid that port). So the simulation ends with OpenSM still > trying to connect to the network. > 2. Multicast - The osm.mcfdbs is empty. Apparently no Joins where > received by the SM. This will require further debug. > 3. PKey: the test fails on obtaining ALL path records for the 128 node > case. OpenSM complain about timeout during the RMPP transaction. Why did this happen now ? > I will add some more time to the transaction timeout for the simulation. > > EZ > > Eitan Zahavi wrote: > > OSM Simulation Regression Summary > > OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1 > > ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 > > Total=396 Pass=393 Fail=3 > > > > Pass: > > 54 Stability IS1-16.topo > > 54 Pkey IS1-16.topo > > 54 Multicast IS1-16.topo > > 54 LidMgr IS1-16.topo > > 53 OsmStress IS1-16.topo > > 18 Stability IS3-loop.topo > > 18 Stability IS3-128.topo > > 18 OsmStress IS3-128.topo > > 18 Multicast IS3-loop.topo > > 18 LidMgr IS3-128.topo > > 17 Pkey IS3-128.topo > > 17 Multicast IS3-128.topo > > > > Failures: > > 1 Pkey IS3-128.topo > > 1 OsmStress IS1-16.topo > > 1 Multicast IS3-128.topo > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From eitan at mellanox.co.il Sat Dec 23 23:16:38 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 24 Dec 2006 09:16:38 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion In-Reply-To: 
<1166923872.4519.284465.camel@hal.voltaire.com> References: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com> <458CF721.9030909@mellanox.co.il> <1166923872.4519.284465.camel@hal.voltaire.com> Message-ID: <458E2956.7090700@mellanox.co.il> Hal Rosenstock wrote: > On Sat, 2006-12-23 at 04:30, Eitan Zahavi wrote: > >> Analysis of the 3 failures: >> 1. TEST BUG: osmStress - somehow the simulation caused the local port to >> be turned down (must be a bug in the random error injection which >> should avoid that port). So the simulation ends with OpenSM still >> trying to connect to the network. >> 2. Multicast - The osm.mcfdbs is empty. Apparently no Joins where >> received by the SM. This will require further debug. >> I have found a design bug in the queue of MADs waiting for dispatching. I used an STL map which means that if two MADs where scheduled for the exact same time the last time was purging the previous one. The fix is simple - use multimap instead. It is under testing now. >> 3. PKey: the test fails on obtaining ALL path records for the 128 node >> case. OpenSM complain about timeout during the RMPP transaction. >> > > Why did this happen now ? > Wish I knew. As I said it is under investigation. Might just be a big enough transaction to overflow the 2sec timeout. > >> I will add some more time to the transaction timeout for the simulation. 
>> >> EZ >> >> Eitan Zahavi wrote: >> >>> OSM Simulation Regression Summary >>> OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1 >>> ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 >>> Total=396 Pass=393 Fail=3 >>> >>> Pass: >>> 54 Stability IS1-16.topo >>> 54 Pkey IS1-16.topo >>> 54 Multicast IS1-16.topo >>> 54 LidMgr IS1-16.topo >>> 53 OsmStress IS1-16.topo >>> 18 Stability IS3-loop.topo >>> 18 Stability IS3-128.topo >>> 18 OsmStress IS3-128.topo >>> 18 Multicast IS3-loop.topo >>> 18 LidMgr IS3-128.topo >>> 17 Pkey IS3-128.topo >>> 17 Multicast IS3-128.topo >>> >>> Failures: >>> 1 Pkey IS3-128.topo >>> 1 OsmStress IS1-16.topo >>> 1 Multicast IS3-128.topo >>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Sun Dec 24 00:49:25 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 24 Dec 2006 10:49:25 +0200 Subject: [openib-general] [PATCH v4 01/13] Linux RDMA Core Changes In-Reply-To: <20061214135303.21159.61880.stgit@dell3.ogc.int> References: <20061214135233.21159.78613.stgit@dell3.ogc.int> <20061214135303.21159.61880.stgit@dell3.ogc.int> Message-ID: <20061224084925.GD15106@mellanox.co.il> > diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c > index 283d50b..15cbd49 100644 > --- a/drivers/infiniband/hw/mthca/mthca_cq.c > +++ b/drivers/infiniband/hw/mthca/mthca_cq.c > @@ -722,7 +722,8 @@ repoll: > return err == 0 || err == -EAGAIN ? 
npolled : err; > } > > -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) > +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, > + struct ib_udata *udata) > { > __be32 doorbell[2]; > > @@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, > return 0; > } > > -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) > +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, > + struct ib_udata *udata) > { > struct mthca_cq *cq = to_mcq(ibcq); > __be32 doorbell[2]; > diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h > index fe5cecf..6b9ccf6 100644 > --- a/drivers/infiniband/hw/mthca/mthca_dev.h > +++ b/drivers/infiniband/hw/mthca/mthca_dev.h > @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev > > int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, > struct ib_wc *entry); > -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); > -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); > +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); > +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); > int mthca_init_cq(struct mthca_dev *dev, int nent, > struct mthca_ucontext *ctx, u32 pdn, > struct mthca_cq *cq); > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h > index 8eacc35..e3e1a2c 100644 > --- a/include/rdma/ib_verbs.h > +++ b/include/rdma/ib_verbs.h > @@ -941,7 +941,8 @@ struct ib_device { > struct ib_wc *wc); > int (*peek_cq)(struct ib_cq *cq, int wc_cnt); > int (*req_notify_cq)(struct ib_cq *cq, > - enum ib_cq_notify cq_notify); > + enum ib_cq_notify cq_notify, > + struct ib_udata *udata); > int (*req_ncomp_notif)(struct ib_cq *cq, > int wc_cnt); > struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, > @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_ > static inline int ib_req_notify_cq(struct ib_cq *cq, > enum 
ib_cq_notify cq_notify)
> {
> - return cq->device->req_notify_cq(cq, cq_notify);
> + return cq->device->req_notify_cq(cq, cq_notify, NULL);
> }
>
> /**

Can't say I like this adding overhead in data path operations (and note this
can't be optimized out). And kernel consumers work without passing it in, so it
hurts kernel code even for Chelsio. Granted, the cost is small here, but these
things do tend to add up.

It seems all Chelsio needs is to pass in a consumer index - so, how about a new
entry point? Something like void set_cq_udata(struct ib_cq *cq, struct ib_udata *udata)?

--
MST

From eitan at mellanox.co.il Sun Dec 24 04:35:14 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 14:35:14 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4588EAB9.6080106@voltaire.com>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
Message-ID: <458E7402.4000106@mellanox.co.il>

Hi Or,

Sorry it took me a while.

According to the IBTA spec:
1. In order for MTU and MTUSelector to have any effect their component
   mask bits MUST be set to 1 in the query
2. Behavior of the SM is defined with small "freedom" to choose between
   multiple matching MTU values if they exist.
3. The table below summarizes all options:

Assuming the value M represents the lowest MTU on the path
We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
R represents the MTU value in the request. Similarly R-1 is one below R
and R+1 is one above R.

Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
-----------------------------------------------------------------------------------------
UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR

I have built some test code for making sure OpenSM does what is required.
Apparently it does not. In any case the M is not identical to R it fails
the request.

I am working on fixing OpenSM.

Any comments are welcome.

EZ

Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>
>> I am not yet sure what is best for upstream, so I don't really think we need
>> any RFCs.
>>
>
>
>> We'll need data from SM guys on whether MTU selector actually works
>> in SMs, and if not what happens when you enable it.
>>
>
> Eitan,
>
> Can you please post here the tavor-quirk patch which was integrated into
> opensm? i can see the ***code*** of the opensm but might make some wrong
> assumptions or get into wrong understandings as i am not able to see the
> patch as is.
>
> Or.
>
>
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From eitan at mellanox.co.il Sun Dec 24 04:40:18 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 14:40:18 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors
Message-ID: <458E7532.5030400@mellanox.co.il>

Hi Hal,

OpenSM just uses the resulting path MTU/rate/pkt-life and fail the query
even though the selector might be allowing for selecting an appropriate value.

I have made the attached ibis based program for testing MTU select.
After this fix the following results are obtained for a case of path
allowing maximal 2K MTU .
In standard mode: ------------------------------------------------------------ MTU greater then ... 256 (0x01) -> equal to ....... 2K MTU less then ...... 256 (0x41) -> NO PATHS MTU equal to ....... 256 (0x81) -> equal to ....... 256 MTU largest possible 256 (0xc1) -> equal to ....... 2K MTU greater then ... 512 (0x02) -> equal to ....... 2K MTU less then ...... 512 (0x42) -> equal to ....... 256 MTU equal to ....... 512 (0x82) -> equal to ....... 512 MTU largest possible 512 (0xc2) -> equal to ....... 2K MTU greater then ... 1K (0x03) -> equal to ....... 2K MTU less then ...... 1K (0x43) -> equal to ....... 512 MTU equal to ....... 1K (0x83) -> equal to ....... 1K MTU largest possible 1K (0xc3) -> equal to ....... 2K MTU greater then ... 2K (0x04) -> NO PATHS MTU less then ...... 2K (0x44) -> equal to ....... 1K MTU equal to ....... 2K (0x84) -> equal to ....... 2K MTU largest possible 2K (0xc4) -> equal to ....... 2K MTU greater then ... 4K (0x05) -> NO PATHS MTU less then ...... 4K (0x45) -> equal to ....... 2K MTU equal to ....... 4K (0x85) -> NO PATHS MTU largest possible 4K (0xc5) -> equal to ....... 2K ============================================================ With enable_quirks (when one of the ends is a Tavor device): ------------------------------------------------------------ MTU greater then ... 256 (0x01) -> equal to ....... 1K MTU less then ...... 256 (0x41) -> NO PATHS MTU equal to ....... 256 (0x81) -> equal to ....... 256 MTU largest possible 256 (0xc1) -> equal to ....... 2K MTU greater then ... 512 (0x02) -> equal to ....... 1K MTU less then ...... 512 (0x42) -> equal to ....... 256 MTU equal to ....... 512 (0x82) -> equal to ....... 512 MTU largest possible 512 (0xc2) -> equal to ....... 2K MTU greater then ... 1K (0x03) -> NO PATHS MTU less then ...... 1K (0x43) -> equal to ....... 512 MTU equal to ....... 1K (0x83) -> equal to ....... 1K MTU largest possible 1K (0xc3) -> equal to ....... 2K MTU greater then ... 
2K (0x04) -> NO PATHS MTU less then ...... 2K (0x44) -> equal to ....... 1K MTU equal to ....... 2K (0x84) -> equal to ....... 2K MTU largest possible 2K (0xc4) -> equal to ....... 2K MTU greater then ... 4K (0x05) -> NO PATHS MTU less then ...... 4K (0x45) -> equal to ....... 1K MTU equal to ....... 4K (0x85) -> NO PATHS MTU largest possible 4K (0xc5) -> equal to ....... 2K ============================================================ Signed-off-by: Eitan Zahavi --- commit 7a156fd924a543b9891c676024a3dd9a90f848a9 tree 43a00fa2792aeb7d5684c6817154c9338ca96ed9 parent 613e7eea4d14a69e1faaaf251cb88f40dfe5e5a6 author Eitan Zahavi Sun, 24 Dec 2006 14:31:21 +0200 committer Eitan Zahavi Sun, 24 Dec 2006 14:31:21 +0200 osm/opensm/osm_sa_multipath_record.c | 83 +++++++++++++++++++++++----------- osm/opensm/osm_sa_path_record.c | 48 ++++++++++++++++---- 2 files changed, 93 insertions(+), 38 deletions(-) diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c index 28a0190..3eb7a6d 100644 --- a/osm/opensm/osm_sa_multipath_record.c +++ b/osm/opensm/osm_sa_multipath_record.c @@ -615,20 +615,29 @@ __osm_mpr_rcv_get_path_parms( required_mtu = ib_multipath_rec_mtu( p_mpr ); switch ( ib_multipath_rec_mtu_sel( p_mpr ) ) { - case 0: /* must be greater than */ + case 0: /* must be greater than */ if ( mtu <= required_mtu ) status = IB_NOT_FOUND; break; - case 1: /* must be less than */ - if ( mtu >= required_mtu ) - status = IB_NOT_FOUND; - break; - - case 2: /* exact match */ - if ( mtu != required_mtu ) - status = IB_NOT_FOUND; - break; + case 1: /* must be less than */ + if( mtu >= required_mtu ) + { + /* adjust to use the highest mtu + lower then the required one */ + if (required_mtu > 1) + mtu = required_mtu - 1; + else + status = IB_NOT_FOUND; + } + break; + + case 2: /* exact match */ + if( mtu < required_mtu ) + status = IB_NOT_FOUND; + else + mtu = required_mtu; + break; case 3: /* largest available */ /* can't be disqualified by this one */ 
@@ -646,22 +655,31 @@ __osm_mpr_rcv_get_path_parms( if ( ( comp_mask & IB_MPR_COMPMASK_RATESELEC ) && ( comp_mask & IB_PR_COMPMASK_RATE ) ) { - required_rate = ib_multipath_rec_rate( p_mpr ); - switch ( ib_multipath_rec_rate_sel( p_mpr ) ) - { - case 0: /* must be greater than */ + required_rate = ib_multipath_rec_rate( p_mpr ); + switch ( ib_multipath_rec_rate_sel( p_mpr ) ) + { + case 0: /* must be greater than */ if ( rate <= required_rate ) - status = IB_NOT_FOUND; + status = IB_NOT_FOUND; break; - - case 1: /* must be less than */ - if ( rate >= required_rate ) - status = IB_NOT_FOUND; + + case 1: /* must be less than */ + if( rate >= required_rate ) + { + /* adjust the rate to use the highest rate + lower then the required one */ + if (required_rate > 2) + rate = required_rate - 1; + else + status = IB_NOT_FOUND; + } break; - - case 2: /* exact match */ - if ( rate != required_rate ) - status = IB_NOT_FOUND; + + case 2: /* exact match */ + if( rate < required_rate ) + status = IB_NOT_FOUND; + else + rate = required_rate; break; case 3: /* largest available */ @@ -697,13 +715,22 @@ __osm_mpr_rcv_get_path_parms( break; case 1: /* must be less than */ - if ( pkt_life >= required_pkt_life ) - status = IB_NOT_FOUND; - break; + if( pkt_life >= required_pkt_life ) + { + /* adjust the lifetime to use the highest possible + lower then the required one */ + if (required_pkt_life > 1) + pkt_life = required_pkt_life - 1; + else + status = IB_NOT_FOUND; + } + break; case 2: /* exact match */ - if ( pkt_life != required_pkt_life ) - status = IB_NOT_FOUND; + if( pkt_life < required_pkt_life ) + status = IB_NOT_FOUND; + else + pkt_life = required_pkt_life; break; case 3: /* smallest available */ diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index 7f4a1b6..6d2e64e 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -528,6 +528,7 @@ __osm_pr_rcv_get_path_parms( /* Determine if these values meet the user 
criteria + and adjust appropriatly */ /* we silently ignore cases where only the MTU selector is defined */ @@ -543,13 +544,22 @@ __osm_pr_rcv_get_path_parms( break; case 1: /* must be less than */ - if( mtu >= required_mtu ) - status = IB_NOT_FOUND; + if( mtu >= required_mtu ) + { + /* adjust to use the highest mtu + lower then the required one */ + if (required_mtu > 1) + mtu = required_mtu - 1; + else + status = IB_NOT_FOUND; + } break; case 2: /* exact match */ - if( mtu != required_mtu ) - status = IB_NOT_FOUND; + if( mtu < required_mtu ) + status = IB_NOT_FOUND; + else + mtu = required_mtu; break; case 3: /* largest available */ @@ -578,12 +588,21 @@ __osm_pr_rcv_get_path_parms( case 1: /* must be less than */ if( rate >= required_rate ) - status = IB_NOT_FOUND; + { + /* adjust the rate to use the highest rate + lower then the required one */ + if (required_rate > 2) + rate = required_rate - 1; + else + status = IB_NOT_FOUND; + } break; case 2: /* exact match */ - if( rate != required_rate ) - status = IB_NOT_FOUND; + if( rate < required_rate ) + status = IB_NOT_FOUND; + else + rate = required_rate; break; case 3: /* largest available */ @@ -620,12 +639,21 @@ __osm_pr_rcv_get_path_parms( case 1: /* must be less than */ if( pkt_life >= required_pkt_life ) - status = IB_NOT_FOUND; + { + /* adjust the lifetime to use the highest possible + lower then the required one */ + if (required_pkt_life > 1) + pkt_life = required_pkt_life - 1; + else + status = IB_NOT_FOUND; + } break; case 2: /* exact match */ - if( pkt_life != required_pkt_life ) - status = IB_NOT_FOUND; + if( pkt_life < required_pkt_life ) + status = IB_NOT_FOUND; + else + pkt_life = required_pkt_life; break; case 3: /* smallest available */ -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: ibisTestPathRec.tcl URL: From halr at voltaire.com Sun Dec 24 05:36:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 24 Dec 2006 08:36:23 -0500 Subject: [openib-general] tavor quirks etc (opensm compliance etc) In-Reply-To: <458E7402.4000106@mellanox.co.il> References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> Message-ID: <1166967379.4519.320031.camel@hal.voltaire.com> Hi Eitan, On Sun, 2006-12-24 at 07:35, Eitan Zahavi wrote: > Hi Or, > > Sorry it took me a while. > > According to the IBTA spec: > 1. In order for MTU and MTUSelector to have any effect their component > mask bits MUST be set to 1 in the query > 2.
Behavior of the SM is defined with small "freedom" to choose between
> multiple matching MTU values if they exist.

I agree in general but would like to be sure about the details. Please
be specific as to what IBA spec text you are referring to.

> 3. The table below summarizes all options:
>
> Assuming the value M represents the lowest MTU on the path

Is M the lowest available MTU or the highest available MTU for that path ?

> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> R represents the MTU value in the request. Similarly R-1 is one below R
> and R+1 is one above R.
>
> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
> -----------------------------------------------------------------------------------------
> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
                          ^^^^^^^^
For the R> spec response column, I think you are saying the same as:
    >R AND <=M if M>R /ERR
or
    R < x <=M if M>R /ERR
where x is resp value

I agree with this table given the redefinition of M above and R > spec
response interpretation.

-- Hal

> I have built some test code for making sure OpenSM does what is required.
> Apparently it does not. In any case the M is not identical to R it fails
> the request.
>
> I am working on fixing OpenSM.
>
> Any comments are welcome.
>
> EZ
>
> Or Gerlitz wrote:
> > Michael S. Tsirkin wrote:
> >
> >> I am not yet sure what is best for upstream, so I don't really think we need
> >> any RFCs.
> >>
> >
> >> We'll need data from SM guys on whether MTU selector actually works
> >> in SMs, and if not what happens when you enable it.
> >>
> >
> > Eitan,
> >
> > Can you please post here the tavor-quirk patch which was integrated into
> > opensm?
i can see the ***code*** of the opensm but might make some wrong
> > assumptions or get into wrong understandings as i am not able to see the
> > patch as is.
> >
> > Or.
> >
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >

From sashak at voltaire.com Sun Dec 24 09:02:48 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 24 Dec 2006 19:02:48 +0200
Subject: [openib-general] [PATCH TRIVIAL] opensm: remove unused local variable
Message-ID: <20061224170248.GA7111@sashak.voltaire.com>

Remove unused local variable.

Signed-off-by: Sasha Khapyorsky
---
 osm/opensm/osm_node.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c
index aba2e39..684eee6 100644
--- a/osm/opensm/osm_node.c
+++ b/osm/opensm/osm_node.c
@@ -97,7 +97,6 @@ osm_node_new(
   osm_node_t *p_node;
   ib_smp_t *p_smp;
   ib_node_info_t *p_ni;
-  uint8_t port_num;
   uint8_t i;
   uint32_t size;
 
@@ -108,7 +107,6 @@ osm_node_new(
   CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_NODE_INFO );
 
   p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp );
-  port_num = ib_node_info_get_local_port_num( p_ni );
 
   /*
     The node object already contains one physical port object.
--
1.4.4.2.gfc82d

From sashak at voltaire.com Sun Dec 24 09:03:29 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 24 Dec 2006 19:03:29 +0200
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
Message-ID: <20061224170329.GB7111@sashak.voltaire.com>

When the port is removed from subnet, but previously requested pkey table
block is received after this - the lock will be released twice. This leads
to deadlocks later when other MAD processor will try to acquire the same
lock.

Signed-off-by: Sasha Khapyorsky
---
 osm/opensm/osm_pkey_rcv.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c
index 3fc7673..3c18fcd 100644
--- a/osm/opensm/osm_pkey_rcv.c
+++ b/osm/opensm/osm_pkey_rcv.c
@@ -146,7 +146,6 @@ osm_pkey_rcv_process(
 
   if( p_port == (osm_port_t*)cl_qmap_end( p_guid_tbl) )
   {
-    cl_plock_release( p_rcv->p_lock );
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
              "osm_pkey_rcv_process: ERR 4806: "
              "No port object for port with GUID 0x%" PRIx64
@@ -219,4 +218,3 @@ osm_pkey_rcv_process(
 
   OSM_LOG_EXIT( p_rcv->p_log );
 }
-
--
1.4.4.2.gfc82d

From sashak at voltaire.com Sun Dec 24 09:43:15 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 24 Dec 2006 19:43:15 +0200
Subject: [openib-general] [PATCH] opensm: clean old references on ports linking
Message-ID: <20061224174315.GC7111@sashak.voltaire.com>

When linking ports, cleanup old remote references. Without it the ports
still be accessible as "linked" from old neighbors and in case of ports
moving, when some MADs can be lost or reordered, OpenSM subnet data
structures become broken.

Signed-off-by: Sasha Khapyorsky
---
 osm/opensm/osm_node.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c
index 684eee6..6e72b58 100644
--- a/osm/opensm/osm_node.c
+++ b/osm/opensm/osm_node.c
@@ -195,6 +195,11 @@ osm_node_link(
   p_remote_physp = osm_node_get_physp_ptr( p_remote_node,
                                            remote_port_num );
 
+  if (p_physp->p_remote_physp)
+    p_physp->p_remote_physp->p_remote_physp = NULL;
+  if (p_remote_physp->p_remote_physp)
+    p_remote_physp->p_remote_physp->p_remote_physp = NULL;
+
   osm_physp_link( p_physp, p_remote_physp );
 }
 
--
1.4.4.2.gfc82d

From eitan at mellanox.co.il Sun Dec 24 10:39:06 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 20:39:06 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <1166967379.4519.320031.camel@hal.voltaire.com>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <1166967379.4519.320031.camel@hal.voltaire.com>
Message-ID: <458EC94A.2050808@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sun, 2006-12-24 at 07:35, Eitan Zahavi wrote:
>
>> Hi Or,
>>
>> Sorry it took me a while.
>>
>> According to the IBTA spec:
>> 1. In order for MTU and MTUSelector to have any effect their component
>> mask bits MUST be set to 1 in the query
>> 2. Behavior of the SM is defined with small "freedom" to choose between
>> multiple matching MTU values if they exist.
>>
>
> I agree in general but would like to be sure about the details. Please
> be specific as to what IBA spec text you are referring to.
>
The text is part of the PathRecord table.
>
>> 3. The table below summarizes all options:
>>
>> Assuming the value M represents the lowest MTU on the path
>>
>
> Is M the lowest available MTU or the highest available MTU for that path
> ?
>
M is the lowest MTU reported by all PortInfo for ports on the path.
>
>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>> R represents the MTU value in the request. Similarly R-1 is one below R
>> and R+1 is one above R.
>>
>> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>> -----------------------------------------------------------------------------------------
>> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
>> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
>> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
>> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
>>
>                           ^^^^^^^^
> For the R> spec response column, I think you are saying the same as:
>     >R AND <=M if M>R /ERR
> or
>     R < x <=M if M>R /ERR
> where x is resp value
>
Yes, that is what I mean: the response value MUST be both greater than R
and less than or equal to M. Otherwise it is an error.
> I agree with this table given the redefinition of M above and R > spec
> response interpretation.
>
Good.
> -- Hal
>
>> I have built some test code for making sure OpenSM does what is required.
>> Apparently it does not. In any case the M is not identical to R it fails
>> the request.
>>
>> I am working on fixing OpenSM.
>>
>> Any comments are welcome.
>>
>> EZ
>>
>> Or Gerlitz wrote:
>>
>>> Michael S. Tsirkin wrote:
>>>
>>>> I am not yet sure what is best for upstream, so I don't really think we need
>>>> any RFCs.
>>>>
>>>
>>>> We'll need data from SM guys on whether MTU selector actually works
>>>> in SMs, and if not what happens when you enable it.
>>>>
>>> Eitan,
>>>
>>> Can you please post here the tavor-quirk patch which was integrated into
>>> opensm? i can see the ***code*** of the opensm but might make some wrong
>>> assumptions or get into wrong understandings as i am not able to see the
>>> patch as is.
>>>
>>> Or.
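
As a reading aid for the "OpenSM Should" column of the table in this thread, here is a minimal C sketch of the MTU choice. It is not OpenSM code: pick_path_mtu, its parameters, and the -1 error return are hypothetical names introduced only for illustration. It assumes MTU values are the usual IB codes (1 = 256 up to 5 = 4096) and that the selector encodings follow the case labels in the patch (1 = less than, 2 = exact, 3 = largest), with 0 taken to mean "greater than". The Tavor quirk column is left out.

```c
#include <assert.h>

/* IB MTU codes: 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes. */
enum { MTU_CODE_MIN = 1, MTU_CODE_MAX = 5 };

/* Selector encodings; values 1..3 match the case labels in the patch,
 * 0 = "greater than" is assumed here. */
enum mtu_selector {
    SEL_GREATER = 0,
    SEL_LESS    = 1,
    SEL_EXACT   = 2,
    SEL_LARGEST = 3
};

/*
 * Hypothetical helper (not an OpenSM function).
 *   path_mtu     : M, the lowest MTU code along the path
 *   have_request : non-zero if the MTU and MTUSelector mask bits are set
 *   selector     : one of enum mtu_selector
 *   req_mtu      : R, the MTU code in the query
 * Returns the MTU code to report, or -1 where the table says ERR.
 */
static int pick_path_mtu(int path_mtu, int have_request,
                         int selector, int req_mtu)
{
    if (!have_request)
        return path_mtu;                 /* UNDEFINED row: answer M */

    switch (selector) {
    case SEL_GREATER:                    /* R+1 if M > R, else ERR */
        return (path_mtu > req_mtu) ? req_mtu + 1 : -1;
    case SEL_LESS:                       /* min(R-1, M) */
        if (req_mtu <= MTU_CODE_MIN)
            return -1;                   /* nothing lies below R */
        return (req_mtu - 1 < path_mtu) ? req_mtu - 1 : path_mtu;
    case SEL_EXACT:                      /* R if M >= R, else ERR */
        return (path_mtu >= req_mtu) ? req_mtu : -1;
    case SEL_LARGEST:
    default:
        return path_mtu;                 /* largest available is M */
    }
}
```

With M = 4 (2K), a "less than 4K" query (R = 5) would yield 4 under this sketch, and an "exactly 1K" query (R = 3) would yield 3 rather than an error, which is the behaviour the table asks for.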
>>>
>>> _______________________________________________
>>> openib-general mailing list
>>> openib-general at openib.org
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>>>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From mst at mellanox.co.il Sun Dec 24 13:21:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 24 Dec 2006 23:21:13 +0200
Subject: [openib-general] [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
In-Reply-To:
References:
Message-ID: <20061224212113.GA31813@mellanox.co.il>

> Quoting Linus Torvalds :
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition
> is a bug:
>
> - shared mapping
> - writable
> - not already marked dirty in the PTE
>
> because that combination means that the hardware can mark the PTE dirty
> without us even realizing (and thus not marking the "struct page *"
> dirty).

Er. Sorry about bumping in, and I'm not sure I understand all of the
discussion, but this reminded me of an old issue with COW that created what
looks like a vaguely similar data corruption on infiniband.

We solved this for infiniband with MADV_DONTFORK, but I always wondered why
it does not affect other parts of the kernel.

Small reminder from that discussion:

    down mmap sem
    get user pages
    up mmap sem

    page becomes shared, and COW (e.g. fork)
    process writes to first byte of page <----- gets a copy

Now we had a problem: struct page that we got from get user pages does not
point to a correct page in our process.
For example: if at some point we map this page for DMA, and
    hardware writes to last byte of page -----> process does not see this data.

So for infiniband, what we do is a combination of
- prevent page from becoming COW while hardware might DMA to this page, and
- ask users not to write to page if hardware might DMA to same page (even if
  it's using different bytes).

I just wondered - is there some chance something like this could be happening
in the fs code?

HTH,
-- MST

From eitan at sw053.yok.mtl.com Sun Dec 24 22:26:04 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Mon, 25 Dec 2006 08:26:04 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-25:normal completion
Message-ID: <200612250626.kBP6Q4Sp025341@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Sun_Dec_24_08:19:04_2006 ef4b40
ibutils rev = Sat_Dec_23_17:47:24_2006 2057e4 MOD_FILES=2
Total=216 Pass=215 Fail=1

Pass:
24 Stability IS1-16.topo
24 Pkey IS1-16.topo
24 OsmTest IS1-16.topo
24 Multicast IS1-16.topo
24 LidMgr IS1-16.topo
23 OsmStress IS1-16.topo
8 Stability IS3-loop.topo
8 Stability IS3-128.topo
8 Pkey IS3-128.topo
8 OsmTest IS3-loop.topo
8 OsmTest IS3-128.topo
8 OsmStress IS3-128.topo
8 Multicast IS3-loop.topo
8 Multicast IS3-128.topo
8 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo

From eitan at mellanox.co.il Sun Dec 24 22:32:44 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 25 Dec 2006 08:32:44 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-25:normal completion
In-Reply-To: <200612250626.kBP6Q4Sp025341@sw053.yok.mtl.com>
References: <200612250626.kBP6Q4Sp025341@sw053.yok.mtl.com>
Message-ID: <458F708C.9070008@mellanox.co.il>

The failed run is caused by a deadlock in the simulator that prevents
ibdiagnet from starting.
I'm still looking for that deadlock.

Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Sun_Dec_24_08:19:04_2006 ef4b40
> ibutils rev = Sat_Dec_23_17:47:24_2006 2057e4 MOD_FILES=2
> Total=216 Pass=215 Fail=1
>
> Pass:
> 24 Stability IS1-16.topo
> 24 Pkey IS1-16.topo
> 24 OsmTest IS1-16.topo
> 24 Multicast IS1-16.topo
> 24 LidMgr IS1-16.topo
> 23 OsmStress IS1-16.topo
> 8 Stability IS3-loop.topo
> 8 Stability IS3-128.topo
> 8 Pkey IS3-128.topo
> 8 OsmTest IS3-loop.topo
> 8 OsmTest IS3-128.topo
> 8 OsmStress IS3-128.topo
> 8 Multicast IS3-loop.topo
> 8 Multicast IS3-128.topo
> 8 LidMgr IS3-128.topo
>
> Failures:
> 1 OsmStress IS1-16.topo
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From jackm at dev.mellanox.co.il Sun Dec 24 23:24:52 2006
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 25 Dec 2006 09:24:52 +0200
Subject: [openib-general] [PATCH] mthca: fix PRM compliance problem in atomic-send completions
Message-ID: <200612250924.52746.jackm@dev.mellanox.co.il>

For Atomic-Send completions, according to the Tavor and Arbel PRMs, the
number of bytes transferred is not provided in the byte_cnt field of the
cqe.

For atomic operations, the number of bytes transferred is always 8 (when
the status is "success"), and this value should be inserted by the driver
in the ib_wc entry returned to the poller.

Signed-off-by: Jack Morgenstein

diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 8293c15..8fd0e9e 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -54,6 +54,10 @@ enum {
 	MTHCA_CQ_ENTRY_SIZE = 0x20
 };
 
+enum {
+	MTHCA_ATOMIC_BYTE_LEN = 8
+};
+
 /*
  * Must be packed because start is 64 bits but only aligned to 32 bits.
  */
@@ -598,11 +602,11 @@ static inline int mthca_poll_one(struct mthca_dev *dev,
 			break;
 		case MTHCA_OPCODE_ATOMIC_CS:
 			entry->opcode = IB_WC_COMP_SWAP;
-			entry->byte_len = be32_to_cpu(cqe->byte_cnt);
+			entry->byte_len = MTHCA_ATOMIC_BYTE_LEN;
 			break;
 		case MTHCA_OPCODE_ATOMIC_FA:
 			entry->opcode = IB_WC_FETCH_ADD;
-			entry->byte_len = be32_to_cpu(cqe->byte_cnt);
+			entry->byte_len = MTHCA_ATOMIC_BYTE_LEN;
 			break;
 		case MTHCA_OPCODE_BIND_MW:
 			entry->opcode = IB_WC_BIND_MW;

From ogerlitz at voltaire.com Mon Dec 25 04:03:47 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 25 Dec 2006 14:03:47 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458E7402.4000106@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il>
Message-ID: <458FBE23.4090100@voltaire.com>

Eitan Zahavi wrote:
> According to the IBTA spec:
> 1. In order for MTU and MTUSelector to have any effect their component
> mask bits MUST be set to 1 in the query
> 2. Behavior of the SM is defined with small "freedom" to choose between
> multiple matching MTU values if they exist.
> 3. The table below summarizes all options:
>
> Assuming the value M represents the lowest MTU on the path
> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> R represents the MTU value in the request. Similarly R-1 is one below R
> and R+1 is one above R.
>
> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
> -----------------------------------------------------------------------------------------
>
> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
>
> I have built some test code for making sure OpenSM does what is required.
> Apparently it does not.
In any case the M is not identical to R it fails
> the request.
>
> I am working on fixing OpenSM.
>
> Any comments are welcome.

OK Eitan, thanks for putting time into this; we will be able to provide
feedback tomorrow or later this week. Is the direction suggested by this
matrix and the patch that follows orthogonal to the open-sm tavor quirk
patch?

Or.

From ogerlitz at voltaire.com Mon Dec 25 04:25:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 25 Dec 2006 14:25:22 +0200
Subject: [openib-general] No resource tracking per qp for multicast groups
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
Message-ID: <458FC332.1010801@voltaire.com>

Jack Morgenstein wrote:
> I noticed that there is no per-qp tracking of multicast groups of which
> it is a member. Thus, for example, if a user-space app dies without a
> chance to perform its cleanup, the (nonexistent) QP will still be listed
> in the HCA firmware as a member of its multicast groups. This has 2
> effects:
>
> 1. The effective number of qp's which can join that multicast group has
> been reduced (since it has zombie entries).
> 2. If the above QP gets re-used, it will still be a member of the
> multicast groups (and therefore receive packets which were not intended
> for the new user of the QP).
>
> I suggest tracking mcast group membership in kernel-space only. If we
> don't wish to change the verbs layer behavior, we can just detach a qp
> from all its multicast groups (if any) in ib_destroy_qp (although this
> is not IB compliant -- see IB Spec 11.2.4.4 (we should return an error
> in this case)). Otherwise, I think we'll need something messy (such as
> an ib_verbs layer function requesting a QP to detach from all its
> multicast groups.
>
> My preference is to leave the verbs layer alone as much as possible.
> Track the multicast group membership per qp (gid and lid) in struct
> ib_qp, and make calls in ib_destroy_qp() to ib_mcast_detach().

Jack,

I have just come across this Nov 2005 post of yours... Per my understanding
the issues you describe here are orthogonal to Sean's multicast work,
correct? Were they solved in mthca, or are they still open?

Or.

From mst at mellanox.co.il Mon Dec 25 06:18:49 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 25 Dec 2006 16:18:49 +0200
Subject: [openib-general] [PATCH/RFC] libibverbs: Improve driver loading
In-Reply-To:
References:
Message-ID: <20061225141849.GI842@mellanox.co.il>

> OK, at last here is a patch that implements the improvements to
> libibverbs driver loading that we discussed back in October.
>
> With this patch, instead of trying all the .so files in the
> $(libdir)/infiniband directory as libibverbs 1.0 does, libibverbs
> instead builds a list of drivers to load and dlopen() exactly that
> list of libraries. It uses relative paths rather than absolute paths,
> so the linker uses the normal search path to find driver libraries.
>
> (To get a list of drivers, libibverbs parses all the config files it
> finds in $(sysconfdir)/libibverbs.d and also looks at the environment
> variables RDMAV_DRIVERS and IBV_DRIVERS)
>
> Then, instead of calling a specific entry point in the driver,
> libibverbs assumes the driver will call ibv_register_driver() from an
> __attribute__((constructor)) function.
>
> This has a number of benefits:
> - multiple drivers can be linked statically into an executable
> - LD_LIBRARY_PATH can be used to manage which drivers to load
> - different versions of the driver can be selected automagically at
>   runtime (eg i686/cmov on i386 distros)
>
> I will post a libmthca patch to illustrate how driver libraries need
> to change to work with this new libibverbs method.

I think this looked good, and probably best to do before the next major
release. Do you plan to merge this?
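
For readers unfamiliar with the constructor-based registration described in the quoted text, here is a minimal, self-contained C sketch of the general pattern. It does not use the libibverbs API: demo_register_driver, demo_init_func, the demo_drivers table and the "mthca" stand-in are all hypothetical names made up for illustration, and the real ibv_register_driver() may differ in signature and behaviour. It assumes a GCC-compatible compiler for __attribute__((constructor)).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the library side; not the libibverbs API. */
typedef int (*demo_init_func)(const char *device_path);

struct demo_driver {
    const char     *name;
    demo_init_func  init;
};

#define DEMO_MAX_DRIVERS 8
static struct demo_driver demo_drivers[DEMO_MAX_DRIVERS];
static size_t demo_driver_count;

/* Library side: remember every driver that announces itself. */
static void demo_register_driver(const char *name, demo_init_func init)
{
    if (demo_driver_count < DEMO_MAX_DRIVERS) {
        demo_drivers[demo_driver_count].name = name;
        demo_drivers[demo_driver_count].init = init;
        demo_driver_count++;
    }
}

/* Driver side: a made-up init routine that accepts any non-NULL path. */
static int demo_mthca_init(const char *device_path)
{
    return device_path != NULL;
}

/*
 * Driver side: the constructor runs when the object is loaded, either
 * before main() for a statically linked driver or during dlopen() for
 * a shared one, so the library never needs to look up an entry point.
 */
static __attribute__((constructor)) void demo_mthca_register(void)
{
    demo_register_driver("mthca", demo_mthca_init);
}
```

Because each driver registers itself on load, the same object can be linked statically or opened with dlopen() without the library needing to know a per-driver symbol name, which is the property the quoted list of benefits relies on.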

-- MST

From yosefe at voltaire.com Mon Dec 25 07:29:43 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 25 Dec 2006 17:29:43 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
Message-ID: <458FEE67.2080003@voltaire.com>

Hello,
I've been testing the ofed 1.2 build from
http://staging.openfabrics.org/builds/
(latest.tgz versions, both user and kernel) and got compilation errors on
ia64 and ppc64:

*ppc64:*

    make -w -C ip ip
    make[2]: Entering directory
    `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
    [ ... omitted text ... ]
    gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
    -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
    gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
    rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
    ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
    xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
    -lresolv -L../lib -lnetlink -lutil -o ip
    /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
    searching for -lnetlink
    /usr/bin/ld: skipping incompatible
    /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
    searching for -lnetlink
    /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
    searching for -lnetlink
    /usr/bin/ld: cannot find -lnetlink
    collect2: ld returned 1 exit status
    make[2]: *** [ip] Error 1

possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides
CFLAGS (= instead of +=)

*ia64:*

    make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
    obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
    gcc [ ... omitted text ...
] -c -o /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37, from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38: /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function ‘ib_sg_dma_address’: /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error: implicit declaration of function ‘sg_dma_address’ /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function ‘ib_sg_dma_len’: /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error: implicit declaration of function ‘sg_dma_len’ /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level: /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning: initialization from incompatible pointer type [ ... omitted text ... ] make: *** [kernel] Error 2 Yossi From mst at mellanox.co.il Mon Dec 25 07:46:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 25 Dec 2006 17:46:54 +0200 Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64 In-Reply-To: <458FEE67.2080003@voltaire.com> References: <458FEE67.2080003@voltaire.com> Message-ID: <20061225154654.GG4741@mellanox.co.il> > Quoting r. Yosef Etigin : > Subject: ofed 1.2 - compilation erros on ppc64 and ia64 Which distro are you testing on? > Hello, > I've been testing ofed 1.2 build from > http://staging.openfabrics.org/builds/ > , (latest.tgz versions both user > and kernel) and got compilation erros on: ia64, ppc64: > > *ppc64:* > > make -w -C ip ip > make[2]: Entering directory > `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip' > [ ... omitted text ... 
] > gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include > -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c > gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o > rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o > ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o > xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a > -lresolv -L../lib -lnetlink -lutil -o ip > /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when > searching for -lnetlink > /usr/bin/ld: skipping incompatible > /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when > searching for -lnetlink > /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when > searching for -lnetlink > /usr/bin/ld: cannot find -lnetlink > collect2: ld returned 1 exit status > make[2]: *** [ip] Error 1 > > possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides > CFLAGS (= instead of +=) Isn't this makefile part of iproute2? Can you build iproute on this platform? > *ia64:* > > make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build > obj=/tmp/openib_gen2/kernel/drivers/infiniband/core > gcc [ ... omitted text ... 
] -c -o > /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o > /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c > In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37, > from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38: > /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function > ‘ib_sg_dma_address’: > /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error: > implicit declaration of function ‘sg_dma_address’ > /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function > ‘ib_sg_dma_len’: > /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error: > implicit declaration of function ‘sg_dma_len’ > /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level: > /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning: > initialization from incompatible pointer type > [ ... omitted text ... ] > make: *** [kernel] Error 2 Probably a distro-specific backport problem - check how come sg_dma_len is not defined. I see this on upstream 2.6.16 asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length) -- MST From yosefe at voltaire.com Mon Dec 25 08:11:02 2006 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 25 Dec 2006 18:11:02 +0200 Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64 In-Reply-To: <20061225154654.GG4741@mellanox.co.il> References: <458FEE67.2080003@voltaire.com> <20061225154654.GG4741@mellanox.co.il> Message-ID: <458FF816.3010800@voltaire.com> Michael S. Tsirkin wrote: >>Quoting r. Yosef Etigin : >>Subject: ofed 1.2 - compilation erros on ppc64 and ia64 >> >> > >Which distro are you testing on? > > > I am testing on sles10, both ia64 and ppc64. 
>>Hello, >>I've been testing ofed 1.2 build from >>http://staging.openfabrics.org/builds/ >>, (latest.tgz versions both user >>and kernel) and got compilation erros on: ia64, ppc64: >> >>*ppc64:* >> >> make -w -C ip ip >> make[2]: Entering directory >> `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip' >> [ ... omitted text ... ] >> gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include >> -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c >> gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o >> rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o >> ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o >> xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a >> -lresolv -L../lib -lnetlink -lutil -o ip >> /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when >> searching for -lnetlink >> /usr/bin/ld: skipping incompatible >> /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when >> searching for -lnetlink >> /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when >> searching for -lnetlink >> /usr/bin/ld: cannot find -lnetlink >> collect2: ld returned 1 exit status >> make[2]: *** [ip] Error 1 >> >>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides >>CFLAGS (= instead of +=) >> >> > >Isn't this makefile part of iproute2? >Can you build iproute on this platform? > > This makefile is indeed of iproute, but it seems to make 32-bit object files for `iproute' during compilation and therefore fails to find 64-bit during linkage of `ip'. > > > >>*ia64:* >> >> make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build >> obj=/tmp/openib_gen2/kernel/drivers/infiniband/core >> gcc [ ... omitted text ... 
] -c -o
>> /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
>> In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
>> from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>> ‘ib_sg_dma_address’:
>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
>> implicit declaration of function ‘sg_dma_address’
>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>> ‘ib_sg_dma_len’:
>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
>> implicit declaration of function ‘sg_dma_len’
>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
>> initialization from incompatible pointer type
>> [ ... omitted text ... ]
>> make: *** [kernel] Error 2
>>
>
>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
>I see this on upstream 2.6.16
> asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
>

I'm running this on ia64; `sg_dma_len' is not defined there, nor anywhere
else in this file, but in:
./asm-ia64/pci.h:82:#define sg_dma_len(sg) ((sg)->dma_length)

Yossi

From mst at mellanox.co.il Mon Dec 25 08:24:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 25 Dec 2006 18:24:23 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <458FF816.3010800@voltaire.com>
References: <458FF816.3010800@voltaire.com>
Message-ID: <20061225162423.GI4741@mellanox.co.il>

Mutt Label Removed By VIM
Quoting r. Yosef Etigin :
Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64

Michael S. Tsirkin wrote:
> >>Quoting r. Yosef Etigin :
> >>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
> >>
> >
> >Which distro are you testing on?
> >
>
> I am testing on sles10, both ia64 and ppc64.
> > >>Hello, > >>I've been testing ofed 1.2 build from > >>http://staging.openfabrics.org/builds/ > >>, (latest.tgz versions both user > >>and kernel) and got compilation erros on: ia64, ppc64: > >> > >>*ppc64:* > >> > >> make -w -C ip ip > >> make[2]: Entering directory > >> `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip' > >> [ ... omitted text ... ] > >> gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include > >> -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c > >> gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o > >> rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o > >> ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o > >> xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a > >> -lresolv -L../lib -lnetlink -lutil -o ip > >> /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when > >> searching for -lnetlink > >> /usr/bin/ld: skipping incompatible > >> /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when > >> searching for -lnetlink > >> /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when > >> searching for -lnetlink > >> /usr/bin/ld: cannot find -lnetlink > >> collect2: ld returned 1 exit status > >> make[2]: *** [ip] Error 1 > >> > >>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides > >>CFLAGS (= instead of +=) > >> > >> > > > >Isn't this makefile part of iproute2? > >Can you build iproute on this platform? > > > > > This makefile is indeed of iproute, > but it seems to make 32-bit object files for `iproute' during compilation > and therefore fails to find 64-bit during linkage of `ip'. Will installing the 32 bit version of the library help? > > > > > > > >>*ia64:* > >> > >> make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build > >> obj=/tmp/openib_gen2/kernel/drivers/infiniband/core > >> gcc [ ... omitted text ... 
] -c -o
> >> /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
> >> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
> >> In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
> >> from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
> >> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >> ‘ib_sg_dma_address’:
> >> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
> >> implicit declaration of function ‘sg_dma_address’
> >> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >> ‘ib_sg_dma_len’:
> >> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
> >> implicit declaration of function ‘sg_dma_len’
> >> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
> >> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
> >> initialization from incompatible pointer type
> >> [ ... omitted text ... ]
> >> make: *** [kernel] Error 2
> >>
> >>
> >
> >Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
> >I see this on upstream 2.6.16
> > asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
> >
> >
> Im running this of ia64, `sg_dma_len' is not defined there, nor anywhere
> else in this file, but in:
> ./asm-ia64/pci.h:82:#define sg_dma_len(sg) ((sg)->dma_length)
>

I see, it's fixed in 2.6.20.
Need to do something about it in the backport then.

I wonder whether we can just put
#ifdef __ia64__
#define sg_dma_len(sg) ((sg)->dma_length)
#endif

in kernel_addons/backports/2.6.16/include/asm/scatterlist.h

Also need to find out in which kernel this was fixed.
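
A minimal user-space sketch of the fallback discussed above, for trying the idea outside the kernel tree. It is not the backport itself: `struct mock_scatterlist` and `total_dma_len` are hypothetical names made up for this example, the struct models only the two fields the accessors read, and `#ifndef` guards are swapped in for the `#ifdef __ia64__` test so that `sg_dma_address`, which the build log also reports as missing, is covered the same way.

```c
/* Hypothetical stand-in for the kernel's struct scatterlist. Only the two
 * fields read by the accessors are modelled; the real layout differs. */
struct mock_scatterlist {
    unsigned long dma_address;  /* bus address of the mapped segment */
    unsigned int  dma_length;   /* length of the mapped segment in bytes */
};

/* Define each accessor only if nothing included earlier already did,
 * so the fallback stays inert where the arch headers provide it. */
#ifndef sg_dma_address
#define sg_dma_address(sg) ((sg)->dma_address)
#endif

#ifndef sg_dma_len
#define sg_dma_len(sg) ((sg)->dma_length)
#endif

/* Illustrative helper (not a kernel function): sum of mapped lengths. */
static unsigned long total_dma_len(const struct mock_scatterlist *sg, int n)
{
    unsigned long total = 0;
    int i;

    for (i = 0; i < n; i++)
        total += sg_dma_len(&sg[i]);
    return total;
}
```

Guarding on the macro name rather than on the architecture is only one option; whether it suits the real backport header depends on include order there, which this sketch does not model.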
-- MST From yosefe at voltaire.com Mon Dec 25 09:49:57 2006 From: yosefe at voltaire.com (Yosef Etigin) Date: Mon, 25 Dec 2006 19:49:57 +0200 Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64 In-Reply-To: <20061225162423.GI4741@mellanox.co.il> References: <458FF816.3010800@voltaire.com> <20061225162423.GI4741@mellanox.co.il> Message-ID: <45900F45.50906@voltaire.com> Michael S. Tsirkin wrote: > Mutt Label Removed By VIM > Quoting r. Yosef Etigin : > Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64 > > Michael S. Tsirkin wrote: > > >>>>Quoting r. Yosef Etigin : >>>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64 >>>> >>>> >>> >>>Which distro are you testing on? >>> >>> >>> >> >>I am testing on sles10, both ia64 and ppc64. >> >> >>>>Hello, >>>>I've been testing ofed 1.2 build from >>>>http://staging.openfabrics.org/builds/ >>>>, (latest.tgz versions both user >>>>and kernel) and got compilation erros on: ia64, ppc64: >>>> >>>>*ppc64:* >>>> >>>> make -w -C ip ip >>>> make[2]: Entering directory >>>> `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip' >>>> [ ... omitted text ... 
]
>>>> gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
>>>> -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
>>>> gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
>>>> rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
>>>> ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
>>>> xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
>>>> -lresolv -L../lib -lnetlink -lutil -o ip
>>>> /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
>>>> searching for -lnetlink
>>>> /usr/bin/ld: skipping incompatible
>>>> /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
>>>> searching for -lnetlink
>>>> /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
>>>> searching for -lnetlink
>>>> /usr/bin/ld: cannot find -lnetlink
>>>> collect2: ld returned 1 exit status
>>>> make[2]: *** [ip] Error 1
>>>>
>>>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides
>>>>CFLAGS (= instead of +=)
>>>>
>>>>
>>>
>>>Isn't this makefile part of iproute2?
>>>Can you build iproute on this platform?
>>>
>>>
>>
>>This makefile is indeed of iproute,
>>but it seems to make 32-bit object files for `iproute' during compilation
>>and therefore fails to find 64-bit during linkage of `ip'.
>
>
> Will installing the 32 bit version of the library help?
>
>

I don't think so. The issue arose during compilation, since `iproute'
was inconsistent in its use of -m64:
the iproute Makefile overrides any `CFLAGS' it might get from the top level,
thus throwing `-m64' away, while LDFLAGS are not overridden.
Therefore, compilation is done in 32-bit while linking is done in 64-bit.

>>>
>>>
>>>
>>>
>>>>*ia64:*
>>>>
>>>> make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
>>>> obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
>>>> gcc [ ... omitted text ...
] -c -o
>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
>>>> In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
>>>> from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>>> ‘ib_sg_dma_address’:
>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
>>>> implicit declaration of function ‘sg_dma_address’
>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>>> ‘ib_sg_dma_len’:
>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
>>>> implicit declaration of function ‘sg_dma_len’
>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
>>>> initialization from incompatible pointer type
>>>> [ ... omitted text ... ]
>>>> make: *** [kernel] Error 2
>>>>
>>>>
>>>
>>>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
>>>I see this on upstream 2.6.16
>>> asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
>>>
>>>
>>
>>Im running this of ia64, `sg_dma_len' is not defined there, nor anywhere
>>else in this file, but in:
>> ./asm-ia64/pci.h:82:#define sg_dma_len(sg) ((sg)->dma_length)
>>
>
>
> Isee, its fixed on 2.6.20.
> Need to do something about it in the backport then.
>
> I wonder whether we can just put
> #ifdef __ia64__
> #define sg_dma_len(sg) ((sg)->dma_length)
> #endif
>
> in kernel_addons/backports/2.6.16/include/asm/scatterlist.h
>
> Also need tofind out in which kernel this was fixed.
>

Looks like in all kernels up to 2.6.20 it was in `pci.h', so we need to
backport to all previous versions.

Yossi

From eitan at mellanox.co.il Mon Dec 25 11:51:33 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 25 Dec 2006 21:51:33 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458FBE23.4090100@voltaire.com>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <458FBE23.4090100@voltaire.com>
Message-ID: <45902BC5.4070107@mellanox.co.il>

Or Gerlitz wrote:
> Eitan Zahavi wrote:
>
>> According to the IBTA spec:
>> 1. In order for MTU and MTUSelector to have any effect their component
>> mask bits MUST be set to 1 in the query
>> 2. Behavior of the SM is defined with small "freedom" to choose between
>> multiple matching MTU values if they exist.
>> 3. The table below summarizes all options:
>>
>> Assuming the value M represents the lowest MTU on the path
>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>> R represents the MTU value in the request. Similarly R-1 is one below R
>> and R+1 is one above R.
>>
>> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>> ----------+-----------+----------------+-----------------+-------------------------------
>> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
>> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
>> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
>> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
>>
>> I have built some test code for making sure OpenSM does what is required.
>> Apparently it does not. In any case the M is not identical to R it fails
>> the request.
>>
>> I am working on fixing OpenSM.
>>
>> Any comments are welcome.
>>
>
> OK Eitan, thanks for putting the time on this, we will be able to
> provide feedback tomorrow or later this week.
>
> Is the direction suggested by this matrix and patch that follows
> orthogonal to the open-sm tavor quirk patch?
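
The selection rules in the matrix quoted above can be written down as a small function. This is a sketch under stated assumptions, not OpenSM code: `select_path_mtu`, `NO_PATH` and the enum names are hypothetical, it follows the "OpenSM Should" and "OpenSM Quirk" columns as given, and the "largest available" selector, which has no row in the matrix, is assumed here to return M.

```c
/* MTU codes as carried in the PathRecord MTU field. */
enum { MTU_256 = 1, MTU_512 = 2, MTU_1K = 3, MTU_2K = 4, MTU_4K = 5 };

/* MTUSelector values. */
enum { SEL_GREATER = 0, SEL_LESS = 1, SEL_EQUAL = 2, SEL_LARGEST = 3 };

#define NO_PATH 0  /* hypothetical sentinel: 0 is not a valid MTU code */

static int min2(int a, int b) { return a < b ? a : b; }

/*
 * path_mtu : M, the lowest MTU along the path
 * has_req  : non-zero if the MTU and MTUSelector component-mask bits are set
 * sel, req : selector and requested MTU code R (ignored if !has_req)
 * tavor    : non-zero if the quirk column applies
 */
int select_path_mtu(int path_mtu, int has_req, int sel, int req, int tavor)
{
    /* The quirk column caps the "undefined" and "<" rows at 1K. */
    int cap = tavor ? min2(path_mtu, MTU_1K) : path_mtu;
    int m;

    if (!has_req)
        return cap;

    switch (sel) {
    case SEL_LESS:
        m = min2(req - 1, cap);
        return m >= MTU_256 ? m : NO_PATH;  /* nothing below 256 */
    case SEL_EQUAL:
        return path_mtu >= req ? req : NO_PATH;
    case SEL_GREATER:
        return path_mtu > req ? req + 1 : NO_PATH;
    case SEL_LARGEST:
        /* Not a row in the matrix; assumed to yield M. */
        return path_mtu;
    default:
        return NO_PATH;
    }
}
```

For a path with M = 2K, an unconstrained query yields 2K, or 1K with the quirk; "less than 256" has no valid answer and yields `NO_PATH`.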

The table above has a column named "OpenSM Quirk" which describes the
expected result of the tavor quirk patch. If that is not the outcome of
that patch, it should be fixed.
I am not proposing a new type of behavior - just to fix the existing one.

> Or.
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From hudaslalt at millic.com.ar Mon Dec 25 14:02:58 2006
From: hudaslalt at millic.com.ar (Diana Furtado)
Date: Mon, 25 Dec 2006 21:02:58 -0100
Subject: [openib-general] Have a merry Christmas
Message-ID: <9457164.6GG63504j@ciudad.com.ar>

We have reached the end of the year.
It is time to reflect, to take stock of our professional achievements
and of our goals and objectives.

Did you manage to meet your work goals this year?
If the answer is NO... that is not surprising... few people manage it
under the traditional employment model.

Do you feel there is a ceiling in your job above which you will never be
able to grow? It happens often.
Are your vacation days imposed on you? It tends to happen.
Are you not spending as much time as you would like with your family?
It is common.

I suffered all of those things, and many more, until I said ENOUGH.
I started looking for an alternative way of working. My search was not
easy, but I managed to find a serious company that allowed me to fire my
boss, work from home and spend more time with my son.
I no longer live worrying about whether I will still have my job next
month or whether my EX-boss likes me or not.
I no longer have to take 2 buses there and 2 back every day.
Only now do I realize how much of my life I was wasting on commuting.
Now my workplace is in my home. Great, isn't it?

My life changed radically in just 10 months, because I earn almost double
what I did working for an employer (and I work half the hours I used to).
If you are going through what I went through, I can help by showing you
what I do.
Who said all is lost?
I made the change at the beginning of 2006 and I am telling you about my
experience.
2007 can mark your change: send me an email at
produccion_en_argent at fullzero.com.ar and put the phrase
" quie-ro mas infor-macion" in the subject line.

I wish you a happy new year.
Diana Furtado

If you know someone who might be interested, forward this email to them.
A big hug to you and your family.

From mst at mellanox.co.il Mon Dec 25 15:00:51 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 01:00:51 +0200
Subject: [openib-general] ib_dma_addr_t
Message-ID: <20061225230051.GG17469@mellanox.co.il>

I'd like to propose that we introduce ib_dma_addr_t.
The idea is to add some type safety (via sparse checker)
that we lost when all addresses were converted to u64.

How does it sound?

--
MST

From mst at mellanox.co.il Mon Dec 25 15:46:50 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 01:46:50 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <45900F45.50906@voltaire.com>
References: <45900F45.50906@voltaire.com>
Message-ID: <20061225234648.GJ17469@mellanox.co.il>

> > Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64
> >
> > Michael S. Tsirkin wrote:
> >
> >
> >>>>Quoting r. Yosef Etigin :
> >>>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
> >>>>
> >>>>
> >>>
> >>>Which distro are you testing on?
> >>>
> >>>
> >>>
> >>
> >>I am testing on sles10, both ia64 and ppc64.
> >> > >> > >>>>Hello, > >>>>I've been testing ofed 1.2 build from > >>>>http://staging.openfabrics.org/builds/ > >>>>, (latest.tgz versions both user > >>>>and kernel) and got compilation erros on: ia64, ppc64: > >>>> > >>>>*ppc64:* > >>>> > >>>> make -w -C ip ip > >>>> make[2]: Entering directory > >>>> `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip' > >>>> [ ... omitted text ... ] > >>>> gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include > >>>> -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c > >>>> gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o > >>>> rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o > >>>> ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o > >>>> xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a > >>>> -lresolv -L../lib -lnetlink -lutil -o ip > >>>> /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when > >>>> searching for -lnetlink > >>>> /usr/bin/ld: skipping incompatible > >>>> /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when > >>>> searching for -lnetlink > >>>> /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when > >>>> searching for -lnetlink > >>>> /usr/bin/ld: cannot find -lnetlink > >>>> collect2: ld returned 1 exit status > >>>> make[2]: *** [ip] Error 1 > >>>> > >>>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides > >>>>CFLAGS (= instead of +=) > >>>> > >>>> > >>> > >>>Isn't this makefile part of iproute2? > >>>Can you build iproute on this platform? > >>> > >>> > >> > >>This makefile is indeed of iproute, > >>but it seems to make 32-bit object files for `iproute' during compilation > >>and therefore fails to find 64-bit during linkage of `ip'. > > > > > > Will installing the 32 bit version of the library help? > > > > > > I dont think so.. 
the issue arised during compilation, since `iproute'
> was inconsinsten in its use of -m64:
> The iproute Makefile overrides any `CFLAGS' it might get from top-level,
> thus throwing `-m64' away, while LDFLAGS are not overriden.
> Therefore, the compilation is done in 32bit while the linkage in 64bit

Probably the easiest thing is to fix iproute. Patch?

> >>>
> >>>
> >>>
> >>>
> >>>>*ia64:*
> >>>>
> >>>> make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
> >>>> obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
> >>>> gcc [ ... omitted text ... ] -c -o
> >>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
> >>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
> >>>> In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
> >>>> from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
> >>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >>>> ‘ib_sg_dma_address’:
> >>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
> >>>> implicit declaration of function ‘sg_dma_address’
> >>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >>>> ‘ib_sg_dma_len’:
> >>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
> >>>> implicit declaration of function ‘sg_dma_len’
> >>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
> >>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
> >>>> initialization from incompatible pointer type
> >>>> [ ... omitted text ... ]
> >>>> make: *** [kernel] Error 2
> >>>>
> >>>>
> >>>
> >>>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
> >>>I see this on upstream 2.6.16
> >>> asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
> >>>
> >>>
> >>
> >>Im running this of ia64, `sg_dma_len' is not defined there, nor anywhere
> >>else in this file, but in:
> >> ./asm-ia64/pci.h:82:#define sg_dma_len(sg) ((sg)->dma_length)
> >>
> >
> >
> > Isee, its fixed on 2.6.20.
> > Need to do something about it in the backport then.
> >
> > I wonder whether we can just put
> > #ifdef __ia64__
> > #define sg_dma_len(sg) ((sg)->dma_length)
> > #endif
> >
> > in kernel_addons/backports/2.6.16/include/asm/scatterlist.h
> >
> > Also need tofind out in which kernel this was fixed.
> >
>
> Looks like in all kernels up to 2.6.20 it was in `pci.h' so need to
> backtort to.. all previous versions

Right. Try sticking this in kernel_addons/backports/2.6.20 and copying it over.

--
MST

From eitan at sw053.yok.mtl.com Mon Dec 25 21:08:10 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Tue, 26 Dec 2006 07:08:10 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-26:normal completion
Message-ID: <200612260508.kBQ58Afr019644@sw053.yok.mtl.com>

OSM Simulation Regression Summary

OpenSM rev = Sun_Dec_24_08:19:04_2006 ef4b40
ibutils rev = Tue_Dec_26_00:00:31_2006 f81b3b

Total=351 Pass=350 Fail=1

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
38 OsmStress IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo

From mst at mellanox.co.il Tue Dec 26 00:51:44 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 26 Dec 2006 10:51:44 +0200 Subject: [openib-general] [PATCH/RFC] libibverbs: Improve driver loading In-Reply-To: <20061225141849.GI842@mellanox.co.il> References: <20061225141849.GI842@mellanox.co.il> Message-ID: <20061226085144.GA4325@mellanox.co.il> > > (To get a list of drivers, libibverbs parses all the config files it > > finds in $(sysconfdir)/libibverbs.d and also looks at the environment > > variables RDMAV_DRIVERS and IBV_DRIVERS) > > > > Then, instead of calling a specific entry point in the driver, > > libibverbs assumes the driver will call ibv_register_driver() from an > > __attribute__((constructor)) function. > > > > This has a number of benefits: > > - multiple drivers can be linked statically into an executable > > - LD_LIBRARY_PATH can be used to manage which drivers to load > > - different versions of the driver can be selected automagically at > > runtime (eg i686/cmov on i386 distros) > Wrt static linking: I see this warning when I link with -static: : warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking And it actually crashes inside dlopen on some platforms. Would it be possible to add a configuration option to avoid using dlopen for static apps? Or, maybe, it makes more sense to make an empty stub for libdl, and ask apps to link with that? -- MST From mst at mellanox.co.il Tue Dec 26 00:53:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 26 Dec 2006 10:53:11 +0200 Subject: [openib-general] libsdp.conf placement Message-ID: <20061226085311.GB4325@mellanox.co.il> I noticed autotools have sysconfdir variable. So it seems to me this would be the best, standard, place to keep the libsdp.conf file. Eitan? -- Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd. 
From eitan at mellanox.co.il Tue Dec 26 03:49:12 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 26 Dec 2006 13:49:12 +0200 Subject: [openib-general] libsdp.conf placement In-Reply-To: <20061226085311.GB4325@mellanox.co.il> References: <20061226085311.GB4325@mellanox.co.il> Message-ID: <45910C38.1020606@mellanox.co.il> Michael S. Tsirkin wrote: > I noticed autotools have sysconfdir variable. > So it seems to me this would be the best, standard, place to keep the > libsdp.conf file. > > Eitan? > > Unfortunately autotools are not doing the right thing. Quoting from libsdp Makefile.am: AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\" And then internally in the port.c code: #define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf" Somehow when you run ./configure you get $prefix/etc as the $sysconfdir EZ From mst at mellanox.co.il Tue Dec 26 03:53:09 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 26 Dec 2006 13:53:09 +0200 Subject: [openib-general] libsdp.conf placement In-Reply-To: <45910C38.1020606@mellanox.co.il> References: <45910C38.1020606@mellanox.co.il> Message-ID: <20061226115309.GE4325@mellanox.co.il> > > I noticed autotools have sysconfdir variable. > > So it seems to me this would be the best, standard, place to keep the > > libsdp.conf file. > > > > Eitan? > > > > > Unfortunately autotools are not doing the right thing. > Quoting from libsdp Makefile.am: > AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\" > > And then internally in the port.c code: > #define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf" > > Somehow when you run ./configure you get $prefix/etc as the $sysconfdir So, that's what all other libraries that use autotools will get (e.g. libibverbs) and that's the best default place then. If we want to, OFED can override sysconfdir with a configure switch, can it not? 
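
To make the mechanism under discussion concrete, here is a minimal sketch of how the two quoted fragments combine. The fallback value and the accessor `libsdp_default_config_file` are assumptions made for this example only; in the real build `SYSCONFDIR` arrives via `-DSYSCONFDIR=...` from Makefile.am. The fallback reflects autoconf's stock defaults, where the prefix is /usr/local and sysconfdir is $prefix/etc.

```c
/* Normally injected by the build as -DSYSCONFDIR="..."; this fallback is an
 * assumption for the sketch (default prefix /usr/local, so $prefix/etc
 * expands to /usr/local/etc). */
#ifndef SYSCONFDIR
#define SYSCONFDIR "/usr/local/etc"
#endif

/* Adjacent string literals are concatenated at compile time. */
#define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf"

/* Hypothetical accessor, named only for this example. */
const char *libsdp_default_config_file(void)
{
    return LIBSDP_DEFAULT_CONFIG_FILE;
}
```

With this arrangement, running `./configure --sysconfdir=/etc` would make the same macro expand to `/etc/libsdp.conf` without touching the source, which is the override being discussed.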
--
MST

From eitan at mellanox.co.il Tue Dec 26 03:57:26 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 26 Dec 2006 13:57:26 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <20061226115309.GE4325@mellanox.co.il>
References: <45910C38.1020606@mellanox.co.il> <20061226115309.GE4325@mellanox.co.il>
Message-ID: <45910E26.70002@mellanox.co.il>

Michael S. Tsirkin wrote:
>>> I noticed autotools have sysconfdir variable.
>>> So it seems to me this would be the best, standard, place to keep the
>>> libsdp.conf file.
>>>
>>> Eitan?
>>>
>>>
>>>
>> Unfortunately autotools are not doing the right thing.
>> Quoting from libsdp Makefile.am:
>> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
>>
>> And then internally in the port.c code:
>> #define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf"
>>
>> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
>>
>
> So, that's what all other libraries that use autotools will get
> (e.g. libibverbs) and that's the best default place then.
>
> If we want to, OFED can override sysconfdir with a configure switch, can it not?
>

Yes, it can, but some people might want to upgrade just libsdp. For those
I would prefer a more reasonable sysconfdir than $prefix/etc

EZ

From mst at mellanox.co.il Tue Dec 26 04:06:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 14:06:00 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <45910E26.70002@mellanox.co.il>
References: <45910E26.70002@mellanox.co.il>
Message-ID: <20061226120600.GF4325@mellanox.co.il>

> >>> I noticed autotools have sysconfdir variable.
> >>> So it seems to me this would be the best, standard, place to keep the
> >>> libsdp.conf file.
> >>>
> >>> Eitan?
> >>>
> >>>
> >>>
> >> Unfortunately autotools are not doing the right thing.
> >> Quoting from libsdp Makefile.am:
> >> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
> >>
> >> And then internally in the port.c code:
> >> #define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf"
> >>
> >> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
> >>
> >
> > So, that's what all other libraries that use autotools will get
> > (e.g. libibverbs) and that's the best default place then.
> >
> > If we want to, OFED can override sysconfdir with a configure switch, can it not?
> >
> Yes it can but some people might want to upgrade just libsdp. For those
> I would preferably use a more reasonable sysconfig then $prefix/etc

I think these people can use a configure switch, too (updating
just libsdp without OFED needs playing with configure switches anyway,
because of all the 64/32 bit situation).

My point is, let's not mess with the defaults unless strictly necessary -
otherwise libibverbs config is in one place, and libsdp is in another,
and it's a mess.

--
MST

From eitan at mellanox.co.il Tue Dec 26 04:10:58 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 26 Dec 2006 14:10:58 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <20061226120600.GF4325@mellanox.co.il>
References: <45910E26.70002@mellanox.co.il> <20061226120600.GF4325@mellanox.co.il>
Message-ID: <45911152.1000008@mellanox.co.il>

Michael S. Tsirkin wrote:
>>>>> I noticed autotools have sysconfdir variable.
>>>>> So it seems to me this would be the best, standard, place to keep the
>>>>> libsdp.conf file.
>>>>>
>>>>> Eitan?
>>>>>
>>>>>
>>>>>
>>>>>
>>>> Unfortunately autotools are not doing the right thing.
>>>> Quoting from libsdp Makefile.am:
>>>> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
>>>>
>>>> And then internally in the port.c code:
>>>> #define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf"
>>>>
>>>> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
>>>>
>>>>
>>> So, that's what all other libraries that use autotools will get
>>> (e.g. libibverbs) and that's the best default place then.
>>>
>>> If we want to, OFED can override sysconfdir with a configure switch, can it not?
>>>
>>>
>> Yes it can but some people might want to upgrade just libsdp. For those
>> I would preferably use a more reasonable sysconfig then $prefix/etc
>>
>
> I think these people can use a configure switch, too (updating
> just libsdp without OFED needs playing with configure switches anyway,
> because of all the 64/32 bit situation).
>
> My point is, let's not mess with the defaults unless strictly necessary -
> otherwise libibverbs config is in one place, and libsdp is in another,
> and its a mess.
>
>

RPM builds should use the --sysconfdir option of configure.
Still, the default is broken. I will probably find a way to fix that...
one day.

EZ

From mst at mellanox.co.il Tue Dec 26 04:19:11 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 14:19:11 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <45911152.1000008@mellanox.co.il>
References: <45911152.1000008@mellanox.co.il>
Message-ID: <20061226121911.GG4325@mellanox.co.il>

> >>>>> I noticed autotools have sysconfdir variable.
> >>>>> So it seems to me this would be the best, standard, place to keep the
> >>>>> libsdp.conf file.
> >>>>>
> >>>>> Eitan?
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> Unfortunately autotools are not doing the right thing.
> >>>> Quoting from libsdp Makefile.am:
> >>>> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
> >>>>
> >>>> And then internally in the port.c code:
> >>>> #define LIBSDP_DEFAULT_CONFIG_FILE SYSCONFDIR "/libsdp.conf"
> >>>>
> >>>> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
> >>>>
> >>>>
> >>> So, that's what all other libraries that use autotools will get
> >>> (e.g. libibverbs) and that's the best default place then.
> >>>
> >>> If we want to, OFED can override sysconfdir with a configure switch, can it not?
> >>>
> >>>
> >> Yes it can but some people might want to upgrade just libsdp. For those
> >> I would preferably use a more reasonable sysconfig then $prefix/etc
> >>
> >
> > I think these people can use a configure switch, too (updating
> > just libsdp without OFED needs playing with configure switches anyway,
> > because of all the 64/32 bit situation).
> >
> > My point is, let's not mess with the defaults unless strictly necessary -
> > otherwise libibverbs config is in one place, and libsdp is in another,
> > and its a mess.
> >
> >

So we are in agreement libsdp will put its config file in $sysconfdir,
and let packagers change where it points to?

> RPM making should use the --sysconfigdir option for configure.

OK, but if so it should do so for all libraries, not just libsdp. Right?

> Still the default is broken.

Looks like a matter of taste. What is important is to keep it consistent
across all libraries in OFED.

> I will probably find a way to fix that ..
> one day.

But for now, it defaults to $prefix/etc and if we want, OFED will override that
as appropriate?

--
MST

From yosefe at voltaire.com Tue Dec 26 07:58:36 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 26 Dec 2006 17:58:36 +0200
Subject: [openib-general] [PATCH] [MINOR] ipoibtools: fix compilation errors on ppc64
Message-ID: <1167148716.7006.17.camel@muscida>

Fix compilation errors of ipoibtools on ppc64 caused by overriding CFLAGS
in the Makefile.
Signed-off-by: Yosef Etigin
---
diff -ur a/src/userspace/ipoibtools/iproute2/Makefile b/src/userspace/ipoibtools/iproute2/Makefile
--- a/src/userspace/ipoibtools/iproute2/Makefile 2006-12-25 16:18:43.000000000 +0200
+++ b/src/userspace/ipoibtools/iproute2/Makefile 2006-12-25 15:54:40.000000000 +0200
@@ -22,7 +22,7 @@
 CC = gcc
 HOSTCC = gcc
 CCOPTS = -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall
-CFLAGS = $(CCOPTS) -I../include $(DEFINES)
+CFLAGS += $(CCOPTS) -I../include $(DEFINES)
 YACCFLAGS = -d -t -v
 
 LDLIBS += -L../lib -lnetlink -lutil

--
Yosef Etigin
Voltaire

From halr at voltaire.com Tue Dec 26 09:27:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:27:40 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: remove unused local variable
In-Reply-To: <20061224170248.GA7111@sashak.voltaire.com>
References: <20061224170248.GA7111@sashak.voltaire.com>
Message-ID: <1167154058.29620.1725.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 12:02, Sasha Khapyorsky wrote:
> Remove unused local variable.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From halr at voltaire.com Tue Dec 26 09:28:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:03 -0500
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
In-Reply-To: <20061224170329.GB7111@sashak.voltaire.com>
References: <20061224170329.GB7111@sashak.voltaire.com>
Message-ID: <1167154064.29620.1727.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 12:03, Sasha Khapyorsky wrote:
> When the port is removed from subnet, but previously requested pkey
> table block is received after this - the lock will be released twice.
> This leads to deadlocks later when other MAD processor will try to
> acquire the same lock.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From halr at voltaire.com Tue Dec 26 09:28:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:08 -0500
Subject: [openib-general] [PATCH] opensm: clean old references on ports linking
In-Reply-To: <20061224174315.GC7111@sashak.voltaire.com>
References: <20061224174315.GC7111@sashak.voltaire.com>
Message-ID: <1167154069.29620.1729.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 12:43, Sasha Khapyorsky wrote:
> When linking ports, cleanup old remote references. Without it the ports
> still be accessible as "linked" from old neighbors and in case of ports
> moving, when some MADs can be lost or reordered, OpenSM subnet data
> structures become broken.
>
> Signed-off-by: Sasha Khapyorsky

Good catch. Thanks. Applied.

-- Hal

From halr at voltaire.com Tue Dec 26 09:28:12 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:12 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <458E7532.5030400@mellanox.co.il>
References: <458E7532.5030400@mellanox.co.il>
Message-ID: <1167154075.29620.1731.camel@hal.voltaire.com>

Hi Eitan,

On Sun, 2006-12-24 at 07:40, Eitan Zahavi wrote:
> Hi Hal,
>
> OpenSM just uses the resulting path MTU/rate/pkt-life and fail the
> query even though the selector might be allowing for selecting an
> appropriate value.
>
> I have made the attached ibis based program for testing MTU select.
>
> After this fix the following results are obtained for a case of
> path allowing maximal 2K MTU .
>
> In standard mode:
> ------------------------------------------------------------
> MTU greater then ... 256 (0x01) -> equal to ....... 2K
> MTU less then ...... 256 (0x41) -> NO PATHS
> MTU equal to ....... 256 (0x81) -> equal to ....... 256
> MTU largest possible 256 (0xc1) -> equal to ....... 2K
> MTU greater then ... 512 (0x02) -> equal to ....... 2K
> MTU less then ...... 512 (0x42) -> equal to ....... 256
> MTU equal to ....... 512 (0x82) -> equal to ....... 512
> MTU largest possible 512 (0xc2) -> equal to ....... 2K
> MTU greater then ... 1K  (0x03) -> equal to ....... 2K
> MTU less then ...... 1K  (0x43) -> equal to ....... 512
> MTU equal to ....... 1K  (0x83) -> equal to ....... 1K
> MTU largest possible 1K  (0xc3) -> equal to ....... 2K
> MTU greater then ... 2K  (0x04) -> NO PATHS
> MTU less then ...... 2K  (0x44) -> equal to ....... 1K
> MTU equal to ....... 2K  (0x84) -> equal to ....... 2K
> MTU largest possible 2K  (0xc4) -> equal to ....... 2K
> MTU greater then ... 4K  (0x05) -> NO PATHS
> MTU less then ...... 4K  (0x45) -> equal to ....... 2K
> MTU equal to ....... 4K  (0x85) -> NO PATHS
> MTU largest possible 4K  (0xc5) -> equal to ....... 2K
> ============================================================
>
> With enable_quirks (when one of the ends is a Tavor device):
> ------------------------------------------------------------
> MTU greater then ... 256 (0x01) -> equal to ....... 1K
> MTU less then ...... 256 (0x41) -> NO PATHS
> MTU equal to ....... 256 (0x81) -> equal to ....... 256
> MTU largest possible 256 (0xc1) -> equal to ....... 2K
> MTU greater then ... 512 (0x02) -> equal to ....... 1K
> MTU less then ...... 512 (0x42) -> equal to ....... 256
> MTU equal to ....... 512 (0x82) -> equal to ....... 512
> MTU largest possible 512 (0xc2) -> equal to ....... 2K
> MTU greater then ... 1K  (0x03) -> NO PATHS
> MTU less then ...... 1K  (0x43) -> equal to ....... 512
> MTU equal to ....... 1K  (0x83) -> equal to ....... 1K
> MTU largest possible 1K  (0xc3) -> equal to ....... 2K
> MTU greater then ... 2K  (0x04) -> NO PATHS
> MTU less then ...... 2K  (0x44) -> equal to ....... 1K
> MTU equal to ....... 2K  (0x84) -> equal to ....... 2K
> MTU largest possible 2K  (0xc4) -> equal to ....... 2K
> MTU greater then ... 4K  (0x05) -> NO PATHS
> MTU less then ...... 4K  (0x45) -> equal to ....... 1K
> MTU equal to ....... 4K  (0x85) -> NO PATHS
> MTU largest possible 4K  (0xc5) -> equal to ....... 2K
> ============================================================
>
> Signed-off-by: Eitan Zahavi

Thanks. Applied.

Note osm_sa_multipath_record.c had 2 rejected hunks which were applied
by hand.

-- Hal

From halr at voltaire.com Tue Dec 26 09:28:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:26 -0500
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458EC94A.2050808@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <1166967379.4519.320031.camel@hal.voltaire.com> <458EC94A.2050808@mellanox.co.il>
Message-ID: <1167154101.29620.1733.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 13:39, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Sun, 2006-12-24 at 07:35, Eitan Zahavi wrote:
> >
> >> Hi Or,
> >>
> >> Sorry it took me a while.
> >>
> >> According to the IBTA spec:
> >> 1. In order for MTU and MTUSelector to have any effect their component
> >> mask bits MUST be set to 1 in the query
> >> 2. Behavior of the SM is defined with small "freedom" to choose between
> >> multiple matching MTU values if they exist.
> >>
> >
> > I agree in general but would like to be sure about the details. Please
> > be specific as to what IBA spec text you are referring to.
> >
> The text is part of the PathRecord table.

Are you referring to the description of XXXSelector ?

> >> 3. The table below summarizes all options:
> >>
> >> Assuming the value M represents the lowest MTU on the path
> >>
> >
> > Is M the lowest available MTU or the highest available MTU for that path
> > ?
> >
> M is the lowest MTU reported by all PortInfo for ports on the path.
                  ^^^ NeighborMTU

We are saying the same thing in different ways.

-- Hal

> >
> >> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> >> R represents the MTU value in the request. Similarly R-1 is one below R
> >> and R+1 is one above R.
> >>
> >> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should  | OpenSM Quirk w. Tavor End Port
> >> ----------+-----------+----------------+----------------+-------------------------------
> >> UNDEFINED | UNDEFINED | <= M           | M              | min(M,1K)
> >> R         | <         | <= min(R-1, M) | min(R-1, M)    | min(R-1, M, 1K)
> >> R         | =         | R if M>=R /ERR | R if M>=R /ERR | R if M>=R /ERR
> >> R         | >         | R < <= M       | R+1 if M>R /ERR| R+1 if M>R /ERR
> >>
> >                          ^^^^^^^^
> > For the R> spec response column, I think you are saying the same as:
> > >R AND <=M if M>R /ERR
> > or
> > R < x <=M if M>R /ERR
> > where x is resp value
> >
> Yes that is what I mean: the response value MUST be both bigger then R
> and equal or less to M. Otherwise an error.
> > I agree with this table given the redefinition of M above and R > spec
> > response interpretation.
> >
> Good.
> > -- Hal
> >
> >
> >> I have built some test code for making sure OpenSM does what is required.
> >> Apparently it does not. In any case the M is not identical to R it fails
> >> the request.
> >>
> >> I am working on fixing OpenSM.
> >>
> >> Any comments are welcome.
> >>
> >> EZ
> >>
> >> Or Gerlitz wrote:
> >>
> >>> Michael S. Tsirkin wrote:
> >>>
> >>>
> >>>> I am not yet sure what is best for upstream, so I don't really think we need
> >>>> any RFCs.
> >>>>
> >>>>
> >>>
> >>>
> >>>> We'll need data from SM guys on whether MTU selector actually works
> >>>> in SMs, and if not what happens when you enable it.
> >>>>
> >>>>
> >>> Eitan,
> >>>
> >>> Can you please post here the tavor-quirk patch which was integrated into
> >>> opensm? i can see the ***code*** of the opensm but might make some wrong
> >>> assumptions or get into wrong understandings as i am not able to see the
> >>> patch as is.
> >>>
> >>> Or.
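The selector rules in the table above, together with the test output posted in the "[PATCH] osm:Fix PathRecord bug" thread, can be written down as a small C function. This is only an illustration under stated assumptions, not the OpenSM code. The function and enum names are hypothetical. The request byte is assumed to carry the selector in its top two bits and the MTU code in the low six, as the 0x01/0x41/0x81/0xc1 values in the posted output suggest. For the "greater than" case the sketch returns the highest allowed MTU, which is what the posted output shows, rather than the R+1 listed in the table. The Tavor cap is applied only to the "greater" and "less" cases, again following the posted output.

```c
#include <assert.h>
#include <stdint.h>

/* MTU codes as they appear in the posted test output. */
enum { MTU_256 = 1, MTU_512 = 2, MTU_1K = 3, MTU_2K = 4, MTU_4K = 5 };

/* Selector values, taken from the top two bits of the request byte. */
enum { SEL_GREATER = 0, SEL_LESS = 1, SEL_EQUAL = 2, SEL_LARGEST = 3 };

#define NO_PATHS 0 /* hypothetical marker for "no matching path" */

/*
 * Hypothetical helper, not an OpenSM function.
 * req         request byte: selector in bits 7..6, MTU code in bits 5..0
 * path_mtu    M, the lowest MTU code along the path
 * tavor_quirk non-zero caps "greater"/"less" answers at 1K
 * Returns the MTU code to report, or NO_PATHS.
 */
int select_path_mtu(uint8_t req, int path_mtu, int tavor_quirk)
{
    int sel = (req >> 6) & 0x3;
    int r = req & 0x3f;
    int limit = path_mtu;

    assert(path_mtu >= MTU_256 && path_mtu <= MTU_4K);
    if (tavor_quirk && limit > MTU_1K)
        limit = MTU_1K;

    switch (sel) {
    case SEL_GREATER: /* R < x <= M */
        return limit > r ? limit : NO_PATHS;
    case SEL_LESS:    /* x <= min(R-1, M) */
        if (r - 1 < MTU_256)
            return NO_PATHS;
        return r - 1 < limit ? r - 1 : limit;
    case SEL_EQUAL:   /* R if M >= R, otherwise nothing */
        return (r >= MTU_256 && r <= path_mtu) ? r : NO_PATHS;
    default:          /* SEL_LARGEST: the requested value is ignored */
        return path_mtu;
    }
}
```

For example, with M = 2K, `select_path_mtu(0x45, MTU_2K, 0)` yields the 2K code and the same call with the quirk flag set yields the 1K code, matching the "less then 4K" rows in the two posted listings.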

From halr at voltaire.com Tue Dec 26 10:46:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 13:46:44 -0500
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
In-Reply-To: <1167154064.29620.1727.camel@hal.voltaire.com>
References: <20061224170329.GB7111@sashak.voltaire.com> <1167154064.29620.1727.camel@hal.voltaire.com>
Message-ID: <1167158802.29620.5949.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
> On Sun, 2006-12-24 at 12:03, Sasha Khapyorsky wrote:
> > When the port is removed from subnet, but previously requested pkey
> > table block is received after this - the lock will be released twice.
> > This leads to deadlocks later when other MAD processor will try to
> > acquire the same lock.
> >
> > Signed-off-by: Sasha Khapyorsky
>
> Thanks. Applied.

Looks like this applied to OFED 1.1 as well.
-- Hal > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Tue Dec 26 10:47:39 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Dec 2006 13:47:39 -0500 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors In-Reply-To: <1167154075.29620.1731.camel@hal.voltaire.com> References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> Message-ID: <1167158845.29620.6014.camel@hal.voltaire.com> Hi again Eitan, On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote: > Hi Eitan, > > On Sun, 2006-12-24 at 07:40, Eitan Zahavi wrote: > > Hi Hal, > > > > OpenSM just uses the resulting path MTU/rate/pkt-life and fail the > > query even though the selector might be allowing for selecting an > > appropriate value. > > > > I have made the attached ibis based program for testing MTU select. > > > > After this fix the following results are obtained for a case of > > path allowing maximal 2K MTU . > > > > In standard mode: > > ------------------------------------------------------------ > > MTU greater then ... 256 (0x01) -> equal to ....... 2K > > MTU less then ...... 256 (0x41) -> NO PATHS > > MTU equal to ....... 256 (0x81) -> equal to ....... 256 > > MTU largest possible 256 (0xc1) -> equal to ....... 2K > > MTU greater then ... 512 (0x02) -> equal to ....... 2K > > MTU less then ...... 512 (0x42) -> equal to ....... 256 > > MTU equal to ....... 512 (0x82) -> equal to ....... 512 > > MTU largest possible 512 (0xc2) -> equal to ....... 2K > > MTU greater then ... 1K (0x03) -> equal to ....... 2K > > MTU less then ...... 1K (0x43) -> equal to ....... 512 > > MTU equal to ....... 1K (0x83) -> equal to ....... 
1K > > MTU largest possible 1K (0xc3) -> equal to ....... 2K > > MTU greater then ... 2K (0x04) -> NO PATHS > > MTU less then ...... 2K (0x44) -> equal to ....... 1K > > MTU equal to ....... 2K (0x84) -> equal to ....... 2K > > MTU largest possible 2K (0xc4) -> equal to ....... 2K > > MTU greater then ... 4K (0x05) -> NO PATHS > > MTU less then ...... 4K (0x45) -> equal to ....... 2K > > MTU equal to ....... 4K (0x85) -> NO PATHS > > MTU largest possible 4K (0xc5) -> equal to ....... 2K > > ============================================================ > > > > With enable_quirks (when one of the ends is a Tavor device): > > ------------------------------------------------------------ > > MTU greater then ... 256 (0x01) -> equal to ....... 1K > > MTU less then ...... 256 (0x41) -> NO PATHS > > MTU equal to ....... 256 (0x81) -> equal to ....... 256 > > MTU largest possible 256 (0xc1) -> equal to ....... 2K > > MTU greater then ... 512 (0x02) -> equal to ....... 1K > > MTU less then ...... 512 (0x42) -> equal to ....... 256 > > MTU equal to ....... 512 (0x82) -> equal to ....... 512 > > MTU largest possible 512 (0xc2) -> equal to ....... 2K > > MTU greater then ... 1K (0x03) -> NO PATHS > > MTU less then ...... 1K (0x43) -> equal to ....... 512 > > MTU equal to ....... 1K (0x83) -> equal to ....... 1K > > MTU largest possible 1K (0xc3) -> equal to ....... 2K > > MTU greater then ... 2K (0x04) -> NO PATHS > > MTU less then ...... 2K (0x44) -> equal to ....... 1K > > MTU equal to ....... 2K (0x84) -> equal to ....... 2K > > MTU largest possible 2K (0xc4) -> equal to ....... 2K > > MTU greater then ... 4K (0x05) -> NO PATHS > > MTU less then ...... 4K (0x45) -> equal to ....... 1K > > MTU equal to ....... 4K (0x85) -> NO PATHS > > MTU largest possible 4K (0xc5) -> equal to ....... 2K > > ============================================================ > > > > Signed-off-by: Eitan Zahavi > > Thanks. Applied. 
Note osm_sa_multipath_record.c had 2 rejected hunks
> which were applied by hand.

Should this be applied for OFED 1.1 as well ?

-- Hal

> -- Hal

From halr at voltaire.com Tue Dec 26 10:47:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 13:47:26 -0500
Subject: [openib-general] [PATCH] opensm: clean old references on ports linking
In-Reply-To: <1167154069.29620.1729.camel@hal.voltaire.com>
References: <20061224174315.GC7111@sashak.voltaire.com> <1167154069.29620.1729.camel@hal.voltaire.com>
Message-ID: <1167158805.29620.5951.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
> On Sun, 2006-12-24 at 12:43, Sasha Khapyorsky wrote:
> > When linking ports, cleanup old remote references. Without it the ports
> > still be accessible as "linked" from old neighbors and in case of ports
> > moving, when some MADs can be lost or reordered, OpenSM subnet data
> > structures become broken.
> >
> > Signed-off-by: Sasha Khapyorsky
>
> Good catch.
>
> Thanks. Applied.

Looks like this applied to OFED 1.1 as well.

-- Hal

> -- Hal

From halr at voltaire.com Tue Dec 26 11:15:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 14:15:27 -0500
Subject: [openib-general] Old svn repository access
Message-ID: <1167160526.29620.7478.camel@hal.voltaire.com>

Hi,

Thought the old svn repository was made RO. When I do a RO operation to
it, I get the following error:

svn log | more
(R)eject, accept (t)emporarily or accept (p)ermanently?
svn: PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c'
svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://openib.org)

Shouldn't this work ?

-- Hal

From mst at mellanox.co.il Tue Dec 26 11:26:03 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 21:26:03 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <1167158845.29620.6014.camel@hal.voltaire.com>
References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> <1167158845.29620.6014.camel@hal.voltaire.com>
Message-ID: <20061226192603.GA4815@mellanox.co.il>

> Should this be applied for OFED 1.1 as well ?

There are a lot of other fixes all over the stack that might be
useful to people.
But first EWG needs to decide how OFED 1.1 support will be done.

For now, the only thing we have is the support wiki with links
to patches. So if there's a customer that is hit by one
of these bugs, a patch should be created and put here,
and description added to wiki.

--
MST

From halr at voltaire.com Tue Dec 26 11:34:58 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 14:34:58 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <20061226192603.GA4815@mellanox.co.il>
References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> <1167158845.29620.6014.camel@hal.voltaire.com> <20061226192603.GA4815@mellanox.co.il>
Message-ID: <1167161695.29620.8457.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 14:26, Michael S. Tsirkin wrote:
> > Should this be applied for OFED 1.1 as well ?
>
> There are a lot of other fixes all over the stack that might be
> useful to people.
> But first EWG needs to decide how OFED 1.1 support will be done.

I thought that was already decided. Tziporet indicated to do this a while ago (post 1.1 "ship"). > For now, the only thing we have is the support wiki with links > to patches. So if there's a customer that is hit by one > of these bugs, a patch should be created and put here, > and description added to wiki. Yes and the sources updated as well just in case a new SRPM is created... -- Hal From eitan at mellanox.co.il Tue Dec 26 11:54:48 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 26 Dec 2006 21:54:48 +0200 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors In-Reply-To: <1167158845.29620.6014.camel@hal.voltaire.com> References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> <1167158845.29620.6014.camel@hal.voltaire.com> Message-ID: <45917E08.3010005@mellanox.co.il> Hal Rosenstock wrote: > Hi again Eitan, > > On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote: > >> Hi Eitan, >> >> On Sun, 2006-12-24 at 07:40, Eitan Zahavi wrote: >> >>> Hi Hal, >>> >>> OpenSM just uses the resulting path MTU/rate/pkt-life and fail the >>> query even though the selector might be allowing for selecting an >>> appropriate value. >>> >>> I have made the attached ibis based program for testing MTU select. >>> >>> After this fix the following results are obtained for a case of >>> path allowing maximal 2K MTU . >>> >>> In standard mode: >>> ------------------------------------------------------------ >>> MTU greater then ... 256 (0x01) -> equal to ....... 2K >>> MTU less then ...... 256 (0x41) -> NO PATHS >>> MTU equal to ....... 256 (0x81) -> equal to ....... 256 >>> MTU largest possible 256 (0xc1) -> equal to ....... 2K >>> MTU greater then ... 512 (0x02) -> equal to ....... 2K >>> MTU less then ...... 512 (0x42) -> equal to ....... 256 >>> MTU equal to ....... 512 (0x82) -> equal to ....... 512 >>> MTU largest possible 512 (0xc2) -> equal to ....... 2K >>> MTU greater then ... 
1K (0x03) -> equal to ....... 2K >>> MTU less then ...... 1K (0x43) -> equal to ....... 512 >>> MTU equal to ....... 1K (0x83) -> equal to ....... 1K >>> MTU largest possible 1K (0xc3) -> equal to ....... 2K >>> MTU greater then ... 2K (0x04) -> NO PATHS >>> MTU less then ...... 2K (0x44) -> equal to ....... 1K >>> MTU equal to ....... 2K (0x84) -> equal to ....... 2K >>> MTU largest possible 2K (0xc4) -> equal to ....... 2K >>> MTU greater then ... 4K (0x05) -> NO PATHS >>> MTU less then ...... 4K (0x45) -> equal to ....... 2K >>> MTU equal to ....... 4K (0x85) -> NO PATHS >>> MTU largest possible 4K (0xc5) -> equal to ....... 2K >>> ============================================================ >>> >>> With enable_quirks (when one of the ends is a Tavor device): >>> ------------------------------------------------------------ >>> MTU greater then ... 256 (0x01) -> equal to ....... 1K >>> MTU less then ...... 256 (0x41) -> NO PATHS >>> MTU equal to ....... 256 (0x81) -> equal to ....... 256 >>> MTU largest possible 256 (0xc1) -> equal to ....... 2K >>> MTU greater then ... 512 (0x02) -> equal to ....... 1K >>> MTU less then ...... 512 (0x42) -> equal to ....... 256 >>> MTU equal to ....... 512 (0x82) -> equal to ....... 512 >>> MTU largest possible 512 (0xc2) -> equal to ....... 2K >>> MTU greater then ... 1K (0x03) -> NO PATHS >>> MTU less then ...... 1K (0x43) -> equal to ....... 512 >>> MTU equal to ....... 1K (0x83) -> equal to ....... 1K >>> MTU largest possible 1K (0xc3) -> equal to ....... 2K >>> MTU greater then ... 2K (0x04) -> NO PATHS >>> MTU less then ...... 2K (0x44) -> equal to ....... 1K >>> MTU equal to ....... 2K (0x84) -> equal to ....... 2K >>> MTU largest possible 2K (0xc4) -> equal to ....... 2K >>> MTU greater then ... 4K (0x05) -> NO PATHS >>> MTU less then ...... 4K (0x45) -> equal to ....... 1K >>> MTU equal to ....... 4K (0x85) -> NO PATHS >>> MTU largest possible 4K (0xc5) -> equal to ....... 
2K >>> ============================================================ >>> >>> Signed-off-by: Eitan Zahavi >>> >> Thanks. Applied. Note osm_sa_multipath_record.c had 2 rejected hunks >> which were applied by hand. >> > > Should this be applied for OFED 1.1 as well ? > I would say it should. But I think it deserves OFED group call. > -- Hal > > >> -- Hal >> >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst at mellanox.co.il Tue Dec 26 12:01:58 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 26 Dec 2006 22:01:58 +0200 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors In-Reply-To: <1167161695.29620.8457.camel@hal.voltaire.com> References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> <1167158845.29620.6014.camel@hal.voltaire.com> <20061226192603.GA4815@mellanox.co.il> <1167161695.29620.8457.camel@hal.voltaire.com> Message-ID: <20061226200158.GF4815@mellanox.co.il> > > > Should this be applied for OFED 1.1 as well ? > > > > There are a lot of other fixes all over the stack that might be > > useful to people. > > But first EWG needs to decide how OFED 1.1 support will be done. > > I thought that was already decided. Tziporet indicated to do this a > while ago (post 1.1 "ship"). The support page. Yes. But not for new SRPMs. > > For now, the only thing we have is the support wiki with links > > to patches. 
So if there's a customer that is hit by one > > of these bugs, a patch should be created and put here, > > and description added to wiki. > > Yes and the sources updated as well just in case a new SRPM is > created... That's the big question. Suppose someone decides there's a show-stopper he wants fixed (like ehca guys had) and wants to build a bugfix release. This entity might not care about or use opensm, but since you checked stuff into branch, a version of opensm that was not properly QA'd will get dropped in this dot release. It would have been better to stick with the QA'd code from 1.1. So what I am saying, *when* there's a release the person(s) that do it should decide changes in which packages do they want. All this stems from the model we had for OFED, where we have a global "BUILD ID" and a monolitic package instead of a set of modules which can be updated individually. Hopefully maintainers (besides Roland that is) will finally start making releases of packages, then OFED will package them together but user will be later able to update some package separately. This clearly applies to userspace libraries, and maybe for kernel modules we can also invent something like this too, so that e.g. ehca module can be updated without risking breaking mthca. -- MST From mst at mellanox.co.il Tue Dec 26 12:04:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 26 Dec 2006 22:04:16 +0200 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors In-Reply-To: <45917E08.3010005@mellanox.co.il> References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> <1167158845.29620.6014.camel@hal.voltaire.com> <45917E08.3010005@mellanox.co.il> Message-ID: <20061226200416.GG4815@mellanox.co.il> > > Should this be applied for OFED 1.1 as well ? > > > I would say it should. But I think it deserves OFED group call. 
I think we should apply things to ofed branch only before bugfix release, and only for packages that will be re-tested, otherwise untested code will ship. -- MST From halr at voltaire.com Tue Dec 26 12:30:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Dec 2006 15:30:30 -0500 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLife explicitly ignoring selectors In-Reply-To: <20061226200158.GF4815@mellanox.co.il> References: <458E7532.5030400@mellanox.co.il> <1167154075.29620.1731.camel@hal.voltaire.com> <1167158845.29620.6014.camel@hal.voltaire.com> <20061226192603.GA4815@mellanox.co.il> <1167161695.29620.8457.camel@hal.voltaire.com> <20061226200158.GF4815@mellanox.co.il> Message-ID: <1167165028.29620.11344.camel@hal.voltaire.com> On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote: > > > > Should this be applied for OFED 1.1 as well ? > > > > > > There are a lot of other fixes all over the stack that might be > > > useful to people. > > > But first EWG needs to decide how OFED 1.1 support will be done. > > > > I thought that was already decided. Tziporet indicated to do this a > > while ago (post 1.1 "ship"). > > The support page. Yes. But not for new SRPMs. That's fine with me but not what a previous email said (in terms of updating the sources) and what has been followed for OpenSM at least until now... > > > For now, the only thing we have is the support wiki with links > > > to patches. So if there's a customer that is hit by one > > > of these bugs, a patch should be created and put here, > > > and description added to wiki. > > > > Yes and the sources updated as well just in case a new SRPM is > > created... > > That's the big question. Suppose someone decides there's a > show-stopper he wants fixed (like ehca guys had) and wants to build > a bugfix release. 
This entity might not care about or use opensm, > but since you checked stuff into branch, a version of opensm that > was not properly QA'd will get dropped in this dot release. > It would have been better to stick with the QA'd code from 1.1. > > So what I am saying, *when* there's a release the person(s) > that do it should decide changes in which packages do they want. > > All this stems from the model we had for OFED, where > we have a global "BUILD ID" and a monolitic package > instead of a set of modules which can be updated individually. > > Hopefully maintainers (besides Roland that is) will finally > start making releases of packages, This has been agreed to and will be done before 1/31 for OFED 1.2. -- Hal > then OFED will package > them together but user will be later able to update some package > separately. This clearly applies to userspace libraries, > and maybe for kernel modules we can also invent something like this too, > so that e.g. ehca module can be updated without risking breaking > mthca. From mst at mellanox.co.il Tue Dec 26 12:45:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 26 Dec 2006 22:45:04 +0200 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLifeexplicitly ignoring selectors In-Reply-To: <1167165028.29620.11344.camel@hal.voltaire.com> References: <1167165028.29620.11344.camel@hal.voltaire.com> Message-ID: <20061226204504.GB4329@mellanox.co.il> > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote: > > > > > Should this be applied for OFED 1.1 as well ? > > > > > > > > There are a lot of other fixes all over the stack that might be > > > > useful to people. > > > > But first EWG needs to decide how OFED 1.1 support will be done. > > > > > > I thought that was already decided. Tziporet indicated to do this a > > > while ago (post 1.1 "ship"). > > > > The support page. Yes. But not for new SRPMs. 
> > That's fine with me but not what a previous email said (in terms of > updating the sources) and what has been followed for OpenSM at least > until now... Maybe I'm wrong. I don't have that mail around. Was not the idea that when someone wants to do a bugfix release he puts just these fixes in a package, tests it and releases the update? If so opensm should be updated only if it will be-retested, and this is only needed before release. -- MST From halr at voltaire.com Tue Dec 26 12:55:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 26 Dec 2006 15:55:00 -0500 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLifeexplicitly ignoring selectors In-Reply-To: <20061226204504.GB4329@mellanox.co.il> References: <1167165028.29620.11344.camel@hal.voltaire.com> <20061226204504.GB4329@mellanox.co.il> Message-ID: <1167166497.29620.12552.camel@hal.voltaire.com> On Tue, 2006-12-26 at 15:45, Michael S. Tsirkin wrote: > > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote: > > > > > > Should this be applied for OFED 1.1 as well ? > > > > > > > > > > There are a lot of other fixes all over the stack that might be > > > > > useful to people. > > > > > But first EWG needs to decide how OFED 1.1 support will be done. > > > > > > > > I thought that was already decided. Tziporet indicated to do this a > > > > while ago (post 1.1 "ship"). > > > > > > The support page. Yes. But not for new SRPMs. > > > > That's fine with me but not what a previous email said (in terms of > > updating the sources) and what has been followed for OpenSM at least > > until now... > > Maybe I'm wrong. I don't have that mail around. I can repost it if needed (or point to a URL for it). > Was not the idea that when someone wants to do a bugfix release > he puts just these fixes in a package, tests it and releases the update? There was no mention of the testing aspects in that email. 
-- Hal > If so opensm should be updated only if it will be-retested, and > this is only needed before release. From mst at mellanox.co.il Tue Dec 26 13:38:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 26 Dec 2006 23:38:04 +0200 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in usingMTU/rate/PktLifeexplicitly ignoring selectors In-Reply-To: <1167166497.29620.12552.camel@hal.voltaire.com> References: <1167166497.29620.12552.camel@hal.voltaire.com> Message-ID: <20061226213804.GC4329@mellanox.co.il> > > > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote: > > > > > > > Should this be applied for OFED 1.1 as well ? > > > > > > > > > > > > There are a lot of other fixes all over the stack that might be > > > > > > useful to people. > > > > > > But first EWG needs to decide how OFED 1.1 support will be done. > > > > > > > > > > I thought that was already decided. Tziporet indicated to do this a > > > > > while ago (post 1.1 "ship"). > > > > > > > > The support page. Yes. But not for new SRPMs. > > > > > > That's fine with me but not what a previous email said (in terms of > > > updating the sources) and what has been followed for OpenSM at least > > > until now... > > > > Maybe I'm wrong. I don't have that mail around. > > I can repost it if needed (or point to a URL for it). Why not? > > Was not the idea that when someone wants to do a bugfix release > > he puts just these fixes in a package, tests it and releases the update? > > There was no mention of the testing aspects in that email. So, what do you think? > > If so opensm should be updated only if it will be-retested, and > > this is only needed before release. -- MST From sashak at voltaire.com Tue Dec 26 16:35:09 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 27 Dec 2006 02:35:09 +0200 Subject: [openib-general] [PATCH] opensm: rwlock double-release fix. 
In-Reply-To: <1167158802.29620.5949.camel@hal.voltaire.com> References: <20061224170329.GB7111@sashak.voltaire.com> <1167154064.29620.1727.camel@hal.voltaire.com> <1167158802.29620.5949.camel@hal.voltaire.com> Message-ID: <20061227003509.GB32492@sashak.voltaire.com> On 13:46 Tue 26 Dec , Hal Rosenstock wrote: > On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote: > > On Sun, 2006-12-24 at 12:03, Sasha Khapyorsky wrote: > > > When the port is removed from subnet, but previously requested pkey > > > table block is received after this - the lock will be released twice. > > > This leads to deadlocks later when other MAD processor will try to > > > acquire the same lock. > > > > > > Signed-off-by: Sasha Khapyorsky > > > > Thanks. Applied. > > Looks like this applied to OFED 1.1 as well. Yes, this is the old code. Sasha From sashak at voltaire.com Tue Dec 26 16:35:55 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 27 Dec 2006 02:35:55 +0200 Subject: [openib-general] [PATCH] opensm: clean old references on ports linking In-Reply-To: <1167158805.29620.5951.camel@hal.voltaire.com> References: <20061224174315.GC7111@sashak.voltaire.com> <1167154069.29620.1729.camel@hal.voltaire.com> <1167158805.29620.5951.camel@hal.voltaire.com> Message-ID: <20061227003555.GC32492@sashak.voltaire.com> On 13:47 Tue 26 Dec , Hal Rosenstock wrote: > On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote: > > On Sun, 2006-12-24 at 12:43, Sasha Khapyorsky wrote: > > > When linking ports, cleanup old remote references. Without it the ports > > > still be accessible as "linked" from old neighbors and in case of ports > > > moving, when some MADs can be lost or reordered, OpenSM subnet data > > > structures become broken. > > > > > > Signed-off-by: Sasha Khapyorsky > > > > Good catch. > > > > Thanks. Applied. > > Looks like this applied to OFED 1.1 as well. Yes. 
Sasha From sashak at voltaire.com Tue Dec 26 17:16:15 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 27 Dec 2006 03:16:15 +0200 Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using MTU/rate/PktLifeexplicitly ignoring selectors In-Reply-To: <20061226204504.GB4329@mellanox.co.il> References: <1167165028.29620.11344.camel@hal.voltaire.com> <20061226204504.GB4329@mellanox.co.il> Message-ID: <20061227011615.GD32492@sashak.voltaire.com> On 22:45 Tue 26 Dec , Michael S. Tsirkin wrote: > > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote: > > > > > > Should this be applied for OFED 1.1 as well ? > > > > > > > > > > There are a lot of other fixes all over the stack that might be > > > > > useful to people. > > > > > But first EWG needs to decide how OFED 1.1 support will be done. > > > > > > > > I thought that was already decided. Tziporet indicated to do this a > > > > while ago (post 1.1 "ship"). > > > > > > The support page. Yes. But not for new SRPMs. > > > > That's fine with me but not what a previous email said (in terms of > > updating the sources) and what has been followed for OpenSM at least > > until now... > > Maybe I'm wrong. I don't have that mail around. > Was not the idea that when someone wants to do a bugfix release > he puts just these fixes in a package, tests it and releases the update? What is the point to put all together and have minimal testing time, w/out any native pre-release testing? Why source version control is needed then? If you need to remember where a last release point was just use tag (or date). And if one will need to "cherrypick" fixes she/he will be able to use this tag. Sasha > > If so opensm should be updated only if it will be-retested, and > this is only needed before release. 
>
> --
> MST

From eitan at sw053.yok.mtl.com Tue Dec 26 21:10:18 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Wed, 27 Dec 2006 07:10:18 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-27:normal completion
Message-ID: <200612270510.kBR5AIn5016958@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Tue_Dec_26_12:24:26_2006 1ae301
ibutils rev = Tue_Dec_26_00:00:31_2006 f81b3b
Total=351 Pass=349 Fail=2

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
38 OsmTest IS1-16.topo
38 OsmStress IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo

Failures:
1 OsmTest IS1-16.topo
1 OsmStress IS1-16.topo

From eitan at mellanox.co.il Tue Dec 26 22:47:47 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 27 Dec 2006 08:47:47 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-27:normal completion
In-Reply-To: <200612270510.kBR5AIn5016958@sw053.yok.mtl.com>
References: <200612270510.kBR5AIn5016958@sw053.yok.mtl.com>
Message-ID: <45921713.4040301@mellanox.co.il>

Analysis:
OsmStress: TEST ISSUE = Somehow OpenSM lost its local port, which should never have gone into the DOWN state.
OsmTest: ibmgtsim issue = the fix I introduced for the deadlock actually causes a race on client close that makes the simulator segfault. I need to really resolve the deadlock. Should have known it's coming.

EZ Eitan Zahavi wrote: > OSM Simulation Regression Summary > OpenSM rev = Tue_Dec_26_12:24:26_2006 1ae301 > ibutils rev = Tue_Dec_26_00:00:31_2006 f81b3b > Total=351 Pass=349 Fail=2 > > Pass: > 39 Stability IS1-16.topo > 39 Pkey IS1-16.topo > 39 Multicast IS1-16.topo > 39 LidMgr IS1-16.topo > 38 OsmTest IS1-16.topo > 38 OsmStress IS1-16.topo > 13 Stability IS3-loop.topo > 13 Stability IS3-128.topo > 13 Pkey IS3-128.topo > 13 OsmTest IS3-loop.topo > 13 OsmTest IS3-128.topo > 13 OsmStress IS3-128.topo > 13 Multicast IS3-loop.topo > 13 Multicast IS3-128.topo > 13 LidMgr IS3-128.topo > > Failures: > 1 OsmTest IS1-16.topo > 1 OsmStress IS1-16.topo > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From yosefe at voltaire.com Tue Dec 26 23:27:41 2006 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 27 Dec 2006 09:27:41 +0200 Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64 In-Reply-To: <20061225234648.GJ17469@mellanox.co.il> References: <45900F45.50906@voltaire.com> <20061225234648.GJ17469@mellanox.co.il> Message-ID: <4592206D.3070206@voltaire.com> Michael S. Tsirkin wrote: >>>Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64 >>> >>>Michael S. Tsirkin wrote: >>> >>> >>> >>>>>>Quoting r. Yosef Etigin : >>>>>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64 >>>>>> >>>>>> >>>>> >>>>>Which distro are you testing on? >>>>> >>>>> >>>>> >>>> >>>>I am testing on sles10, both ia64 and ppc64. 
>>>> >>>> >>>> >>>>>>Hello, >>>>>>I've been testing ofed 1.2 build from >>>>>>http://staging.openfabrics.org/builds/ >>>>>>, (latest.tgz versions both user >>>>>>and kernel) and got compilation erros on: ia64, ppc64: >>>>>> >>>>>>*ppc64:* >>>>>> >>>>>> make -w -C ip ip >>>>>> make[2]: Entering directory >>>>>> `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip' >>>>>> [ ... omitted text ... ] >>>>>> gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include >>>>>> -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c >>>>>> gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o >>>>>> rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o >>>>>> ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o >>>>>> xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a >>>>>> -lresolv -L../lib -lnetlink -lutil -o ip >>>>>> /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when >>>>>> searching for -lnetlink >>>>>> /usr/bin/ld: skipping incompatible >>>>>> /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when >>>>>> searching for -lnetlink >>>>>> /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when >>>>>> searching for -lnetlink >>>>>> /usr/bin/ld: cannot find -lnetlink >>>>>> collect2: ld returned 1 exit status >>>>>> make[2]: *** [ip] Error 1 >>>>>> >>>>>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides >>>>>>CFLAGS (= instead of +=) >>>>>> >>>>>> >>>>> >>>>>Isn't this makefile part of iproute2? >>>>>Can you build iproute on this platform? >>>>> >>>>> >>>> >>>>This makefile is indeed of iproute, >>>>but it seems to make 32-bit object files for `iproute' during compilation >>>>and therefore fails to find 64-bit during linkage of `ip'. >>> >>> >>>Will installing the 32 bit version of the library help? >>> >>> >> >>I dont think so.. 
the issue arised during compilation, since `iproute' >>was inconsinsten in its use of -m64: >>The iproute Makefile overrides any `CFLAGS' it might get from top-level, >>thus throwing `-m64' away, while LDFLAGS are not overriden. >>Therefore, the compilation is done in 32bit while the linkage in 64bit > > > Probably the easies thing is to fix iproute. Patch? > > >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>>*ia64:* >>>>>> >>>>>> make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build >>>>>> obj=/tmp/openib_gen2/kernel/drivers/infiniband/core >>>>>> gcc [ ... omitted text ... ] -c -o >>>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o >>>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c >>>>>> In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37, >>>>>> from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38: >>>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function >>>>>> ‘ib_sg_dma_address’: >>>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error: >>>>>> implicit declaration of function ‘sg_dma_address’ >>>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function >>>>>> ‘ib_sg_dma_len’: >>>>>> /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error: >>>>>> implicit declaration of function ‘sg_dma_len’ >>>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level: >>>>>> /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning: >>>>>> initialization from incompatible pointer type >>>>>> [ ... omitted text ... ] >>>>>> make: *** [kernel] Error 2 >>>>>> >>>>>> >>>>> >>>>>Probably a distro-specific backport problem - check how come sg_dma_len is not defined. 
>>>>>I see this on upstream 2.6.16
>>>>> asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
>>>>>
>>>>>
>>>>
>>>>I'm running this on ia64, `sg_dma_len' is not defined there, nor anywhere
>>>>else in this file, but in:
>>>> ./asm-ia64/pci.h:82:#define sg_dma_len(sg) ((sg)->dma_length)
>>>>
>>>
>>>
>>>I see, it's fixed in 2.6.20.
>>>Need to do something about it in the backport then.
>>>
>>>I wonder whether we can just put
>>>#ifdef __ia64__
>>>#define sg_dma_len(sg) ((sg)->dma_length)
>>>#endif
>>>
>>>in kernel_addons/backports/2.6.16/include/asm/scatterlist.h
>>>
>>>Also need to find out in which kernel this was fixed.
>>>
>>
>>Looks like in all kernels up to 2.6.20 it was in `pci.h' so need to
>>backport to all previous versions
>
>
> Right. Try sticking this in kernel_addons/backports/2.6.20 and
> copying it over.
>

OK, I put:

#ifndef BACKPORT_SCATTERLIST_H
#define BACKPORT_SCATTERLIST_H

#include_next <asm/scatterlist.h>

#ifdef __ia64__
#define sg_dma_address(sg) ((sg)->dma_address)
#define sg_dma_len(sg) ((sg)->dma_length)
#endif

#endif

in kernel_addons/backport/X where X<=2.6.19
and it does the job

--
Yossi

From mst at mellanox.co.il Tue Dec 26 23:53:21 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 09:53:21 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <4592206D.3070206@voltaire.com>
References: <45900F45.50906@voltaire.com> <20061225234648.GJ17469@mellanox.co.il> <4592206D.3070206@voltaire.com>
Message-ID: <20061227075321.GE19436@mellanox.co.il>

> OK, I put:
>
> #ifndef BACKPORT_SCATTERLIST_H
> #define BACKPORT_SCATTERLIST_H
>
> #include_next <asm/scatterlist.h>
>
> #ifdef __ia64__
> #define sg_dma_address(sg) ((sg)->dma_address)
> #define sg_dma_len(sg) ((sg)->dma_length)
> #endif
>
> #endif
>
> in kernel_addons/backport/X where X<=2.6.19
> and it does the job

OK. Where can I pull all this from?

-- MST From kliteyn at dev.mellanox.co.il Wed Dec 27 01:03:23 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 11:03:23 +0200 Subject: [openib-general] [PATCH 1/3] osm: Changes for windows compatability Message-ID: <459236DB.70009@dev.mellanox.co.il> Hi Hal. Fixing windows compilation problems. Signed-off-by: Yevgeny Kliteynik --- osm/include/iba/ib_types.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index 723e8b9..ec65b64 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -59,9 +59,10 @@ BEGIN_C_DECLS #define OSM_EXPORT __declspec(dllimport) #endif #define OSM_API __stdcall + #define OSM_CDECL __cdecl #else #define OSM_EXPORT extern - #define OSM_API + #define OSM_CDECL #define __ptr64 #endif -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Wed Dec 27 01:03:50 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 11:03:50 +0200 Subject: [openib-general] [PATCH 2/3] osm: Changes for windows compatability Message-ID: <459236F6.8060707@dev.mellanox.co.il> Hi Hal. Fixing windows compilation problems. 
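
For readers wondering why a calling-convention macro matters here: the series adds OSM_CDECL to the comparison callbacks, presumably because the Windows C runtime declares the qsort() callback as __cdecl while the build may default to another convention. The sketch below shows that pattern only. `MY_CDECL`, `compare_u16` and `sort_u16` are hypothetical names made up for the illustration and do not appear in the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the OSM_CDECL macro added in patch 1/3. */
#if defined(_WIN32)
#define MY_CDECL __cdecl
#else
#define MY_CDECL
#endif

/* Comparator handed to qsort(); the macro pins its calling convention. */
static int MY_CDECL compare_u16(const void *p1, const void *p2)
{
    uint16_t a = *(const uint16_t *)p1;
    uint16_t b = *(const uint16_t *)p2;

    /* Yields -1, 0 or 1 without the overflow risk of plain subtraction. */
    return (a > b) - (a < b);
}

/* Sort an array of 16-bit values in ascending order. */
static void sort_u16(uint16_t *vals, size_t n)
{
    assert(vals != NULL);
    qsort(vals, n, sizeof(vals[0]), compare_u16);
}
```

On non-Windows builds the macro expands to nothing, so the same source compiles unchanged there.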
Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 42 ++++++++++++++++++++++-------------------- 1 files changed, 22 insertions(+), 20 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index ba95a0d..054e3c9 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -135,8 +135,8 @@ typedef uint8_t * ftree_fwd_tbl_t; typedef struct ftree_port_t_ { cl_map_item_t map_item; - uint16_t port_num; /* port number on the current node */ - uint16_t remote_port_num; /* port number on the remote node */ + uint8_t port_num; /* port number on the current node */ + uint8_t remote_port_num; /* port number on the remote node */ uint32_t counter_up; /* number of allocated routs upwards */ uint32_t counter_down; /* number of allocated routs downwards */ } ftree_port_t; @@ -212,7 +212,7 @@ typedef struct ftree_fabric_t_ cl_qmap_t hca_tbl; cl_qmap_t sw_tbl; cl_qmap_t sw_by_tuple_tbl; - uint32_t tree_rank; + uint16_t tree_rank; ftree_sw_t ** leaf_switches; uint32_t leaf_switches_num; uint16_t max_hcas_per_leaf; @@ -226,7 +226,7 @@ typedef struct ftree_fabric_t_ ** ***************************************************/ -int +int OSM_CDECL __osm_ftree_compare_switches_by_index( IN const void * p1, IN const void * p2) @@ -247,7 +247,7 @@ __osm_ftree_compare_switches_by_index( /***************************************************/ -int +int OSM_CDECL __osm_ftree_compare_port_groups_by_remote_switch_index( IN const void * p1, IN const void * p2) @@ -401,8 +401,8 @@ __osm_ftree_sw_tbl_element_destroy( static ftree_port_t * __osm_ftree_port_create( - IN uint16_t port_num, - IN uint16_t remote_port_num) + IN uint8_t port_num, + IN uint8_t remote_port_num) { ftree_port_t * p_port = (ftree_port_t *)malloc(sizeof(ftree_port_t)); if (!p_port) @@ -553,8 +553,8 @@ __osm_ftree_port_group_dump( static void __osm_ftree_port_group_add_port( IN ftree_port_group_t * p_group, - IN uint16_t port_num, - IN uint16_t remote_port_num) + IN 
uint8_t port_num, + IN uint8_t remote_port_num) { uint16_t i; ftree_port_t * p_port; @@ -722,8 +722,8 @@ __osm_ftree_sw_get_port_group_by_remote_ static void __osm_ftree_sw_add_port( IN ftree_sw_t * p_sw, - IN uint16_t port_num, - IN uint16_t remote_port_num, + IN uint8_t port_num, + IN uint8_t remote_port_num, IN ib_net16_t base_lid, IN uint8_t lmc, IN ib_net16_t remote_base_lid, @@ -872,8 +872,8 @@ __osm_ftree_hca_get_port_group_by_remote static void __osm_ftree_hca_add_port( IN ftree_hca_t * p_hca, - IN uint16_t port_num, - IN uint16_t remote_port_num, + IN uint8_t port_num, + IN uint8_t remote_port_num, IN ib_net16_t base_lid, IN uint8_t lmc, IN ib_net16_t remote_base_lid, @@ -1799,7 +1799,7 @@ __osm_ftree_fabric_route_upgoing_by_goin /* find the least loaded port of the group (in indexing order) */ p_min_port = NULL; - ports_num = cl_ptr_vector_get_size(&p_group->ports); + ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports); /* ToDo: no need to select a least loaded port for non-main path. Think about optimization. 
*/ for (j = 0; j < ports_num; j++) @@ -1951,7 +1951,7 @@ __osm_ftree_fabric_route_downgoing_by_go { p_group = p_sw->up_port_groups[i]; - ports_num = cl_ptr_vector_get_size(&p_group->ports); + ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports); for (j = 0; j < ports_num; j++) { cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port); @@ -2182,7 +2182,9 @@ __osm_ftree_fabric_route_to_hcas( osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: " "Routing %u dummy HCAs\n", p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); - for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++) + for ( j = 0; + ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); + j++) { /* assign downgoing ports by stepping up */ __osm_ftree_fabric_route_downgoing_by_going_up( @@ -2329,7 +2331,7 @@ __osm_ftree_rank_from_switch( osm_node_t * p_node; osm_node_t * p_remote_node; osm_physp_t * p_osm_port; - uint16_t i; + uint8_t i; cl_list_t bfs_list; ftree_sw_tbl_element_t * p_sw_tbl_element = NULL; @@ -2394,7 +2396,7 @@ __osm_ftree_rank_switches_from_hca( osm_node_t * p_osm_node = p_hca->p_osm_node; osm_node_t * p_remote_osm_node; osm_physp_t * p_osm_port; - static uint16_t i = 0; + static uint8_t i = 0; int res = 0; OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_rank_switches_from_hca); @@ -2493,7 +2495,7 @@ __osm_ftree_fabric_construct_hca_ports( uint8_t remote_node_type; ib_net64_t remote_node_guid; osm_physp_t * p_remote_osm_port; - uint16_t i; + uint8_t i; uint8_t remote_port_num; int res = 0; @@ -2590,7 +2592,7 @@ __osm_ftree_fabric_construct_sw_ports( osm_physp_t * p_remote_osm_port; ftree_direction_t direction; void * p_remote_hca_or_sw; - uint16_t i; + uint8_t i; uint8_t remote_port_num; int res = 0; -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Wed Dec 27 01:05:18 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 11:05:18 +0200 Subject: [openib-general] [PATCH 3/3] osm: 
Changes for windows compatability Message-ID: <4592374E.7020008@dev.mellanox.co.il> Hi Hal. Fixing windows compilation problems. Signed-off-by: Yevgeny Kliteynik --- osm/osmtest/osmtest.c | 11 +++++++---- 1 files changed, 7 insertions(+), 4 deletions(-) diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 0ccc06c..05b1134 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -471,10 +471,13 @@ osmtest_destroy( IN osmtest_t * const p_ { cl_map_item_t *p_item,*p_next_item; + /* Currently there is a problem with IBAL exit flow - memory overrun, + so bypass vendor deletion - it will be cleaned by the Windows OS */ +#ifndef __WIN__ if( p_osmt->p_vendor ) - { osm_vendor_delete( &p_osmt->p_vendor ); - } +#endif + cl_qpool_destroy( &p_osmt->port_pool ); cl_qpool_destroy( &p_osmt->node_pool ); @@ -4922,7 +4925,7 @@ osmtest_informinfo_request( /* as currently no comp mask bits defined for InformInfo!!! */ user.comp_mask = IB_IIR_COMPMASK_SUBSCRIBE; p_inform_info_opt = p_options; - rec.subscribe = p_inform_info_opt->subscribe; + rec.subscribe = (uint8_t)p_inform_info_opt->subscribe; if (p_inform_info_opt->qpn) { rec.g_or_v.generic.qpn_resp_time_val = cl_hton32(p_inform_info_opt->qpn) >> 8; @@ -5601,7 +5604,7 @@ osmtest_validate_against_db( IN osmtest_ #ifdef DUAL_SIDED_RMPP osmv_multipath_req_t request; #endif - int i; + uint8_t i; #endif OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_against_db ); -- 1.4.4.1.GIT From yosefe at voltaire.com Wed Dec 27 02:08:14 2006 From: yosefe at voltaire.com (Yosef Etigin) Date: Wed, 27 Dec 2006 12:08:14 +0200 Subject: [openib-general] [PATCH] ofed 1.2: fix module compilation erros on ia64 Message-ID: <1167214095.27740.13.camel@muscida> Fix compilation errors on ia64 that are caused by the definition of sg_dma_address and sg_dma_len in asm-ia64/pci.h instead of in asm/scatterlist.h, as in other architectures. 
tested on: ia64[sles10]; x86_64 [sles10,rh4] Signed-off by: Yosef Etigin --- diff -urN a/kernel_addons/backport/2.6.11/include/asm/scatterlist.h b/kernel_addons/backport/2.6.11/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.11/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.11/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h b/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.12/include/asm/scatterlist.h b/kernel_addons/backport/2.6.12/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.12/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.12/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN 
a/kernel_addons/backport/2.6.13/include/asm/scatterlist.h b/kernel_addons/backport/2.6.13/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.13/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.13/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h b/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.14/include/asm/scatterlist.h b/kernel_addons/backport/2.6.14/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.14/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.14/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.15/include/asm/scatterlist.h 
b/kernel_addons/backport/2.6.15/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.15/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.15/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.16/include/asm/scatterlist.h b/kernel_addons/backport/2.6.16/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.16/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.16/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h b/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.17/include/asm/scatterlist.h b/kernel_addons/backport/2.6.17/include/asm/scatterlist.h --- 
a/kernel_addons/backport/2.6.17/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.17/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.18/include/asm/scatterlist.h b/kernel_addons/backport/2.6.18/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.18/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.18/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h b/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.19/include/asm/scatterlist.h b/kernel_addons/backport/2.6.19/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.19/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ 
b/kernel_addons/backport/2.6.19/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h b/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h b/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h b/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 
@@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif diff -urN a/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h b/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h --- a/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h 1970-01-01 02:00:00.000000000 +0200 +++ b/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h 2006-12-26 16:07:21.000000000 +0200 @@ -0,0 +1,12 @@ +/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */ +#ifndef BACKPORT_SCATTERLIST_H +#define BACKPORT_SCATTERLIST_H + +#include_next + +#ifdef __ia64__ +#define sg_dma_address(sg) ((sg)->dma_address) +#define sg_dma_len(sg) ((sg)->dma_length) +#endif + +#endif -- Yosef Etigin yosefe at voltaire.com From jsquyres at cisco.com Wed Dec 27 05:13:25 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 Dec 2006 08:13:25 -0500 Subject: [openib-general] Old svn repository access In-Reply-To: <1167160526.29620.7478.camel@hal.voltaire.com> References: <1167160526.29620.7478.camel@hal.voltaire.com> Message-ID: <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com> This is probably my fault; sorry. :-( I advised Sandia that it would be ok to turn off the old server, but I thought that the new server was up and running. Doing some poking around on staging.ofa, I see that the SVN repository is located at file:///data/svn, but I don't see that it's being made available via http[s]. I'll poke around today and see if I can get it up and running via http [s] on svn.openfabrics.org. On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote: > Hi, > > Thought the old svn repository was made RO. When I do a RO > operation to > it, I get the following error: > > svn log | more > (R)eject, accept (t)emporarily or accept (p)ermanently? 
svn: > PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/ > management/diags/src/ibnetdiscover.c' > svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/ > diags/src/ibnetdiscover.c': 405 Method Not Allowed (https:// > openib.org) > > Shouldn't this work ? > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From mst at mellanox.co.il Wed Dec 27 05:24:01 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 27 Dec 2006 15:24:01 +0200 Subject: [openib-general] Old svn repository access In-Reply-To: <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com> References: <1167160526.29620.7478.camel@hal.voltaire.com> <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com> Message-ID: <20061227132401.GN19436@mellanox.co.il> Can the openib.org dns be also changed to point to the new server? Scripts from OFED 1.0 are still using that, I think we should keep them running. Quoting r. Jeff Squyres : Subject: Re: Old svn repository access This is probably my fault; sorry. :-( I advised Sandia that it would be ok to turn off the old server, but I thought that the new server was up and running. Doing some poking around on staging.ofa, I see that the SVN repository is located at file:///data/svn, but I don't see that it's being made available via http[s]. I'll poke around today and see if I can get it up and running via http [s] on svn.openfabrics.org. On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote: > Hi, > > Thought the old svn repository was made RO. When I do a RO > operation to > it, I get the following error: > > svn log | more > (R)eject, accept (t)emporarily or accept (p)ermanently? 
svn: > PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/ > management/diags/src/ibnetdiscover.c' > svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/ > diags/src/ibnetdiscover.c': 405 Method Not Allowed (https:// > openib.org) > > Shouldn't this work ? > > -- Hal > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general -- Jeff Squyres Server Virtualization Business Unit Cisco Systems _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From jsquyres at cisco.com Wed Dec 27 05:37:24 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 Dec 2006 08:37:24 -0500 Subject: [openib-general] Old svn repository access In-Reply-To: <20061227132401.GN19436@mellanox.co.il> References: <1167160526.29620.7478.camel@hal.voltaire.com> <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com> <20061227132401.GN19436@mellanox.co.il> Message-ID: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com> On Dec 27, 2006, at 8:24 AM, Michael S. Tsirkin wrote: > Can the openib.org dns be also changed to point to the new server? > Scripts from OFED 1.0 are still using that, I think we should keep > them running. I don't think we're there yet -- I need to talk to Michael Lee before we make the switch to make openfabrics.org and openib.org point to the new server. What exactly in OFED 1.0 uses the name openib.org -- SVN access? > Quoting r. Jeff Squyres : > Subject: Re: Old svn repository access > > This is probably my fault; sorry. :-( > > I advised Sandia that it would be ok to turn off the old server, but > I thought that the new server was up and running. 
Doing some poking > around on staging.ofa, I see that the SVN repository is located at > file:///data/svn, but I don't see that it's being made available via > http[s]. > > I'll poke around today and see if I can get it up and running via http > [s] on svn.openfabrics.org. > > > > On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote: > >> Hi, >> >> Thought the old svn repository was made RO. When I do a RO >> operation to >> it, I get the following error: >> >> svn log | more >> (R)eject, accept (t)emporarily or accept (p)ermanently? svn: >> PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/ >> management/diags/src/ibnetdiscover.c' >> svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/ >> diags/src/ibnetdiscover.c': 405 Method Not Allowed (https:// >> openib.org) >> >> Shouldn't this work ? >> >> -- Hal >> >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/ >> openib-general > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general > > -- > MST -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From dotanb at dev.mellanox.co.il Wed Dec 27 05:46:06 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 27 Dec 2006 15:46:06 +0200 Subject: [openib-general] [PATCH] [mthca] don't execute the QUERY command in QP is in RESET state Message-ID: <1167227166.6664.2.camel@mtls05.yok.mtl.com> If the QP state is RESET, don't execute the QUERY command (because it will fail). 
Signed-off-by: Dotan Barak
---

Index: gen2_devel_kernel/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- gen2_devel_kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2006-12-24 19:41:56.000000000 +0200
+++ gen2_devel_kernel/drivers/infiniband/hw/mthca/mthca_qp.c	2006-12-25 15:54:49.000000000 +0200
@@ -429,13 +429,18 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	int err;
-	struct mthca_mailbox *mailbox;
+	int err = 0;
+	struct mthca_mailbox *mailbox = NULL;
 	struct mthca_qp_param *qp_param;
 	struct mthca_qp_context *context;
 	int mthca_state;
 	u8 status;
 
+	if (qp->state == IB_QPS_RESET) {
+		qp_attr->qp_state = IB_QPS_RESET;
+		goto done;
+	}
+
 	mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 	if (IS_ERR(mailbox))
 		return PTR_ERR(mailbox);
@@ -454,7 +459,6 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	mthca_state = be32_to_cpu(context->flags) >> 28;
 
 	qp_attr->qp_state = to_ib_qp_state(mthca_state);
-	qp_attr->cur_qp_state = qp_attr->qp_state;
 	qp_attr->path_mtu = context->mtu_msgmax >> 5;
 	qp_attr->path_mig_state =
 		to_ib_mig_state((be32_to_cpu(context->flags) >> 11) & 0x3);
@@ -464,11 +468,6 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	qp_attr->dest_qp_num = be32_to_cpu(context->remote_qpn) & 0xffffff;
 	qp_attr->qp_access_flags =
 		to_ib_qp_access_flags(be32_to_cpu(context->params2));
-	qp_attr->cap.max_send_wr = qp->sq.max;
-	qp_attr->cap.max_recv_wr = qp->rq.max;
-	qp_attr->cap.max_send_sge = qp->sq.max_gs;
-	qp_attr->cap.max_recv_sge = qp->rq.max_gs;
-	qp_attr->cap.max_inline_data = qp->max_inline_data;
 
 	if (qp->transport == RC || qp->transport == UC) {
 		to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path);
@@ -495,7 +494,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	qp_attr->retry_cnt = (be32_to_cpu(context->params1) >> 16) & 0x7;
 	qp_attr->rnr_retry = context->pri_path.rnr_retry >> 5;
 	qp_attr->alt_timeout = context->alt_path.ackto >> 3;
-	qp_init_attr->cap = qp_attr->cap;
+
+done:
+	qp_attr->cur_qp_state = qp_attr->qp_state;
+	qp_attr->cap.max_send_wr = qp->sq.max;
+	qp_attr->cap.max_recv_wr = qp->rq.max;
+	qp_attr->cap.max_send_sge = qp->sq.max_gs;
+	qp_attr->cap.max_recv_sge = qp->rq.max_gs;
+	qp_attr->cap.max_inline_data = qp->max_inline_data;
+
+	qp_init_attr->cap = qp_attr->cap;
 
 out:
 	mthca_free_mailbox(dev, mailbox);

From ogerlitz at voltaire.com Wed Dec 27 06:03:38 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 27 Dec 2006 16:03:38 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458E7402.4000106@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il>
Message-ID: <45927D3A.9030502@voltaire.com>

Eitan Zahavi wrote:
> Hi Or,
>
> Sorry it took me a while.
>
> According to the IBTA spec:
> 1. In order for MTU and MTUSelector to have any effect their component
> mask bits MUST be set to 1 in the query
> 2. Behavior of the SM is defined with small "freedom" to choose between
> multiple matching MTU values if they exist.
> 3. The table below summarizes all options:
>
> Assuming the value M represents the lowest MTU on the path
> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> R represents the MTU value in the request. Similarly R-1 is one below R
> and R+1 is one above R.
>
> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
> -----------------------------------------------------------------------------------------
> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR

Hi Eitan,

Not that it matters too much for the decision whether to push this into
the Open SM, but the SM group here is positive w.r.t. the approach and
patch you have sent.

However, there are some clarifications I would be happy to get:

1st maybe it's clear to everyone except me, but what do you mean by /ERR
in the table above, is it what opensm would return before the patch you
suggested?

2nd can you post the open sm tavor quirk patch?

3rd Eitan/Michael: what is the bigger picture here? what is the
dependency between these four patches

+1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
+2 osm: tavor quirk
+3 IB/rdmacm: tavor quirk
+4 IB/ipoib: use appropriate mtu selector for path queries

for example is it correct that:

if [2] is applied on the SA side then [4] must be applied on ipoib else
it will get 1K mtu on its path query?

if [2] is not applied on the SA side, then [3] is useless?

Or.

From jsquyres at cisco.com Wed Dec 27 06:15:15 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 09:15:15 -0500
Subject: [openib-general] Old svn repository access
In-Reply-To: <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
References: <1167160526.29620.7478.camel@hal.voltaire.com> <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
Message-ID: <10526AA0-00ED-4A25-84DD-BD70E09277AF@cisco.com>

After poking around some more, I see that SVN https access was half set up.
I've fixed it -- you can now access SVN via authenticated channels at:

    https://svn.openfabrics.org/svn/openib/

And anonymous channels (commits are disallowed here):

    http://svn.openfabrics.org/svn/openib/

Please let me know if you have any problems with it; sorry for the
mix-up. :-(

More details on apache and SVN coming soon.

On Dec 27, 2006, at 8:13 AM, Jeff Squyres wrote:

> This is probably my fault; sorry. :-(
>
> I advised Sandia that it would be ok to turn off the old server, but
> I thought that the new server was up and running. Doing some poking
> around on staging.ofa, I see that the SVN repository is located at
> file:///data/svn, but I don't see that it's being made available via
> http[s].
>
> I'll poke around today and see if I can get it up and running via http
> [s] on svn.openfabrics.org.
>
>
>
> On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote:
>
>> Hi,
>>
>> Thought the old svn repository was made RO. When I do a RO
>> operation to
>> it, I get the following error:
>>
>> svn log | more
>> (R)eject, accept (t)emporarily or accept (p)ermanently? svn:
>> PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/
>> management/diags/src/ibnetdiscover.c'
>> svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/
>> diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://
>> openib.org)
>>
>> Shouldn't this work ?
>>
>> -- Hal
>
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From eitan at mellanox.co.il Wed Dec 27 06:21:47 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 27 Dec 2006 16:21:47 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45927D3A.9030502@voltaire.com>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com>
Message-ID: <4592817B.3030700@mellanox.co.il>

Or Gerlitz wrote:
> Eitan Zahavi wrote:
>
>> Hi Or,
>>
>> Sorry it took me a while.
>>
>> According to the IBTA spec:
>> 1. In order for MTU and MTUSelector to have any effect their component
>> mask bits MUST be set to 1 in the query
>> 2. Behavior of the SM is defined with small "freedom" to choose between
>> multiple matching MTU values if they exist.
>> 3. The table below summarizes all options:
>>
>> Assuming the value M represents the lowest MTU on the path
>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>> R represents the MTU value in the request. Similarly R-1 is one below R
>> and R+1 is one above R.
>>
>> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>> -----------------------------------------------------------------------------------------
>> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
>> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
>> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
>> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
>>
>
> Hi Eitan,
>
> Not that it matters too much for the decision whether to push this into
> the Open SM, but the SM group here is positive w.r.t. the approach and
> patch you have sent.
>
> However, there are some clarifications I would be happy to get:
>
> 1st maybe it's clear to everyone except me, but what do you mean by /ERR
> in the table above, is it what opensm would return before the patch you
> suggested?
>
Hi Or,
By ERR I mean that the path being evaluated is rejected from being
included in the paths group of the response to the provided query.
> 2nd can you post the open sm tavor quirk patch?
>
What do you mean? The old patch introducing the "opensm quirk" mode?
It is GIT versions: 86077144ed956ddb32a0f8d067d5bb00fd564ac6 followed by
03e3b3a6fa934202c0f4270a2c69d64ac486b1ca
or SVN: 9497 followed by 9518
> 3rd Eitan/Michael: what is the bigger picture here? what is the
> dependency between these four patches
>
> +1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
>
Required - OpenSM broken otherwise
> +2 osm: tavor quirk
>
Required - if you want to rely on OpenSM for selecting 1K MTU for Tavor
paths if it has the freedom to do so
> +3 IB/rdmacm: tavor quirk
> +4 IB/ipoib: use appropriate mtu selector for path queries
>
I will let Michael answer that
> for example is it correct that:
>
> if [2] is applied on the SA side then [4] must be applied on ipoib else
> it will get 1K mtu on its path query?
>
> if [2] is not applied on the SA side, then [3] is useless?
>
> Or.
>

From vlad at dev.mellanox.co.il Wed Dec 27 06:34:36 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 27 Dec 2006 16:34:36 +0200
Subject: [openib-general] [PATCH] [MINOR] ipoibtools: fix compilation errors on ppc64
In-Reply-To: <1167148716.7006.17.camel@muscida>
References: <1167148716.7006.17.camel@muscida>
Message-ID: <4592847C.5030408@dev.mellanox.co.il>

Applied.
Thanks,

Regards,
Vladimir

Yosef Etigin wrote:
> Fix compilation errors of ipoibtools on ppc64 caused by
> overriding CFLAGS in the Makefile.
>
> Signed-off-by: Yosef Etigin
>
> ---
> diff -ur a/src/userspace/ipoibtools/iproute2/Makefile b/src/userspace/ipoibtools/iproute2/Makefile
> --- a/src/userspace/ipoibtools/iproute2/Makefile 2006-12-25 16:18:43.000000000 +0200
> +++ b/src/userspace/ipoibtools/iproute2/Makefile 2006-12-25 15:54:40.000000000 +0200
> @@ -22,7 +22,7 @@
>  CC = gcc
>  HOSTCC = gcc
>  CCOPTS = -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall
> -CFLAGS = $(CCOPTS) -I../include $(DEFINES)
> +CFLAGS += $(CCOPTS) -I../include $(DEFINES)
>  YACCFLAGS = -d -t -v
>
>  LDLIBS += -L../lib -lnetlink -lutil
>
> --
> Yosef Etigin
> Voltaire

From jsquyres at cisco.com Wed Dec 27 07:28:52 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 10:28:52 -0500
Subject: [openib-general] DNS: "git.openfabrics.org" now exists
Message-ID: <6440C91B-ED19-4EFA-B9DC-8EC7DDBA5E54@cisco.com>

The name "git.openfabrics.org" now exists in DNS and points to the new server.
I would strongly encourage everyone to start using
"git.openfabrics.org" as the hostname to access your git repositories
(vs. "staging.openfabrics.org"). Relevant web pages, documentation,
etc. should also be updated with this new hostname.

The name "staging.openfabrics.org" was intended to be temporary. I
propose that it go away at the end of Q1'07 (March 31 2007).

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From jsquyres at cisco.com Wed Dec 27 07:35:10 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 10:35:10 -0500
Subject: [openib-general] New server: Apache / SSL / IP addresses
Message-ID: <37E1A658-F4B1-405C-81AA-C31E563D454B@cisco.com>

We have the following services on the OFA server that use
authentication, and therefore use Apache's SSL services:

- subversion
- bugzilla
- tiki

Due to the nature of SSL connections, you can only have one SSL vhost
per IP address. Specifically, you cannot have https://foo.example.com
and https://bar.example.com be distinct vhosts on the same IP address.
This fact, along with the fact that we currently only have one IP
address active on the new server, prevents the use of multiple
.openfabrics.org hostnames for different SSL/authenticated services
through Apache.

johncompanies.com lists the hosted servers plan as coming with 5 IP
addresses. Is this the plan that we got? If so, can we request 3 of
our 4 additional IP addresses? (who is the OFA contact with
johncompanies.com?)

I propose the following:

IP address 1 (146.246.248.81):
- http://www.openfabrics.org/ -- main web site
- https://www.openfabrics.org/ -- redirects back to http
- http://builds.openfabrics.org/ -- nightly builds
- http://git.openfabrics.org/ -- gitweb access
  ==> Also use git://git.openfabrics.org/ for normal git access (not
      through Apache, of course)
- http://.openfabrics.org/ -- ...any other non-authenticated vhost

IP address 2:
- http://bugs.openfabrics.org/ -- redirects to https
- https://bugs.openfabrics.org/ -- all bugzilla access

IP address 3:
- http://wiki.openfabrics.org/ -- read only wiki access
- https://wiki.openfabrics.org/ -- authenticated wiki access
  (I don't know if it's possible to separate these two with tiki; if
  not, just have http redirect to https)

IP address 4:
- http://svn.openfabrics.org/ -- read only SVN access
- https://svn.openfabrics.org/ -- authenticated SVN access
  ==> this vhost to possibly go away end of Q1'07

Comments?

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From kliteyn at dev.mellanox.co.il Wed Dec 27 07:46:55 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 17:46:55 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability
Message-ID: <4592956F.3020501@dev.mellanox.co.il>

Hi Hal.
Fixing windows compilation problems [V2 - Previous patch had an error] Signed-off-by: Yevgeny Kliteynik --- osm/include/iba/ib_types.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index 723e8b9..ec65b64 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -59,9 +59,10 @@ BEGIN_C_DECLS #define OSM_EXPORT __declspec(dllimport) #endif #define OSM_API __stdcall + #define OSM_CDECL __cdecl #else #define OSM_EXPORT extern #define OSM_API + #define OSM_CDECL #define __ptr64 #endif -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Wed Dec 27 07:47:23 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 17:47:23 +0200 Subject: [openib-general] [PATCH] osm: additional check of tree topology Message-ID: <4592958B.7030102@dev.mellanox.co.il> Hi Hal As we've discussed before - added check for fat-tree topology to be at least of rank 2. -- Yevgeny Signed-off-by: Yevgeny Kliteynik Subject: [PATCH] Added additional check of tree topology Signed-off-by: Yevgeny Kliteynik --- osm/opensm/osm_ucast_ftree.c | 5 +++++ 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c index 054e3c9..0473135 100644 --- a/osm/opensm/osm_ucast_ftree.c +++ b/osm/opensm/osm_ucast_ftree.c @@ -2877,6 +2877,11 @@ __osm_ftree_construct_fabric( "Fabric rank is %u (>%u) - " "fat-tree routing falls back to default routing\n", __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK); + else if (__osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK) + osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, + "Fabric rank is %u (<%u) - " + "fat-tree routing falls back to default routing\n", + __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK); status = -1; goto Exit; } -- 1.4.4.1.GIT From kliteyn at dev.mellanox.co.il Wed Dec 27 08:19:23 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 
18:19:23 +0200 Subject: [openib-general] [PATCH] osm: fat-tree documentation Message-ID: <45929D0B.3090308@dev.mellanox.co.il> Hi Hal. Added fat-tree routing details and some cosmetics in the txt files. -- Yevgeny Signed-off-by: Yevgeny Kliteynik --- osm/doc/current-routing.txt | 57 ++++++++++++++++++++++++++++++++++++++---- osm/doc/modular-routing.txt | 4 +- 2 files changed, 53 insertions(+), 8 deletions(-) diff --git a/osm/doc/current-routing.txt b/osm/doc/current-routing.txt index e58ae1f..da050c6 100644 --- a/osm/doc/current-routing.txt +++ b/osm/doc/current-routing.txt @@ -1,5 +1,5 @@ Current OpenSM Routing -12/20/06 +12/27/06 OpenSM offers three routing engines: @@ -11,11 +11,10 @@ node, but it is constrained to ranking r if the subnet is not a pure Fat Tree, and deadlock may occur due to a loop in the subnet. -3. Fat Tree Unicast routing algorithm - this algorithm optimizes routing -for congestion-free "shift" communication pattern. -It should be chosen if a subnet is a symmetrical Fat Trees of various types, -not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio. -Similar to UPDN, Fat Tree routing is constrained to ranking rules. +3. Fat-tree Unicast routing algorithm - this algorithm optimizes routing +Of fat-trees for congestion-free "shift" communication pattern. +It should be chosen if a subnet is a symmetrical fat-tree. +Similar to UPDN, Fat-tree routing is credit-loop-free. OpenSM now also offers a file method which can load routes from a table. See modular-routing.txt for more information on this. @@ -73,6 +72,7 @@ switches will be skipped. Multicast is n Min Hop Algorithm +----------------- The Min Hop algorithm is invoked when neither UPDN or the file method are specified. @@ -91,6 +91,9 @@ port GUID. The latter is supplied by: LMC awareness routes based on (remote) system or switch basis. 
+UPDN Routing Algorithm +---------------------- + Purpose of UPDN Algorithm The UPDN algorithm is designed to prevent deadlocks from occurring in loops @@ -151,3 +154,45 @@ To learn more about deadlock-free routin "Deadlock Free Message Routing in Multiprocessor Interconnection Networks" by William J Dally and Charles L Seitz (1985). + +Fat-tree Routing Algorithm +-------------------------- + +Purpose: + +The fat-tree algorithm optimizes routing for "shift" communication pattern. +It should be chosen if a subnet is a symmetrical fat-tree of various types. +It supports not just K-ary-N-Trees, by handling for non-constant K, +cases where not all leafs (HCAs) are present, any CBB ratio. +As in UPDN, fat-tree also prevents credit-loop-deadlocks. +Fat-tree algorithm supports topologies that comply with the following rules: + - Tree rank should be between two and eight (inclusively) + - Switches of the same rank should have the same number + of UP-going port groups*, unless they are root switches, + in which case the shouldn't have UP-going ports at all. + - Switches of the same rank should have the same number + of DOWN-going port groups, unless they are leaf switches. + - Switches of the same rank should have the same number + of ports in each UP-going port group. + - Switches of the same rank should have the same number + of ports in each DOWN-going port group. +*ports that are connected to the same remote switch are referenced as 'port group'. + +Note that although fat-tree algorithm supports trees with non-integer CBB +ratio, the routing will not be as balanced as in case of integer CBB ratio. +In addition to this, although the algorithm allows leaf switches to have any +number of HCAs, the closer the tree to be fully populated, the more effective +the "shift" communication pattern will be. + +The algorithm also dumps HCA ordering file (osm-ftree-ca-order.dump) in the +same directory where the OpenSM log resides. 
This ordering file provides the
+HCA order that may be used to create efficient communication pattern, that
+will match the routing tables.
+
+
+Usage:
+
+Activation through OpenSM
+
+Use '-R ftree' option to activate the fat-tree algorithm.
+
diff --git a/osm/doc/modular-routing.txt b/osm/doc/modular-routing.txt
index 3708e1b..86677d0 100644
--- a/osm/doc/modular-routing.txt
+++ b/osm/doc/modular-routing.txt
@@ -6,8 +6,8 @@ for ease of "plugging" new routing modul
 Currently, only unicast callbacks are supported. Multicast can be added later.
 
-One existing routing module is up-down "updn", which may be
-activate with '-R updn' option (instead of old '-u').
+One of existing routing modules is up-down "updn", which may
+be activate with '-R updn' option (instead of old '-u').
 
 General usage is:
 $ opensm -R 'module-name'
--
1.4.4.1.GIT

From halr at voltaire.com Wed Dec 27 08:37:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 11:37:26 -0500
Subject: [openib-general] [PATCH 0/4] OpenSM: Add optional SA SwitchInfoRecord support
Message-ID: <1167237443.29620.74762.camel@hal.voltaire.com>

OpenSM: Add optional SA SwitchInfoRecord support

This patch adds support for the optional SA SwitchInfoRecord.
Signed-off-by: Hal Rosenstock From halr at voltaire.com Wed Dec 27 08:41:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 11:41:20 -0500 Subject: [openib-general] [PATCH 1/4] OpenSM/ib_types.h: Add needed SwitchInfoRecord component masks Message-ID: <1167237447.29620.74764.camel@hal.voltaire.com> OpenSM/ib_types.h: Add needed SwitchInfoRecord component masks Signed-off-by: Hal Rosenstock diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index 91304e2..897e839 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -2361,6 +2361,10 @@ typedef struct _ib_path_rec #define IB_PKEY_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) #define IB_PKEY_COMPMASK_PORT (CL_HTON64(((uint64_t)1)<<2)) +/* Switch Info Record Masks */ +#define IB_SWIR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) +#define IB_SWIR_COMPMASK_RESERVED1 (CL_HTON64(((uint64_t)1)<<1)) + /* LFT Record Masks */ #define IB_LFTR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_LFTR_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) From halr at voltaire.com Wed Dec 27 08:41:25 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 11:41:25 -0500 Subject: [openib-general] [PATCH 2/4] OpenSM: Add optional SA SwitchInfoRecord support Message-ID: <1167237674.29620.74964.camel@hal.voltaire.com> OpenSM: Add optional SA SwitchInfoRecord support Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_sa_sw_info_record.h b/osm/include/opensm/osm_sa_sw_info_record.h new file mode 100644 index 0000000..c6b421f --- /dev/null +++ b/osm/include/opensm/osm_sa_sw_info_record.h @@ -0,0 +1,306 @@ +/* + * Copyright (c) 2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Declaration of osm_sir_rcv_t. + * This object represents the SwitchInfo Receiver object. + * attribute from a switch node. + * This object is part of the OpenSM family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +#ifndef _OSM_SIR_RCV_H_ +#define _OSM_SIR_RCV_H_ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/Switch Info Receiver +* NAME +* Switch Info Receiver +* +* DESCRIPTION +* The Switch Info Receiver object encapsulates the information +* needed to receive the SwitchInfo attribute from a switch node. +* +* The Switch Info Receiver object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ + +/****s* OpenSM: Switch Info Receiver/osm_sir_rcv_t +* NAME +* osm_sir_rcv_t +* +* DESCRIPTION +* Switch Info Receiver structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_sir_rcv +{ + osm_subn_t *p_subn; + osm_sa_resp_t *p_resp; + osm_mad_pool_t *p_mad_pool; + osm_log_t *p_log; + osm_req_t *p_req; + osm_state_mgr_t *p_state_mgr; + cl_plock_t *p_lock; + cl_qlock_pool_t pool; +} osm_sir_rcv_t; +/* +* FIELDS +* p_subn +* Pointer to the Subnet object for this subnet. +* +* p_log +* Pointer to the log object. +* +* p_req +* Pointer to the Request object. +* +* p_state_mgr +* Pointer to the State Manager object. +* +* p_lock +* Pointer to the serializing lock. +* +* SEE ALSO +* Switch Info Receiver object +*********/ + +/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_construct +* NAME +* osm_sir_rcv_construct +* +* DESCRIPTION +* This function constructs a Switch Info Receiver object. 
+* +* SYNOPSIS +*/ +void osm_sir_rcv_construct( + IN osm_sir_rcv_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to a Switch Info Receiver object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_sir_rcv_init, osm_sir_rcv_destroy, +* and osm_sir_rcv_is_inited. +* +* Calling osm_sir_rcv_construct is a prerequisite to calling any other +* method except osm_sir_rcv_init. +* +* SEE ALSO +* Switch Info Receiver object, osm_sir_rcv_init, +* osm_sir_rcv_destroy, osm_sir_rcv_is_inited +*********/ + +/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_destroy +* NAME +* osm_sir_rcv_destroy +* +* DESCRIPTION +* The osm_sir_rcv_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void osm_sir_rcv_destroy( + IN osm_sir_rcv_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* Switch Info Receiver object. +* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_sir_rcv_construct or osm_sir_rcv_init. +* +* SEE ALSO +* Switch Info Receiver object, osm_sir_rcv_construct, +* osm_sir_rcv_init +*********/ + +/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_init +* NAME +* osm_sir_rcv_init +* +* DESCRIPTION +* The osm_sir_rcv_init function initializes a +* Switch Info Receiver object for use. +* +* SYNOPSIS +*/ +ib_api_status_t osm_sir_rcv_init( + IN osm_sir_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_sir_rcv_t object to initialize. +* +* p_resp +* [in] Pointer to the SA Responder object. +* +* p_mad_pool +* [in] Pointer to the mad pool. 
+* +* p_subn +* [in] Pointer to the Subnet object for this subnet. +* +* p_log +* [in] Pointer to the log object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* +* RETURN VALUES +* IB_SUCCESS if the Switch Info Receiver object was initialized +* successfully. +* +* NOTES +* Allows calling other Switch Info Receiver methods. +* +* SEE ALSO +* Switch Info Receiver object, osm_sir_rcv_construct, +* osm_sir_rcv_destroy, osm_sir_rcv_is_inited +*********/ + +/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_is_inited +* NAME +* osm_sir_rcv_is_inited +* +* DESCRIPTION +* Indicates if the object has been initialized with osm_sir_rcv_init. +* +* SYNOPSIS +*/ +boolean_t osm_sir_rcv_is_inited( + IN const osm_sir_rcv_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_sir_rcv_t object. +* +* RETURN VALUES +* TRUE if the object was initialized successfully, +* FALSE otherwise. +* +* NOTES +* Either osm_sir_rcv_construct or osm_sir_rcv_init must be +* called before using this function. +* +* SEE ALSO +* Switch Info Receiver object, osm_sir_rcv_construct, +* osm_sir_rcv_init +*********/ + +/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_process +* NAME +* osm_sir_rcv_process +* +* DESCRIPTION +* Process the SwitchInfo attribute. +* +* SYNOPSIS +*/ +void osm_sir_rcv_process( + IN osm_sir_rcv_t* const p_ctrl, + IN const osm_madw_t* const p_madw ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_sir_rcv_t object. +* +* p_madw +* [in] Pointer to the MAD Wrapper containing the MAD +* that contains the node's SwitchInfo attribute. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* This function processes a SwitchInfo attribute.
+* +* SEE ALSO +* Switch Info Receiver, Switch Info Response Controller +*********/ + +END_C_DECLS + +#endif /* _OSM_SIR_RCV_H_ */ diff --git a/osm/include/opensm/osm_sa_sw_info_record_ctrl.h b/osm/include/opensm/osm_sa_sw_info_record_ctrl.h new file mode 100644 index 0000000..b58654f --- /dev/null +++ b/osm/include/opensm/osm_sa_sw_info_record_ctrl.h @@ -0,0 +1,259 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Declaration of osm_sir_rcv_ctrl_t. 
+ * This object represents a controller that receives the IBA SwitchInfo + * attribute from a switch node. + * This object is part of the OpenSM family of objects. + * + * Environment: + * Linux User Mode + * + */ + +#ifndef _OSM_SIR_RCV_CTRL_H_ +#define _OSM_SIR_RCV_CTRL_H_ + +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/Switch Info Receive Controller +* NAME +* Switch Info Receive Controller +* +* DESCRIPTION +* The Switch Info Receive Controller object encapsulates the information +* needed to receive the SwitchInfo attribute from a switch node. +* +* The Switch Info Receive Controller object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ + +/****s* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_t +* NAME +* osm_sir_rcv_ctrl_t +* +* DESCRIPTION +* Switch Info Receive Controller structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_sir_rcv_ctrl +{ + osm_sir_rcv_t *p_rcv; + osm_log_t *p_log; + cl_dispatcher_t *p_disp; + cl_disp_reg_handle_t h_disp; +} osm_sir_rcv_ctrl_t; +/* +* FIELDS +* p_rcv +* Pointer to the Switch Info Receiver object. +* +* p_log +* Pointer to the log object. +* +* p_disp +* Pointer to the Dispatcher. +* +* h_disp +* Handle returned from dispatcher registration. +* +* SEE ALSO +* Switch Info Receive Controller object +* Switch Info Receiver object +*********/ + +/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_construct +* NAME +* osm_sir_rcv_ctrl_construct +* +* DESCRIPTION +* This function constructs a Switch Info Receive Controller object. 
+* +* SYNOPSIS +*/ +void osm_sir_rcv_ctrl_construct( + IN osm_sir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to a Switch Info Receive Controller +* object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_sir_rcv_ctrl_init, osm_sir_rcv_ctrl_destroy, +* and osm_sir_rcv_ctrl_is_inited. +* +* Calling osm_sir_rcv_ctrl_construct is a prerequisite to calling any +* other method except osm_sir_rcv_ctrl_init. +* +* SEE ALSO +* Switch Info Receive Controller object, osm_sir_rcv_ctrl_init, +* osm_sir_rcv_ctrl_destroy, osm_sir_rcv_ctrl_is_inited +*********/ + +/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_destroy +* NAME +* osm_sir_rcv_ctrl_destroy +* +* DESCRIPTION +* The osm_sir_rcv_ctrl_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void osm_sir_rcv_ctrl_destroy( + IN osm_sir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* Switch Info Receive Controller object. +* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_sir_rcv_ctrl_construct or osm_sir_rcv_ctrl_init. +* +* SEE ALSO +* Switch Info Receive Controller object, osm_sir_rcv_ctrl_construct, +* osm_sir_rcv_ctrl_init +*********/ + +/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_init +* NAME +* osm_sir_rcv_ctrl_init +* +* DESCRIPTION +* The osm_sir_rcv_ctrl_init function initializes a +* Switch Info Receive Controller object for use. 
+* +* SYNOPSIS +*/ +ib_api_status_t osm_sir_rcv_ctrl_init( + IN osm_sir_rcv_ctrl_t* const p_ctrl, + IN osm_sir_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_sir_rcv_ctrl_t object to initialize. +* +* p_rcv +* [in] Pointer to an osm_sir_rcv_t object. +* +* p_log +* [in] Pointer to the log object. +* +* p_disp +* [in] Pointer to the OpenSM central Dispatcher. +* +* RETURN VALUES +* IB_SUCCESS if the Switch Info Receive Controller object was initialized +* successfully. +* +* NOTES +* Allows calling other Switch Info Receive Controller methods. +* +* SEE ALSO +* Switch Info Receive Controller object, osm_sir_rcv_ctrl_construct, +* osm_sir_rcv_ctrl_destroy, osm_sir_rcv_ctrl_is_inited +*********/ + +/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_is_inited +* NAME +* osm_sir_rcv_ctrl_is_inited +* +* DESCRIPTION +* Indicates if the object has been initialized with osm_sir_rcv_ctrl_init. +* +* SYNOPSIS +*/ +boolean_t osm_sir_rcv_ctrl_is_inited( + IN const osm_sir_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_sir_rcv_ctrl_t object. +* +* RETURN VALUES +* TRUE if the object was initialized successfully, +* FALSE otherwise. +* +* NOTES +* Either osm_sir_rcv_ctrl_construct or osm_sir_rcv_ctrl_init must be +* called before using this function. +* +* SEE ALSO +* Switch Info Receive Controller object, osm_sir_rcv_ctrl_construct, +* osm_sir_rcv_ctrl_init +*********/ + +END_C_DECLS + +#endif /* _OSM_SIR_RCV_CTRL_H_ */ diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c new file mode 100644 index 0000000..2da30ba --- /dev/null +++ b/osm/opensm/osm_sa_sw_info_record.c @@ -0,0 +1,530 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of osm_sir_rcv_t. + * This object represents the SwitchInfo Receiver object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define OSM_SIR_RCV_POOL_MIN_SIZE 32 +#define OSM_SIR_RCV_POOL_GROW_SIZE 32 + +typedef struct _osm_sir_item +{ + cl_pool_item_t pool_item; + ib_switch_info_record_t rec; +} osm_sir_item_t; + +typedef struct _osm_sir_search_ctxt +{ + const ib_switch_info_record_t* p_rcvd_rec; + ib_net64_t comp_mask; + cl_qlist_t* p_list; + osm_sir_rcv_t* p_rcv; + const osm_physp_t* p_req_physp; +} osm_sir_search_ctxt_t; + +/********************************************************************** + **********************************************************************/ +void +osm_sir_rcv_construct( + IN osm_sir_rcv_t* const p_rcv ) +{ + memset( p_rcv, 0, sizeof(*p_rcv) ); + cl_qlock_pool_construct( &p_rcv->pool ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_sir_rcv_destroy( + IN osm_sir_rcv_t* const p_rcv ) +{ + OSM_LOG_ENTER( p_rcv->p_log, osm_sir_rcv_destroy ); + cl_qlock_pool_destroy( &p_rcv->pool ); + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_sir_rcv_init( + IN osm_sir_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ) +{ + ib_api_status_t status; + + OSM_LOG_ENTER( p_log, osm_sir_rcv_init ); + + osm_sir_rcv_construct( p_rcv ); + + p_rcv->p_log = p_log; + p_rcv->p_subn = p_subn; + p_rcv->p_lock = p_lock; + p_rcv->p_resp = p_resp; + p_rcv->p_mad_pool = p_mad_pool; + + status = cl_qlock_pool_init( &p_rcv->pool, + OSM_SIR_RCV_POOL_MIN_SIZE, + 0, + OSM_SIR_RCV_POOL_GROW_SIZE, + 
sizeof(osm_sir_item_t), + NULL, NULL, NULL ); + + OSM_LOG_EXIT( p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static ib_api_status_t +__osm_sir_rcv_new_sir( + IN osm_sir_rcv_t* const p_rcv, + IN const osm_switch_t* const p_sw, + IN cl_qlist_t* const p_list, + IN ib_net16_t const lid ) +{ + osm_sir_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_sir_rcv_new_sir ); + + p_rec_item = (osm_sir_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sir_rcv_new_sir: ERR 5308: " + "cl_qlock_pool_get failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sir_rcv_new_sir: " + "New SwitchInfoRecord: lid 0x%X\n", + cl_ntoh16( lid ) + ); + } + + memset( &p_rec_item->rec, 0, sizeof(ib_switch_info_record_t) ); + + p_rec_item->rec.lid = lid; + p_rec_item->rec.switch_info = p_sw->switch_info; + + cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static osm_port_t* +__osm_sir_get_port_by_guid( + IN osm_sir_rcv_t* const p_rcv, + IN uint64_t port_guid ) +{ + osm_port_t* p_port; + + CL_PLOCK_ACQUIRE(p_rcv->p_lock); + + p_port = (osm_port_t *)cl_qmap_get(&p_rcv->p_subn->port_guid_tbl, + port_guid); + if (p_port == (osm_port_t *)cl_qmap_end(&p_rcv->p_subn->port_guid_tbl)) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sir_get_port_by_guid ERR 5309: " + "Invalid port GUID 0x%016" PRIx64 "\n", + port_guid ); + p_port = NULL; + } + + CL_PLOCK_RELEASE(p_rcv->p_lock); + return 
p_port; +} + +/********************************************************************** + **********************************************************************/ +static void +__osm_sir_rcv_create_sir( + IN osm_sir_rcv_t* const p_rcv, + IN const osm_switch_t* const p_sw, + IN cl_qlist_t* const p_list, + IN ib_net16_t const match_lid, + IN const osm_physp_t* const p_req_physp ) +{ + osm_port_t* p_port; + const osm_physp_t* p_physp; + uint16_t match_lid_ho; + uint16_t min_lid_ho; + uint16_t max_lid_ho; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_sir_rcv_create_sir ); + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sir_rcv_create_sir: " + "Looking for SwitchInfoRecord with LID: 0x%X\n", + cl_ntoh16( match_lid ) + ); + } + + /* In switches, the port guid is the node guid. */ + p_port = + __osm_sir_get_port_by_guid( p_rcv, p_sw->p_node->node_info.port_guid ); + if (! p_port) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sir_rcv_create_sir: ERR 530A: " + "Failed to find Port by Node Guid:0x%016" PRIx64 + "\n", + cl_ntoh64( p_sw->p_node->node_info.node_guid ) + ); + goto Exit; + } + + /* check that the requester physp and the current physp are under + the same partition. */ + p_physp = osm_port_get_default_phys_ptr( p_port ); + if (! p_physp) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_sir_rcv_create_sir: ERR 530B: " + "Failed to find default physical Port by Node Guid:0x%016" PRIx64 + "\n", + cl_ntoh64( p_sw->p_node->node_info.node_guid ) + ); + goto Exit; + } + if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_physp )) + goto Exit; + + /* get the LID range (host order) of the switch's port 0 */ + osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); + + match_lid_ho = cl_ntoh16( match_lid ); + if( match_lid_ho ) + { + /* + We validate that the lid belongs to this switch.
+ */ + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_sir_rcv_create_sir: " + "Comparing LID: 0x%X <= 0x%X <= 0x%X\n", + min_lid_ho, match_lid_ho, max_lid_ho + ); + } + + if ( match_lid_ho < min_lid_ho || match_lid_ho > max_lid_ho ) + goto Exit; + + } + + __osm_sir_rcv_new_sir( p_rcv, p_sw, p_list, osm_port_get_base_lid(p_port) ); + +Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +static void +__osm_sir_rcv_by_comp_mask( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + const osm_sir_search_ctxt_t* const p_ctxt = (osm_sir_search_ctxt_t *)context; + const osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + const ib_switch_info_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec; + const osm_physp_t* const p_req_physp = p_ctxt->p_req_physp; + osm_sir_rcv_t* const p_rcv = p_ctxt->p_rcv; + ib_net64_t const comp_mask = p_ctxt->comp_mask; + ib_net16_t match_lid = 0; + + OSM_LOG_ENTER( p_ctxt->p_rcv->p_log, __osm_sir_rcv_by_comp_mask ); + + osm_dump_switch_info( + p_ctxt->p_rcv->p_log, + &p_sw->switch_info, + OSM_LOG_VERBOSE ); + + if( comp_mask & IB_SWIR_COMPMASK_LID ) + match_lid = p_rcvd_rec->lid; + + __osm_sir_rcv_create_sir( p_rcv, p_sw, p_ctxt->p_list, + match_lid, p_req_physp ); + + OSM_LOG_EXIT( p_ctxt->p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_sir_rcv_process( + IN osm_sir_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) +{ + const ib_sa_mad_t* p_rcvd_mad; + const ib_switch_info_record_t* p_rcvd_rec; + ib_switch_info_record_t* p_resp_rec; + cl_qlist_t rec_list; + osm_madw_t* p_resp_madw; + ib_sa_mad_t* p_resp_sa_mad; + uint32_t num_rec, pre_trim_num_rec; +#ifndef VENDOR_RMPP_SUPPORT + uint32_t 
trim_num_rec; +#endif + uint32_t i; + osm_sir_search_ctxt_t context; + osm_sir_item_t* p_rec_item; + ib_api_status_t status; + osm_physp_t* p_req_physp; + + CL_ASSERT( p_rcv ); + + OSM_LOG_ENTER( p_rcv->p_log, osm_sir_rcv_process ); + + CL_ASSERT( p_madw ); + + p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_rcvd_rec = (ib_switch_info_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad ); + + CL_ASSERT( p_rcvd_mad->attr_id == IB_MAD_ATTR_SWITCH_INFO_RECORD ); + + /* we only support SubnAdmGet and SubnAdmGetTable methods */ + if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) && + (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_sir_rcv_process: ERR 5305: " + "Unsupported Method (%s)\n", + ib_get_sa_method_str( p_rcvd_mad->method ) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR ); + goto Exit; + } + + /* update the requester physical port. */ + p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log, + p_rcv->p_subn, + osm_madw_get_mad_addr_ptr(p_madw) ); + if (p_req_physp == NULL) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_sir_rcv_process: ERR 5304: " + "Cannot find requester physical port\n" ); + goto Exit; + } + + if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + osm_dump_switch_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG ); + + cl_qlist_init( &rec_list ); + + context.p_rcvd_rec = p_rcvd_rec; + context.p_list = &rec_list; + context.comp_mask = p_rcvd_mad->comp_mask; + context.p_rcv = p_rcv; + context.p_req_physp = p_req_physp; + + cl_plock_acquire( p_rcv->p_lock ); + + /* Go over all switches */ + cl_qmap_apply_func( &p_rcv->p_subn->sw_guid_tbl, + __osm_sir_rcv_by_comp_mask, + &context ); + + cl_plock_release( p_rcv->p_lock ); + + num_rec = cl_qlist_count( &rec_list ); + + /* + * C15-0.1.30: + * If we do a SubnAdmGet and got more than one record it is an error ! 
+ */ + if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1) ) { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_sir_rcv_process: ERR 5303: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); + + /* need to set the mem free ... */ + p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_sir_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } + + pre_trim_num_rec = num_rec; +#ifndef VENDOR_RMPP_SUPPORT + /* we limit the number of records to a single packet */ + trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_switch_info_record_t); + if (trim_num_rec < num_rec) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "osm_sir_rcv_process: " + "Number of records:%u trimmed to:%u to fit in one MAD\n", + num_rec, trim_num_rec ); + num_rec = trim_num_rec; + } +#endif + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_sir_rcv_process: " + "Returning %u records\n", num_rec ); + + if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + + /* + * Get a MAD to reply. 
Address of Mad is in the received mad_wrapper + */ + p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, + p_madw->h_bind, + num_rec * sizeof(ib_switch_info_record_t) + IB_SA_MAD_HDR_SIZE, + &p_madw->mad_addr ); + + if( !p_resp_madw ) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_sir_rcv_process: ERR 5306: " + "osm_mad_pool_get failed\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list ); + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RESOURCES ); + goto Exit; + } + + p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw ); + + /* + Copy the MAD header back into the response mad. + Set the 'R' bit and the payload length, + Then copy all records from the list into the response payload. + */ + + memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE ); + p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK; + /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */ + p_resp_sa_mad->sm_key = 0; + /* Fill in the offset (paylen will be done by the rmpp SAR) */ + p_resp_sa_mad->attr_offset = + ib_get_attr_offset( sizeof(ib_switch_info_record_t) ); + + p_resp_rec = (ib_switch_info_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); + +#ifndef VENDOR_RMPP_SUPPORT + /* we support only one packet RMPP - so we will set the first and + last flags for gettable */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + { + p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA; + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE; + } +#else + /* forcefully define the packet as RMPP one */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; +#endif + + for( i = 0; i < pre_trim_num_rec; i++ ) + { + p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list ); + /* copy only if not trimmed */ + if (i < num_rec) + { + *p_resp_rec = 
p_rec_item->rec; + } + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_resp_rec++; + } + + CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); + + status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + if (status != IB_SUCCESS) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_sir_rcv_process: ERR 5307: " + "osm_vendor_send status = %s\n", + ib_get_err_str(status)); + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} diff --git a/osm/opensm/osm_sa_sw_info_record_ctrl.c b/osm/opensm/osm_sa_sw_info_record_ctrl.c new file mode 100644 index 0000000..daf55cc --- /dev/null +++ b/osm/opensm/osm_sa_sw_info_record_ctrl.c @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of osm_sir_rcv_ctrl_t. + * This object represents the SwitchInfo Record controller object. + * This object is part of the opensm family of objects. + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +/********************************************************************** + **********************************************************************/ +void +__osm_sir_ctrl_disp_callback( + IN void *context, + IN void *p_data ) +{ + /* ignore return status when invoked via the dispatcher */ + osm_sir_rcv_process( ((osm_sir_rcv_ctrl_t*)context)->p_rcv, + (osm_madw_t*)p_data ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_sir_rcv_ctrl_construct( + IN osm_sir_rcv_ctrl_t* const p_ctrl ) +{ + memset( p_ctrl, 0, sizeof(*p_ctrl) ); + p_ctrl->h_disp = CL_DISP_INVALID_HANDLE; +} + +/********************************************************************** + **********************************************************************/ +void +osm_sir_rcv_ctrl_destroy( + IN osm_sir_rcv_ctrl_t* const p_ctrl ) +{ + CL_ASSERT( p_ctrl ); + cl_disp_unregister( p_ctrl->h_disp ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_sir_rcv_ctrl_init( + IN osm_sir_rcv_ctrl_t* const p_ctrl, + IN osm_sir_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ) +{ + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_log, 
osm_sir_rcv_ctrl_init ); + + osm_sir_rcv_ctrl_construct( p_ctrl ); + p_ctrl->p_log = p_log; + p_ctrl->p_rcv = p_rcv; + p_ctrl->p_disp = p_disp; + + p_ctrl->h_disp = cl_disp_register( + p_disp, + OSM_MSG_MAD_SWITCH_INFO_RECORD, + __osm_sir_ctrl_disp_callback, + p_ctrl ); + + if( p_ctrl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_sir_rcv_ctrl_init: ERR 5301: " + "Dispatcher registration failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_log ); + return( status ); +} From halr at voltaire.com Wed Dec 27 08:46:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 11:46:27 -0500 Subject: [openib-general] [PATCH 3/4] OpenSM: Other changes to incorporate optional SA SwitchInfoRecord support Message-ID: <1167237684.29620.74966.camel@hal.voltaire.com> OpenSM: Other changes to incorporate optional SA SwitchInfoRecord support Signed-off-by: Hal Rosenstock diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am index cc90283..d051b9a 100644 --- a/osm/include/Makefile.am +++ b/osm/include/Makefile.am @@ -109,6 +109,8 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sa_link_record_ctrl.h \ $(srcdir)/opensm/osm_sw_info_rcv_ctrl.h \ $(srcdir)/opensm/osm_sa_mcmember_record.h \ + $(srcdir)/opensm/osm_sa_sw_info_record_ctrl.h \ + $(srcdir)/opensm/osm_sa_sw_info_record.h \ $(srcdir)/opensm/osm_vl15intf.h \ $(srcdir)/opensm/osm_drop_mgr.h \ $(srcdir)/opensm/osm_port_info_rcv.h \ diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h index a9fa613..3611025 100644 --- a/osm/include/opensm/osm_msgdef.h +++ b/osm/include/opensm/osm_msgdef.h @@ -195,6 +195,7 @@ enum OSM_MSG_MAD_SLVL, OSM_MSG_MAD_GUIDINFO_RECORD, OSM_MSG_MAD_INFORM_INFO_RECORD, + OSM_MSG_MAD_SWITCH_INFO_RECORD, #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) OSM_MSG_MAD_MULTIPATH_RECORD, #endif diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h index 93324b2..ae8d5ac 100644 
--- a/osm/include/opensm/osm_sa.h +++ b/osm/include/opensm/osm_sa.h @@ -76,6 +76,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -190,6 +191,10 @@ typedef struct _osm_sa /* LinearForwardingTable Query */ osm_lftr_rcv_t lftr_rcv; osm_lftr_rcv_ctrl_t lftr_rcv_ctrl; + + /* SwitchInfo Query */ + osm_sir_rcv_t sir_rcv; + osm_sir_rcv_ctrl_t sir_rcv_ctrl; } osm_sa_t; /* * FIELDS diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index 7c09e81..3ef246c 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -77,7 +77,8 @@ opensm_SOURCES = main.c osm_console.c os osm_sa_service_record_ctrl.c osm_sa_slvl_record.c \ osm_sa_slvl_record_ctrl.c osm_sa_sminfo_record.c \ osm_sa_sminfo_record_ctrl.c osm_sa_vlarb_record.c \ - osm_sa_vlarb_record_ctrl.c osm_service.c \ + osm_sa_vlarb_record_ctrl.c osm_sa_sw_info_record.c \ + osm_sa_sw_info_record_ctrl.c osm_service.c \ osm_slvl_map_rcv.c osm_slvl_map_rcv_ctrl.c \ osm_sm.c osm_sminfo_rcv.c \ osm_sminfo_rcv_ctrl.c osm_sm_mad_ctrl.c \ diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c index a6c475c..983d5e5 100644 --- a/osm/opensm/osm_sa.c +++ b/osm/opensm/osm_sa.c @@ -128,6 +128,9 @@ osm_sa_construct( osm_lftr_rcv_construct( &p_sa->lftr_rcv ); osm_lftr_rcv_ctrl_construct( &p_sa->lftr_rcv_ctrl ); + + osm_sir_rcv_construct( &p_sa->sir_rcv ); + osm_sir_rcv_ctrl_construct( &p_sa->sir_rcv_ctrl ); } /********************************************************************** @@ -159,6 +162,7 @@ osm_sa_shutdown( osm_slvl_rec_rcv_ctrl_destroy( &p_sa->slvl_rec_rcv_ctrl ); osm_pkey_rec_rcv_ctrl_destroy( &p_sa->pkey_rec_rcv_ctrl ); osm_lftr_rcv_ctrl_destroy( &p_sa->lftr_rcv_ctrl ); + osm_sir_rcv_ctrl_destroy( &p_sa->sir_rcv_ctrl ); osm_sa_mad_ctrl_destroy( &p_sa->mad_ctrl ); OSM_LOG_EXIT( p_sa->p_log ); @@ -190,6 +194,7 @@ osm_sa_destroy( osm_slvl_rec_rcv_destroy( &p_sa->slvl_rec_rcv ); osm_pkey_rec_rcv_destroy( &p_sa->pkey_rec_rcv ); osm_lftr_rcv_destroy( &p_sa->lftr_rcv ); 
+ osm_sir_rcv_destroy( &p_sa->sir_rcv ); osm_sa_resp_destroy( &p_sa->resp ); OSM_LOG_EXIT( p_sa->p_log ); @@ -514,6 +519,24 @@ osm_sa_init( if( status != IB_SUCCESS ) goto Exit; + status = osm_sir_rcv_init( + &p_sa->sir_rcv, + &p_sa->resp, + p_sa->p_mad_pool, + p_subn, + p_log, + p_lock); + if( status != IB_SUCCESS ) + goto Exit; + + status = osm_sir_rcv_ctrl_init( + &p_sa->sir_rcv_ctrl, + &p_sa->sir_rcv, + p_log, + p_disp ); + if( status != IB_SUCCESS ) + goto Exit; + Exit: OSM_LOG_EXIT( p_log ); return( status ); diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c index 440d773..4d7bcbb 100644 --- a/osm/opensm/osm_sa_class_port_info.c +++ b/osm/opensm/osm_sa_class_port_info.c @@ -194,7 +194,6 @@ __osm_cpi_rcv_respond( /* set specific capability mask bits */ /* we do not support the following optional records: OSM_CAP_IS_SUBN_OPT_RECS_SUP : - SwitchInfoRecord, RandomForwardingTableRecord, MulticastForwardingTableRecord, ServiceAssociationRecord diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c index 2605fbf..90c732d 100644 --- a/osm/opensm/osm_sa_mad_ctrl.c +++ b/osm/opensm/osm_sa_mad_ctrl.c @@ -212,6 +212,10 @@ __osm_sa_mad_ctrl_process( msg_id = OSM_MSG_MAD_INFORM_INFO_RECORD; break; + case IB_MAD_ATTR_SWITCH_INFO_RECORD: + msg_id = OSM_MSG_MAD_SWITCH_INFO_RECORD; + break; + #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) case IB_MAD_ATTR_MULTIPATH_RECORD: msg_id = OSM_MSG_MAD_MULTIPATH_RECORD; From halr at voltaire.com Wed Dec 27 08:46:36 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 11:46:36 -0500 Subject: [openib-general] [PATCH 4/4] osmtest/osmtest.c: Add SA SwitchInfoRecord tests Message-ID: <1167237690.29620.74968.camel@hal.voltaire.com> osmtest/osmtest.c: Add SA SwitchInfoRecord tests Signed-off-by: Hal Rosenstock diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 0ccc06c..eed390b 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ 
-4677,6 +4677,92 @@ osmtest_get_pkeytbl_rec_by_lid( IN osmte } /********************************************************************** + * Get SwitchInfo record by LID + **********************************************************************/ +ib_api_status_t +osmtest_get_sw_info_rec_by_lid( IN osmtest_t * const p_osmt, + IN ib_net16_t const lid, + IN OUT osmtest_req_context_t * const p_context ) +{ + ib_api_status_t status = IB_SUCCESS; + osmv_user_query_t user; + osmv_query_req_t req; + ib_switch_info_record_t record; + ib_mad_t *p_mad; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_get_sw_info_rec_by_lid ); + + if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) + { + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_get_sw_info_rec_by_lid: " + "Getting SwitchInfo record for LID 0x%02X\n", + cl_ntoh16( lid ) ); + } + + /* + * Do a blocking query for this record in the subnet. + * The result is returned in the result field of the caller's + * context structure. + * + * The query structures are locals. 
+ */ + memset( &req, 0, sizeof( req ) ); + memset( &user, 0, sizeof( user ) ); + memset( &record, 0, sizeof( record ) ); + + record.lid = lid; + p_context->p_osmt = p_osmt; + user.comp_mask = IB_SWIR_COMPMASK_LID; + user.attr_id = IB_MAD_ATTR_SWITCH_INFO_RECORD; + user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) ); + user.p_attr = &record; + + req.query_type = OSMV_QUERY_USER_DEFINED; + req.timeout_ms = p_osmt->opt.transaction_timeout; + req.retry_cnt = p_osmt->opt.retry_count; + + req.flags = OSM_SA_FLAGS_SYNC; + req.query_context = p_context; + req.pfn_query_cb = osmtest_query_res_cb; + req.p_query_input = &user; + req.sm_key = 0; + + status = osmv_query_sa( p_osmt->h_bind, &req ); + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_sw_info_rec_by_lid: ERR 006C: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + goto Exit; + } + + status = p_context->result.status; + + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_sw_info_rec_by_lid: ERR 006D: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + if( status == IB_REMOTE_ERROR ) + { + p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw ); + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_sw_info_rec_by_lid: " + "Remote error = %s\n", + ib_get_mad_status_str( p_mad )); + + status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK ); + } + goto Exit; + } + + Exit: + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} + +/********************************************************************** * Get LFT record by LID **********************************************************************/ ib_api_status_t @@ -5820,6 +5906,17 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* SwitchInfo Record tests */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_sw_info_rec_by_lid( p_osmt, 0, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + 
memset( &context, 0, sizeof( context ) ); + status = osmtest_get_sw_info_rec_by_lid( p_osmt, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + /* LFT Record test */ memset( &context, 0, sizeof( context ) ); status = osmtest_get_lft_rec_by_lid( p_osmt, test_lid, &context ); @@ -6169,6 +6266,12 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* Another SwitchInfo Record test */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_sw_info_rec_by_lid( p_osmt, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + /* Another LFT Record test */ memset( &context, 0, sizeof( context ) ); status = osmtest_get_lft_rec_by_lid( p_osmt, test_lid, &context ); From jsquyres at cisco.com Wed Dec 27 09:02:41 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 Dec 2006 12:02:41 -0500 Subject: [openib-general] SVN deprecation Message-ID: I propose "svn rm"'ing unused trees in the SVN repository and leaving README files indicating that everything has moved to git (remember: everything is still available via the SVN history). If no one has any objections, I'll do this on Friday, 5 Jan 2007. ** PLEASE READ THE FOLLOWING CAREFULLY and send in your comments! Otherwise, things may disappear from SVN that you didn't expect. 
UNKNOWN whether to keep or remove:
(i.e., they seem to have "recent" development)
==============================================

DEVELOPER  MTIME     PATH
---------  --------  ----------------------------------
dotanb     Dec 2006  /trunk/contrib/mellanox
vlad       Dec 2006  /gen2/trunk/ofed
swise      Oct 2006  /gen2/branches/iwarp
hnguyen    Sep 2006  /trunk/contrib/ibm
amitk      Sep 2006  /gen2/branches/1.0
vlad       Sep 2006  /gen2/branches/ofed_fixes
monil      Sep 2006  /gen2/branches/backport
woody      Sep 2006  /gen2/branches/backport-to-2.6.9
halr       May 2006  /gen2/branches/ibat
mst        Jul 2006  /gen2/branches/mellanox_fixes

KEEP the following:
===================

- /gen2/branches/1.1: by request (Tziporet)

REMOVE the following:
=====================

In short, everything will be removed except what was listed above.
However, to be explicit, some more entries are listed below.
(*) entries mean "everything except what was already listed above"

Remove these trees based on the fact that they haven't changed in a
long time:

MTIME      PATH
---------  ------------------------------
Apr 2006   /trunk/contrib/*
Apr 2006   /trunk/branches/*
Apr 2006   /gen2/ulps
Apr 2006   /gen2/branches/*
Mar 2006   /gen2/users
May 2005   /gen1
Jan 2005   /gen2/trunk/arch
Dec 2004   /gen2/utils
Nov 2004   /gen2/trunk/scripts
Jul 2004   /tags
Apr 2004   /trunk/openib

Remove these trees for additional rationale:

- /branches: it's empty
- /gen2/tags: replaced by OFED and git
- /gen2/src: everything should now be in git (*** IS THIS RIGHT?!?!)

Comments?

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From sashak at voltaire.com Wed Dec 27 09:18:13 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 27 Dec 2006 19:18:13 +0200
Subject: [openib-general] [PATCH 2/3] osm: Changes for windows compatability
In-Reply-To: <459236F6.8060707@dev.mellanox.co.il>
References: <459236F6.8060707@dev.mellanox.co.il>
Message-ID: <20061227171813.GA11268@sashak.voltaire.com>

Hi Yevgeny,

On 11:03 Wed 27 Dec , Yevgeny Kliteynik wrote:
> Hi Hal.
> > Fixing windows compilation problems. > > Signed-off-by: Yevgeny Kliteynik > --- > osm/opensm/osm_ucast_ftree.c | 42 ++++++++++++++++++++++-------------------- > 1 files changed, 22 insertions(+), 20 deletions(-) > > diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c > index ba95a0d..054e3c9 100644 > --- a/osm/opensm/osm_ucast_ftree.c > +++ b/osm/opensm/osm_ucast_ftree.c [snip..] > @@ -226,7 +226,7 @@ typedef struct ftree_fabric_t_ > ** > ***************************************************/ > > -int > +int OSM_CDECL > __osm_ftree_compare_switches_by_index( > IN const void * p1, > IN const void * p2) Is this function is used somewhere in a global namespace? If no, this probably should be 'static' and don't have OSM_CDECL attribute. If yes, isn't this cleaner to have OSM_CDECL in header file, where the function prototype is located? > @@ -247,7 +247,7 @@ __osm_ftree_compare_switches_by_index( > > /***************************************************/ > > -int > +int OSM_CDECL > __osm_ftree_compare_port_groups_by_remote_switch_index( > IN const void * p1, > IN const void * p2) Ditto. Sasha From sashak at voltaire.com Wed Dec 27 09:25:29 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 27 Dec 2006 19:25:29 +0200 Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability In-Reply-To: <4592956F.3020501@dev.mellanox.co.il> References: <4592956F.3020501@dev.mellanox.co.il> Message-ID: <20061227172529.GB11268@sashak.voltaire.com> On 17:46 Wed 27 Dec , Yevgeny Kliteynik wrote: > Hi Hal. 
> > Fixing windows compilation problems > [V2 - Previous patch had an error] > > Signed-off-by: Yevgeny Kliteynik > --- > osm/include/iba/ib_types.h | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h > index 723e8b9..ec65b64 100644 > --- a/osm/include/iba/ib_types.h > +++ b/osm/include/iba/ib_types.h > @@ -59,9 +59,10 @@ BEGIN_C_DECLS > #define OSM_EXPORT __declspec(dllimport) > #endif > #define OSM_API __stdcall > + #define OSM_CDECL __cdecl > #else > #define OSM_EXPORT extern > #define OSM_API > + #define OSM_CDECL > #define __ptr64 > #endif Just wondering, how does lack of __cdecl hurt windows compilation (in the context of where those __cdecl is used)? What is the reason to have both __stdcall and __cdecl (and what is the default)? Sasha From halr at voltaire.com Wed Dec 27 09:17:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 12:17:54 -0500 Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability In-Reply-To: <4592956F.3020501@dev.mellanox.co.il> References: <4592956F.3020501@dev.mellanox.co.il> Message-ID: <1167239871.29620.76806.camel@hal.voltaire.com> On Wed, 2006-12-27 at 10:46, Yevgeny Kliteynik wrote: > Hi Hal. > > Fixing windows compilation problems > [V2 - Previous patch had an error] > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. -- Hal From halr at voltaire.com Wed Dec 27 09:18:27 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 12:18:27 -0500 Subject: [openib-general] [PATCH 2/3] osm: Changes for windows compatability In-Reply-To: <459236F6.8060707@dev.mellanox.co.il> References: <459236F6.8060707@dev.mellanox.co.il> Message-ID: <1167239876.29620.76808.camel@hal.voltaire.com> On Wed, 2006-12-27 at 04:03, Yevgeny Kliteynik wrote: > Hi Hal. > > Fixing windows compilation problems. > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. 
-- Hal From halr at voltaire.com Wed Dec 27 09:18:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 12:18:33 -0500 Subject: [openib-general] [PATCH 3/3] osm: Changes for windows compatability In-Reply-To: <4592374E.7020008@dev.mellanox.co.il> References: <4592374E.7020008@dev.mellanox.co.il> Message-ID: <1167239903.29620.76873.camel@hal.voltaire.com> On Wed, 2006-12-27 at 04:05, Yevgeny Kliteynik wrote: > Hi Hal. > > Fixing windows compilation problems. > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. -- Hal From halr at voltaire.com Wed Dec 27 09:24:38 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 12:24:38 -0500 Subject: [openib-general] [PATCH] osm: additional check of tree topology In-Reply-To: <4592958B.7030102@dev.mellanox.co.il> References: <4592958B.7030102@dev.mellanox.co.il> Message-ID: <1167240276.29620.77186.camel@hal.voltaire.com> On Wed, 2006-12-27 at 10:47, Yevgeny Kliteynik wrote: > Hi Hal > > As we've discussed before - added check for fat-tree topology > to be at least of rank 2. > > -- > Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. -- Hal From mst at mellanox.co.il Wed Dec 27 09:26:58 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 27 Dec 2006 19:26:58 +0200 Subject: [openib-general] Old svn repository access In-Reply-To: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com> References: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com> Message-ID: <20061227172658.GB5377@mellanox.co.il> > What exactly in OFED 1.0 uses the name openib.org -- SVN access? Yes. -- MST From halr at voltaire.com Wed Dec 27 09:32:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 27 Dec 2006 12:32:28 -0500 Subject: [openib-general] [PATCH] osm: fat-tree documentation In-Reply-To: <45929D0B.3090308@dev.mellanox.co.il> References: <45929D0B.3090308@dev.mellanox.co.il> Message-ID: <1167240747.29620.77561.camel@hal.voltaire.com> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote: > Hi Hal. 
> > Added fat-tree routing details and some cosmetics in the txt files. > > -- > Yevgeny > > Signed-off-by: Yevgeny Kliteynik Thanks. Applied. A couple of minor questions: Should similar text as in current-routing.txt be added to the OpenSM man page ? Also, rather than HCA in the below, is CA better (to include TCAs as well) ? -- Hal From mst at mellanox.co.il Wed Dec 27 09:42:45 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 27 Dec 2006 19:42:45 +0200 Subject: [openib-general] tavor quirks etc (opensm compliance etc) In-Reply-To: <45927D3A.9030502@voltaire.com> References: <45927D3A.9030502@voltaire.com> Message-ID: <20061227174245.GC5377@mellanox.co.il> > 3rd Eitan/Michael: what is the bigger picture here? what is the > dependency between these four patches In short, [2] is an independent fix to improve tavor performance. Other things are not directly related. Detail below. > +1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors > +2 osm: tavor quirk > +3 IB/rdmacm: tavor quirk > +4 IB/ipoib: use appropriate mtu selector for path queries In the above: [1] is a bug fix I think. It is not required for [2]. [2] is a feature that improves performance for tavor without need for any other stack/ULP changes [3] is a hack that should have same effect as [2] for old SMs, but it needs manual tuning by user. If activated, it unfortunately triggers a bug in opensm that [1] fixes. So it might not be a good idea after all. [4] is not strictly necessary, and not related to this patch set - it just happens to also play with MTU selector. It is a strict compliance cleanup that I just happened to notice when I invented [2]. > for example is it correct that: > > if [2] is applied on the SA side then [4] must be applied on ipoib else > if will get 1K mtu on its path query? Not really - ipoib does not actually use the MTU it gets from the path query, according to spec it uses the bcast group mtu for all packets. 
> if [2] is not applied on the SA side, then [3] is useless? No. If [2] is applied on te SA side, the [3] is unnecessary. -- MST From jsquyres at cisco.com Wed Dec 27 09:54:03 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 Dec 2006 12:54:03 -0500 Subject: [openib-general] Old svn repository access In-Reply-To: <20061227172658.GB5377@mellanox.co.il> References: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com> <20061227172658.GB5377@mellanox.co.il> Message-ID: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com> Ok. Does that mean we need to keep OFED 1.0 available in SVN (and not "svn rm" it)? See my mail from earlier today about SVN. On Dec 27, 2006, at 12:26 PM, Michael S. Tsirkin wrote: >> What exactly in OFED 1.0 uses the name openib.org -- SVN access? > > Yes. > > -- > MST -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From mst at mellanox.co.il Wed Dec 27 09:55:24 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 27 Dec 2006 19:55:24 +0200 Subject: [openib-general] SVN deprecation In-Reply-To: References: Message-ID: <20061227175524.GC6644@mellanox.co.il> > mst Jul 2006 /gen2/branches/mellanox_fixes Remove. -- MST From mst at mellanox.co.il Wed Dec 27 09:56:46 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 27 Dec 2006 19:56:46 +0200 Subject: [openib-general] Old svn repository access In-Reply-To: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com> References: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com> Message-ID: <20061227175646.GD6644@mellanox.co.il> Yes, I think it's a good idea to keep OFED 1.0 around and not svn rm it. Quoting r. Jeff Squyres : Subject: Re: Old svn repository access Ok. Does that mean we need to keep OFED 1.0 available in SVN (and not "svn rm" it)? See my mail from earlier today about SVN. On Dec 27, 2006, at 12:26 PM, Michael S. Tsirkin wrote: >> What exactly in OFED 1.0 uses the name openib.org -- SVN access? > > Yes. 
> > -- > MST -- Jeff Squyres Server Virtualization Business Unit Cisco Systems -- MST From mst at mellanox.co.il Wed Dec 27 10:06:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 27 Dec 2006 20:06:02 +0200 Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability In-Reply-To: <4592956F.3020501@dev.mellanox.co.il> References: <4592956F.3020501@dev.mellanox.co.il> Message-ID: <20061227180602.GE6644@mellanox.co.il> > Hi Hal. > > Fixing windows compilation problems > [V2 - Previous patch had an error] I don't think "fixing windows compilation" is a real log description. What kind of errors? Isn't there a better fix? > Signed-off-by: Yevgeny Kliteynik > --- > osm/include/iba/ib_types.h | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h > index 723e8b9..ec65b64 100644 > --- a/osm/include/iba/ib_types.h > +++ b/osm/include/iba/ib_types.h > @@ -59,9 +59,10 @@ BEGIN_C_DECLS > #define OSM_EXPORT __declspec(dllimport) > #endif > #define OSM_API __stdcall > + #define OSM_CDECL __cdecl > #else > #define OSM_EXPORT extern > #define OSM_API > + #define OSM_CDECL > #define __ptr64 > #endif Why is this necessary at all? http://msdn2.microsoft.com/en-us/library/zkwh89ks.aspx Microsoft Specific This is the default calling convention for C and C++ programs. In other words it's the default, you don't have to declare it. Place the __cdecl modifier before a variable or a function name. Because the C naming and calling conventions are the default, the only time you need to use __cdecl is when you have specified the /Gz (stdcall) or /Gr (fastcall) compiler option. The /Gd compiler option forces the __cdecl calling convention. So why are you compiling with /Gz, after the code is already littered with OSM_API? And why is OSM_API necessary? 
It seems to me the right thing might be to remove all of OSM_API/OSM_CDECL
from code, and just build everything on windows with consistent compiler flags.

--
MST

From mst at mellanox.co.il Wed Dec 27 10:10:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 20:10:48 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability
In-Reply-To: <1167239871.29620.76806.camel@hal.voltaire.com>
References: <4592956F.3020501@dev.mellanox.co.il> <1167239871.29620.76806.camel@hal.voltaire.com>
Message-ID: <20061227181048.GF6644@mellanox.co.il>

> > Hi Hal.
> >
> > Fixing windows compilation problems
> > [V2 - Previous patch had an error]
> >
> > Signed-off-by: Yevgeny Kliteynik
>
> Thanks. Applied.

The log is not really informative - shouldn't it say what this fixes?
In this case, it is forcing a specific calling convention on code - it's
a bit more than just "fixing compilation" as it claims.

I'm worried that windows-related patches don't seem to be properly
peer-reviewed. Wouldn't looking things up on msdn before applying
windows-related stuff be a good idea?

--
MST

From jsquyres at cisco.com Wed Dec 27 10:14:09 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 13:14:09 -0500
Subject: [openib-general] SVN deprecation
In-Reply-To: <20061227175524.GC6644@mellanox.co.il>
References: <20061227175524.GC6644@mellanox.co.il>
Message-ID:

So noted -- thanks!

On Dec 27, 2006, at 12:55 PM, Michael S. Tsirkin wrote:

>> mst Jul 2006 /gen2/branches/mellanox_fixes
>
> Remove.
> > -- > MST -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From jsquyres at cisco.com Wed Dec 27 10:14:44 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 27 Dec 2006 13:14:44 -0500 Subject: [openib-general] Old svn repository access In-Reply-To: <20061227175646.GD6644@mellanox.co.il> References: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com> <20061227175646.GD6644@mellanox.co.il> Message-ID: On Dec 27, 2006, at 12:56 PM, Michael S. Tsirkin wrote: > Yes, I think it's a good idea to keep OFED 1.0 around and not > svn rm it. So noted -- won't remove. Thanks! > Quoting r. Jeff Squyres : > Subject: Re: Old svn repository access > > Ok. Does that mean we need to keep OFED 1.0 available in SVN (and > not "svn rm" it)? See my mail from earlier today about SVN. > > > On Dec 27, 2006, at 12:26 PM, Michael S. Tsirkin wrote: > >>> What exactly in OFED 1.0 uses the name openib.org -- SVN access? >> >> Yes. >> >> -- >> MST > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > -- > MST -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From mshefty at ichips.intel.com Wed Dec 27 11:50:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 27 Dec 2006 11:50:33 -0800 Subject: [openib-general] IB_CM_REJ_INVALID_SERVICE_ID In-Reply-To: <200612202222.kBKMMDeY020463@robert.bartonsoftware.com> References: <200612202222.kBKMMDeY020463@robert.bartonsoftware.com> Message-ID: <4592CE89.2060005@ichips.intel.com> Eric Barton wrote: > Can an rdma_connect be rejected with IB_CM_REJ_INVALID_SERVICE_ID for any other > reason than the peer isn't listening with the correct service number? This should only occur if the remote peer isn't listening. This reject code is automatically sent by the ib_cm when a request does not find a corresponding listen. >>We are testing 1.6b5 for a InfiniBand cluster with RHEL 4. We use the >>binaries provides by CFS and use OFED 1.1 as the IB stack. 
>> >>At several times some of the clients hang during fs mount or when an OST >>is added (see log). >>Error: >>LustreError: 1776:0:(o2iblnd_cb.c:2314:kiblnd_rejected()) 10.0.90.8 at o2ib >>rejected: reason 8, size 148 Is this event = 8 and status = 8? >> >>from OFED: >>enum ib_cm_rej_reason { >> IB_CM_REJ_INVALID_SERVICE_ID = 8, >> >>Once an IPoIB ping is started to the corresponding OST the client >>continues. Afterwards it is quite stable. > > > ...which seems to be saying that just doing an IPoIB ping to the server was > enough to make rdma_connect() work OK. I can't explain the relationship between the ping and the connect starting to work. - Sean From mshefty at ichips.intel.com Wed Dec 27 12:00:36 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 27 Dec 2006 12:00:36 -0800 Subject: [openib-general] No resource tracking per qp for multicast groups In-Reply-To: <458FC332.1010801@voltaire.com> References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com> <458FC332.1010801@voltaire.com> Message-ID: <4592D0E4.2010006@ichips.intel.com> > Per my understanding the issues you describe here are orthogonal to > Sean's multicast work, correct? were they solved in mthca or its still > open? This is orthogonal to the multicast module, which tracks joins made to the SA. I do not know if this problem was solved however. - Sean From kliteyn at dev.mellanox.co.il Wed Dec 27 13:05:27 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 23:05:27 +0200 Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability In-Reply-To: <20061227180602.GE6644@mellanox.co.il> References: <4592956F.3020501@dev.mellanox.co.il> <20061227180602.GE6644@mellanox.co.il> Message-ID: <4592E017.1050508@dev.mellanox.co.il> Michael S. Tsirkin wrote: >> Hi Hal. >> >> Fixing windows compilation problems >> [V2 - Previous patch had an error] > > I don't think "fixing windows compilation" is a real log description. > What kind of errors? 
Isn't there a better fix?
>
>> Signed-off-by: Yevgeny Kliteynik
>> ---
>> osm/include/iba/ib_types.h | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
>> index 723e8b9..ec65b64 100644
>> --- a/osm/include/iba/ib_types.h
>> +++ b/osm/include/iba/ib_types.h
>> @@ -59,9 +59,10 @@ BEGIN_C_DECLS
>> #define OSM_EXPORT __declspec(dllimport)
>> #endif
>> #define OSM_API __stdcall
>> + #define OSM_CDECL __cdecl
>> #else
>> #define OSM_EXPORT extern
>> #define OSM_API
>> + #define OSM_CDECL
>> #define __ptr64
>> #endif
>
> Why is this necessary at all?
> http://msdn2.microsoft.com/en-us/library/zkwh89ks.aspx
> Microsoft Specific
> This is the default calling convention for C and C++ programs.
> In other words it's the default, you don't have to declare it.
>
> Place the __cdecl modifier before a variable or a function name. Because the C
> naming and calling conventions are the default, the only time you need to use
> __cdecl is when you have specified the /Gz (stdcall) or /Gr (fastcall) compiler
> option. The /Gd compiler option forces the __cdecl calling convention.
>
> So why are you compiling with /Gz, after the code is already littered with
> OSM_API? And why is OSM_API necessary?

I did see that __cdecl is the default on windows. However, the compiler
complained about a certain function (more specifically, a comparison
function that is supplied as an argument to qsort()), saying that it is
defined as __stdcall instead of __cdecl. As you say, it's probably because
of a compilation flag - I didn't investigate this issue.

> It seems to me the right thing might be to remove all of OSM_API/OSM_CDECL
> from code, and just build everything on windows with consistent compiler flags.

I'll check with the windows guys why we have such a compilation flag
(assuming we do have it), and whether it can be removed.
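For readers following the calling-convention discussion, here is a minimal sketch of the situation described above: a comparison callback handed to qsort() that is pinned to the C calling convention, so that it still matches the prototype qsort() expects when the build switches the default convention (for example with /Gz). EXAMPLE_CDECL, compare_by_index and sort_indices are hypothetical names made up for this illustration and are not OpenSM code; the macro only mirrors the idea of OSM_CDECL from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macro mirroring the OSM_CDECL idea: it expands to
 * __cdecl only when building with the Microsoft compiler. */
#if defined(_MSC_VER)
#define EXAMPLE_CDECL __cdecl
#else
#define EXAMPLE_CDECL
#endif

/* Comparator passed to qsort(). It is 'static' because it is only
 * used in this file, and explicitly cdecl so that it keeps matching
 * qsort()'s callback type even if the default convention is changed
 * by a compiler flag. */
static int EXAMPLE_CDECL compare_by_index(const void *p1, const void *p2)
{
    const int a = *(const int *)p1;
    const int b = *(const int *)p2;

    /* Yields -1, 0 or 1 without the overflow risk of 'a - b'. */
    return (a > b) - (a < b);
}

/* Sort an array of indices in ascending order. */
static void sort_indices(int *indices, size_t count)
{
    assert(indices != NULL);
    qsort(indices, count, sizeof(indices[0]), compare_by_index);
}
```

On other compilers the macro expands to nothing, so the same source builds unchanged there. Whether the annotation is needed at all depends on the build flags, which is exactly the open question in this thread.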
-- Yevgeny From kliteyn at dev.mellanox.co.il Wed Dec 27 13:30:08 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 27 Dec 2006 23:30:08 +0200 Subject: [openib-general] [PATCH 2/3] osm: Changes for windows compatability In-Reply-To: <20061227171813.GA11268@sashak.voltaire.com> References: <459236F6.8060707@dev.mellanox.co.il> <20061227171813.GA11268@sashak.voltaire.com> Message-ID: <4592E5E0.2060505@dev.mellanox.co.il> Hi Sasha. Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 11:03 Wed 27 Dec , Yevgeny Kliteynik wrote: >> Hi Hal. >> >> Fixing windows compilation problems. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> osm/opensm/osm_ucast_ftree.c | 42 ++++++++++++++++++++++-------------------- >> 1 files changed, 22 insertions(+), 20 deletions(-) >> >> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c >> index ba95a0d..054e3c9 100644 >> --- a/osm/opensm/osm_ucast_ftree.c >> +++ b/osm/opensm/osm_ucast_ftree.c > > [snip..] > >> @@ -226,7 +226,7 @@ typedef struct ftree_fabric_t_ >> ** >> ***************************************************/ >> >> -int >> +int OSM_CDECL >> __osm_ftree_compare_switches_by_index( >> IN const void * p1, >> IN const void * p2) > > Is this function is used somewhere in a global namespace? If no, this > probably should be 'static' and don't have OSM_CDECL attribute. If yes, > isn't this cleaner to have OSM_CDECL in header file, where the function > prototype is located? The function should be 'static __cdecl'. I'll check with the windows guys regarding the __cdecl not being default. >> @@ -247,7 +247,7 @@ __osm_ftree_compare_switches_by_index( >> >> /***************************************************/ >> >> -int >> +int OSM_CDECL >> __osm_ftree_compare_port_groups_by_remote_switch_index( >> IN const void * p1, >> IN const void * p2) > > Ditto. Right, same thing here. Thanks. 
--Yevgeny

> Sasha
>

From kliteyn at dev.mellanox.co.il Wed Dec 27 13:36:40 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 23:36:40 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows compatability
In-Reply-To: <20061227172529.GB11268@sashak.voltaire.com>
References: <4592956F.3020501@dev.mellanox.co.il> <20061227172529.GB11268@sashak.voltaire.com>
Message-ID: <4592E768.3090005@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 17:46 Wed 27 Dec , Yevgeny Kliteynik wrote:
>> Hi Hal.
>>
>> Fixing windows compilation problems
>> [V2 - Previous patch had an error]
>>
>> Signed-off-by: Yevgeny Kliteynik
>> ---
>> osm/include/iba/ib_types.h | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
>> index 723e8b9..ec65b64 100644
>> --- a/osm/include/iba/ib_types.h
>> +++ b/osm/include/iba/ib_types.h
>> @@ -59,9 +59,10 @@ BEGIN_C_DECLS
>> #define OSM_EXPORT __declspec(dllimport)
>> #endif
>> #define OSM_API __stdcall
>> + #define OSM_CDECL __cdecl
>> #else
>> #define OSM_EXPORT extern
>> #define OSM_API
>> + #define OSM_CDECL
>> #define __ptr64
>> #endif
>
> Just wondering, how does lack of __cdecl hurt windows compilation (in
> the context of where those __cdecl is used)?
> What is the reason to have both __stdcall and __cdecl (and what is the
> default)?

Hi Sasha.

__cdecl is the default on windows. However, the compiler complained about
a certain function (more specifically, a comparison function that is
supplied as an argument to qsort()), saying that it is defined as
__stdcall instead of __cdecl. As MST has pointed out, it's probably
because of a compilation flag.

I'll check with the windows guys why we have such a compilation flag
(assuming we do have it), and whether it can be removed.

The same goes for __stdcall - I'm sure there is some historical reason
for having it. The question is whether we still need it.

Thanks.
-- Yevgeny > Sasha > From sashak at voltaire.com Wed Dec 27 15:09:15 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 28 Dec 2006 01:09:15 +0200 Subject: [openib-general] [PATCH] diags: fix loops handling in ibnetdiscover Message-ID: <20061227230915.GF11268@sashak.voltaire.com> This fixes loop cabling and loopback connections handling in ibnetdiscover. Signed-off-by: Sasha Khapyorsky --- diags/src/ibnetdiscover.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c index 71f6b83..31b7063 100644 --- a/diags/src/ibnetdiscover.c +++ b/diags/src/ibnetdiscover.c @@ -338,7 +338,7 @@ handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist) free(remotenode); /* Handle loopback plug */ - if (port->portguid == remoteport->portguid) { + if (port->portnum == remoteport->portnum) { free(remoteport); remoteport = port; } -- 1.4.4.2.gfc82d From sashak at voltaire.com Wed Dec 27 15:10:17 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 28 Dec 2006 01:10:17 +0200 Subject: [openib-general] [PATCH] diags: eliminate __WORDSIZE ifdefs for printing Message-ID: <20061227231017.GG11268@sashak.voltaire.com> Use portable PRIx64 macro in printf format strings instead of using '#if __WORDSIZE == 64' with printf style functions. 
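As an illustration of the change described above, here is a minimal sketch of printing a 64-bit GUID through the PRIx64 macro from <inttypes.h>, which supplies the right length modifier for uint64_t on both 32-bit and 64-bit builds. format_guid is a hypothetical helper invented for this example; it is not part of the diags sources.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical helper: write a GUID as 16 zero-padded hex digits in
 * braces. PRIx64 expands to the conversion suffix for uint64_t on the
 * current platform, so no '#if __WORDSIZE == 64' branch is needed.
 * Returns the value reported by snprintf(). */
static int format_guid(char *buf, size_t len, uint64_t guid)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "{%016" PRIx64 "}", guid);
}
```

The format string is assembled from adjacent string literals, which is why the macro sits outside the quotes.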
Signed-off-by: Sasha Khapyorsky --- diags/src/ibnetdiscover.c | 63 +++++-------------------------------- diags/src/ibroute.c | 15 ++------- diags/src/ibtracert.c | 74 +++++---------------------------------------- diags/src/sminfo.c | 8 +---- 4 files changed, 22 insertions(+), 138 deletions(-) diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c index 31b7063..0b5078b 100644 --- a/diags/src/ibnetdiscover.c +++ b/diags/src/ibnetdiscover.c @@ -213,21 +213,12 @@ dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port) if (!dumplevel) return; -#if __WORDSIZE == 64 - fprintf(f, "%s -> %s %s {%016lx} portnum %d lid %d-%d\"%s\"\n", + fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n", portid2str(path), prompt, (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum, port->lid, port->lid + (1 << port->lmc) - 1, clean_nodedesc(node->nodedesc)); -#else - fprintf(f, "%s -> %s %s {%016Lx} portnum %d lid %d-%d\"%s\"\n", - portid2str(path), prompt, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->type == SWITCH_NODE ? 
0 : port->portnum, - port->lid, port->lid + (1 << port->lmc) - 1, - clean_nodedesc(node->nodedesc)); -#endif } #define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) @@ -265,11 +256,7 @@ link_port(Port *port, Node *node, Port *remoteport) } if (dumplevel) -#if __WORDSIZE == 64 - fprintf(f, "\t[%d] {%016lx}\n", port->portnum, port->portguid); -#else - fprintf(f, "\t[%d] {%016Lx}\n", port->portnum, port->portguid); -#endif + fprintf(f, "\t[%d] {%016" PRIx64 "}\n", port->portnum, port->portguid); DEBUG("inserting new port %p (%d) to node %p", port, port->portnum, node); port->node = node; @@ -447,13 +434,8 @@ node_name(Node *node) { static char buf[256]; -#if __WORDSIZE == 64 - sprintf(buf, "\"%s-%016lx\"", + sprintf(buf, "\"%s-%016" PRIx64 "\"", node->type == SWITCH_NODE ? "S" : "H", node->nodeguid); -#else - sprintf(buf, "\"%s-%016Lx\"", - node->type == SWITCH_NODE ? "S" : "H", node->nodeguid); -#endif return buf; } @@ -477,17 +459,10 @@ list_node(Node *node) node_type = "???"; break; } -#if __WORDSIZE == 64 - fprintf(f, "%s\t : 0x%016lx ports %d devid 0x%x vendid 0x%x \"%s\"\n", - node_type, - node->nodeguid, node->numports, node->devid, node->vendid, - clean_nodedesc(node->nodedesc)); -#else - fprintf(f, "%s\t : 0x%016Lx ports %d devid 0x%x vendid 0x%x \"%s\"\n", + fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", node_type, node->nodeguid, node->numports, node->devid, node->vendid, clean_nodedesc(node->nodedesc)); -#endif } void @@ -495,11 +470,7 @@ out_ids(Node *node) { fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); if (node->sysimgguid) -#if __WORDSIZE == 64 - fprintf(f, "sysimgguid=0x%lx\n", node->sysimgguid); -#else - fprintf(f, "sysimgguid=0x%Lx\n", node->sysimgguid); -#endif + fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid); } void @@ -514,11 +485,7 @@ out_chassis(Node *node) fprintf(f, "\nChassis %d", node->chrecord->chassisnum); guid = 
get_chassis_guid(node->chrecord->chassisnum); if (guid) { -#if __WORDSIZE == 64 - fprintf(f, " (guid 0x%lx)", guid); -#else - fprintf(f, " (guid 0x%Lx)", guid); -#endif + fprintf(f, " (guid 0x%" PRIx64 ")", guid); } fprintf(f, "\n"); } @@ -541,11 +508,7 @@ out_switch(Node *node, int group) } out_ids(node); -#if __WORDSIZE == 64 - fprintf(f, "%s=0x%lx", "switchguid", node->nodeguid); -#else - fprintf(f, "%s=0x%Lx", "switchguid", node->nodeguid); -#endif + fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); if (group) { if (node->chrecord) { if (node->chrecord->chassisnum) { @@ -592,11 +555,7 @@ out_ca(Node *node) node_type2 = "???"; break; } -#if __WORDSIZE == 64 - fprintf(f, "%s%s=0x%lx\n", node_type, "guid", node->nodeguid); -#else - fprintf(f, "%s%s=0x%Lx\n", node_type, "guid", node->nodeguid); -#endif + fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); fprintf(f, "%s\t%d %s\t\t# %s\n", node_type2, node->numports, node_name(node), clean_nodedesc(node->nodedesc)); @@ -649,11 +608,7 @@ dump_topology(int listtype, int group) if (!listtype) { fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered); -#if __WORDSIZE == 64 - fprintf(f, "# Initiated from node %016lx port %016lx\n", mynode->nodeguid, mynode->portguid); -#else - fprintf(f, "# Initiated from node %016Lx port %016Lx\n", mynode->nodeguid, mynode->portguid); -#endif + fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid); } /* Make pass on switches */ diff --git a/diags/src/ibroute.c b/diags/src/ibroute.c index f590fdd..8152b6d 100644 --- a/diags/src/ibroute.c +++ b/diags/src/ibroute.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -192,13 +193,8 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid) endlid = IB_MAX_MCAST_LID; } -#if __WORDSIZE == 64 - printf("Multicast mlids [0x%x-0x%x] of switch %s guid 0x%016lx (%s):\n", 
+ printf("Multicast mlids [0x%x-0x%x] of switch %s guid 0x%016" PRIx64 " (%s):\n", startlid, endlid, portid2str(portid), nodeguid, nd); -#else - printf("Multicast mlids [0x%x-0x%x] of switch %s guid 0x%016Lx (%s):\n", - startlid, endlid, portid2str(portid), nodeguid, nd); -#endif if (brief) printf(" MLid Port Mask\n"); @@ -338,13 +334,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid) endlid = IB_MAX_UCAST_LID; } -#if __WORDSIZE == 64 - printf("Unicast lids [0x%x-0x%x] of switch %s guid 0x%016lx (%s):\n", - startlid, endlid, portid2str(portid), nodeguid, nd); -#else - printf("Unicast lids [0x%x-0x%x] of switch %s guid 0x%016Lx (%s):\n", + printf("Unicast lids [0x%x-0x%x] of switch %s guid 0x%016" PRIx64 " (%s):\n", startlid, endlid, portid2str(portid), nodeguid, nd); -#endif DEBUG("Switch top is 0x%x\n", top); printf(" Lid Out Destination\n"); diff --git a/diags/src/ibtracert.c b/diags/src/ibtracert.c index bfa3d25..e545e9a 100644 --- a/diags/src/ibtracert.c +++ b/diags/src/ibtracert.c @@ -214,32 +214,17 @@ dump_endnode(int dump, char *prompt, Node *node, Port *port) if (!dump) return; if (dump == 1) { -#if __WORDSIZE == 64 - fprintf(f, "%s {%016lx}[%d]\n", + fprintf(f, "%s {%016" PRIx64 "}[%d]\n", prompt, node->nodeguid, node->type == IB_NODE_SWITCH ? 0 : port->portnum); -#else - fprintf(f, "%s {%016Lx}[%d]\n", - prompt, node->nodeguid, - node->type == IB_NODE_SWITCH ? 0 : port->portnum); -#endif return; } -#if __WORDSIZE == 64 - fprintf(f, "%s %s {%016lx} portnum %d lid 0x%x-0x%x \"%s\"\n", - prompt, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->type == IB_NODE_SWITCH ? 0 : port->portnum, - port->lid, port->lid + (1 << port->lmc) - 1, - node->nodedesc); -#else - fprintf(f, "%s %s {%016Lx} portnum %d lid 0x%x-0x%x \"%s\"\n", + fprintf(f, "%s %s {%016" PRIx64 "} portnum %d lid 0x%x-0x%x \"%s\"\n", prompt, (node->type <= IB_NODE_MAX ? 
node_type_str[node->type] : "???"), node->nodeguid, node->type == IB_NODE_SWITCH ? 0 : port->portnum, port->lid, port->lid + (1 << port->lmc) - 1, node->nodedesc); -#endif } static void @@ -247,29 +232,16 @@ dump_route(int dump, Node *node, int outport, Port *port) { if (!dump && !verbose) return; -#if __WORDSIZE == 64 if (dump == 1) - fprintf(f, "[%d] -> {%016lx}[%d]\n", + fprintf(f, "[%d] -> {%016" PRIx64 "}[%d]\n", outport, port->portguid, port->portnum); else - fprintf(f, "[%d] -> %s port {%016lx}[%d] lid 0x%x-0x%x \"%s\"\n", + fprintf(f, "[%d] -> %s port {%016" PRIx64 "}[%d] lid 0x%x-0x%x \"%s\"\n", outport, (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), port->portguid, port->portnum, port->lid, port->lid + (1 << port->lmc) - 1, node->nodedesc); -#else - if (dump == 1) - fprintf(f, "[%d] -> {%016Lx}[%d]\n", - outport, port->portguid, port->portnum); - else - fprintf(f, "[%d] -> %s port {%016Lx}[%d] lid 0x%x-0x%x \"%s\"\n", - outport, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - port->portguid, port->portnum, - port->lid, port->lid + (1 << port->lmc) - 1, - node->nodedesc); -#endif } static int @@ -667,65 +639,35 @@ dump_mcpath(Node *node, int dumplevel) dump_mcpath(node->upnode, dumplevel); if (!node->dist) { -#if __WORDSIZE == 64 - printf("From %s 0x%lx port %d lid 0x%x-0x%x \"%s\"\n", - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->ports->portnum, node->ports->lid, - node->ports->lid + (1 << node->ports->lmc) - 1, - node->nodedesc); -#else - printf("From %s 0x%Lx port %d lid 0x%x-0x%x \"%s\"\n", + printf("From %s 0x%" PRIx64 " port %d lid 0x%x-0x%x \"%s\"\n", (node->type <= IB_NODE_MAX ? 
node_type_str[node->type] : "???"), node->nodeguid, node->ports->portnum, node->ports->lid, node->ports->lid + (1 << node->ports->lmc) - 1, node->nodedesc); -#endif return; } if (node->dist) { -#if __WORDSIZE == 64 if (dumplevel == 1) - printf("[%d] -> %s {%016lx}[%d]\n", + printf("[%d] -> %s {%016" PRIx64 "}[%d]\n", node->ports->remoteport->portnum, (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), node->nodeguid, node->upport); else - printf("[%d] -> %s 0x%lx[%d] lid 0x%x \"%s\"\n", + printf("[%d] -> %s 0x%" PRIx64 "[%d] lid 0x%x \"%s\"\n", node->ports->remoteport->portnum, (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), node->nodeguid, node->upport, node->ports->lid, node->nodedesc); -#else - if (dumplevel == 1) - printf("[%d] -> %s {%016Lx}[%d]\n", - node->ports->remoteport->portnum, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->upport); - else - printf("[%d] -> %s 0x%Lx[%d] lid 0x%x \"%s\"\n", - node->ports->remoteport->portnum, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->upport, - node->ports->lid, node->nodedesc); -#endif } if (node->dist < 0) /* target node */ -#if __WORDSIZE == 64 - printf("To %s 0x%lx port %d lid 0x%x-0x%x \"%s\"\n", + printf("To %s 0x%" PRIx64 " port %d lid 0x%x-0x%x \"%s\"\n", (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), node->nodeguid, node->ports->portnum, node->ports->lid, node->ports->lid + (1 << node->ports->lmc) - 1, node->nodedesc); -#else - printf("To %s 0x%Lx port %d lid 0x%x-0x%x \"%s\"\n", - (node->type <= IB_NODE_MAX ? 
node_type_str[node->type] : "???"), - node->nodeguid, node->ports->portnum, node->ports->lid, - node->ports->lid + (1 << node->ports->lmc) - 1, - node->nodedesc); -#endif } static void diff --git a/diags/src/sminfo.c b/diags/src/sminfo.c index 98e2ed7..c01f195 100644 --- a/diags/src/sminfo.c +++ b/diags/src/sminfo.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #define __BUILD_VERSION_TAG__ 1.1 @@ -218,13 +219,8 @@ main(int argc, char **argv) mad_decode_field(sminfo, IB_SMINFO_PRIO_F, &prio); mad_decode_field(sminfo, IB_SMINFO_STATE_F, &state); -#if __WORDSIZE == 64 - printf("sminfo: sm lid %d sm guid 0x%lx, activity count %d priority %d state %d %s\n", + printf("sminfo: sm lid %d sm guid 0x%" PRIx64 ", activity count %d priority %d state %d %s\n", portid.lid, guid, act, prio, state, STATESTR(state)); -#else - printf("sminfo: sm lid %d sm guid 0x%Lx, activity count %d priority %d state %d %s\n", - portid.lid, guid, act, prio, state, STATESTR(state)); -#endif exit(0); } -- 1.4.4.2.gfc82d From eitan at sw053.yok.mtl.com Wed Dec 27 21:24:47 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Thu, 28 Dec 2006 07:24:47 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-28:normal completion Message-ID: <200612280524.kBS5OlmA014141@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Wed_Dec_27_12:30:42_2006 61a6c6 ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe Total=374 Pass=374 Fail=0 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmTest IS1-16.topo 42 Multicast IS1-16.topo 41 OsmStress IS1-16.topo 39 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo Failures: From jackm at dev.mellanox.co.il Thu Dec 28 00:09:32 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 28 Dec 2006 10:09:32 +0200 Subject: 
[openib-general] No resource tracking per qp for multicast groups
In-Reply-To: <4592D0E4.2010006@ichips.intel.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com> <458FC332.1010801@voltaire.com> <4592D0E4.2010006@ichips.intel.com>
Message-ID: <200612281009.32750.jackm@dev.mellanox.co.il>

On Wednesday 27 December 2006 22:00, Sean Hefty wrote:
> I do not know if this problem was solved however.
>
The problem was solved in ofed 1.1. See file core/uverbs_main.c,
procedure ib_uverbs_cleanup_ucontext():

    list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
            struct ib_qp *qp = uobj->object;
            struct ib_uqp_object *uqp =
                    container_of(uobj, struct ib_uqp_object, uevent.uobject);

            idr_remove_uobj(&ib_uverbs_qp_idr, uobj);
-->         ib_uverbs_detach_umcast(qp, uqp);
            ib_destroy_qp(qp);
            ib_uverbs_release_uevent(file, &uqp->uevent);
            kfree(uqp);
    }

- Jack

From spreggna at novell.com Thu Dec 28 00:23:07 2006
From: spreggna at novell.com (Preggna S)
Date: Thu, 28 Dec 2006 01:23:07 -0700
Subject: [openib-general] [PATCH][TRIVIAL] srp_tools: trivial log message fix
Message-ID: <4593CC39.9947.00D4.0@novell.com>

Trivial log message fix...

Signed-off-by: Preggna S
--

--- src/userspace/srptools/srp_daemon/srp_daemon.c 2006-11-20 11:54:23.000000000 +0530
+++ src_srpt_fixed/userspace/srptools/srp_daemon/srp_daemon.c 2006-12-27 15:40:50.000000000 +0530
@@ -1236,7 +1236,7 @@ int recalc(struct umad_resources *umad_r
 	umad_res->sm_lid = strtol(val, NULL, 0);

 	if (umad_res->sm_lid == 0) {
-		pr_err("SM LID is 0, maybe no opesm is running\n");
+		pr_err("SM LID is 0, maybe no opensm is running\n");
 		return -1;
 	}

From dotanb at dev.mellanox.co.il Thu Dec 28 00:35:38 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 28 Dec 2006 10:35:38 +0200
Subject: [openib-general] [rdma_ucm] enabling the rdma_ucm and restarting the driver several times causes kernel oops
Message-ID: <459381DA.7030007@dev.mellanox.co.il>

Hi Sean.
When I enabled rdma_ucm (on the trunk driver) and restarted the driver
several times (using openibd restart), I got a kernel oops.

Here is more info on this issue:

*************************************************************
Host Architecture : x86_64
Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
Kernel Version : 2.6.19-smp
GCC Version : gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
Memory size : 4041240 kB
Driver Version : gen2_devel-20061226-1730
HCA ID(s) : mthca0
HCA model(s) : 25218
FW version(s) : 5.1.940
Board(s) : MT_0150000001
*************************************************************

Here is the backtrace from /var/log/messages:

Dec 27 15:36:25 sw086 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP:
Dec 27 15:36:25 sw086 kernel: [<0000000000000001>]
Dec 27 15:36:25 sw086 kernel: PGD 11f4c3067 PUD 11fed7067 PMD 0
Dec 27 15:36:25 sw086 kernel: Oops: 0000 [1] SMP
Dec 27 15:36:25 sw086 kernel: CPU 1
Dec 27 15:36:25 sw086 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_mthca ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core tg3 sg ext3 jbd sd_mod
Dec 27 15:36:25 sw086 kernel: Pid: 11363, comm: udev Not tainted 2.6.19-smp #1
Dec 27 15:36:25 sw086 kernel: RIP: 0010:[<0000000000000001>] [<0000000000000001>]
Dec 27 15:36:25 sw086 kernel: RSP: 0018:ffff81012017dec0 EFLAGS: 00010282
Dec 27 15:36:25 sw086 kernel: RAX: 0000000000000002 RBX: ffff810116af9f50 RCX: 0000000000000000
Dec 27 15:36:25 sw086 kernel: RDX: ffffffff80364eea RSI: 00000000ffffffff RDI: ffff81011fdf0a01
Dec 27 15:36:25 sw086 kernel: RBP: ffff81011bf49740 R08: 00000000fffffffb R09: 0000000000000000
Dec 27 15:36:25 sw086 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81011fdf0a10
Dec 27 15:36:25 sw086 kernel: R13: ffff81012017df50 R14: ffffffff80507f10 R15: ffffffff8826ada0
Dec 27 15:36:25 sw086 kernel: FS: 00002b1118f8cde0(0000) GS:ffff810123477c40(0000) knlGS:0000000000000000
Dec 27 15:36:25 sw086 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 27 15:36:25 sw086 kernel: CR2: 0000000000000001 CR3: 0000000120893000 CR4: 00000000000006e0
Dec 27 15:36:25 sw086 kernel: Process udev (pid: 11363, threadinfo ffff81012017c000, task ffff8101168e7140)
Dec 27 15:36:25 sw086 kernel: Stack: ffffffff802b5c7a 0000000000000101 0000000000001000 000000000064a970
Dec 27 15:36:25 sw086 kernel: 0000000000001000 0000000000000000 ffff81011b983800 000000000064a970
Dec 27 15:36:25 sw086 kernel: ffff81012017df50 0000000000615a80 ffffffff80275727 ffff81011b983800
Dec 27 15:36:25 sw086 kernel: Call Trace:
Dec 27 15:36:25 sw086 kernel: [] sysfs_read_file+0xaf/0x142
Dec 27 15:36:25 sw086 kernel: [] vfs_read+0xd1/0x172
Dec 27 15:36:25 sw086 kernel: [] sys_read+0x45/0x6e
Dec 27 15:36:25 sw086 kernel: [] system_call+0x7e/0x83
Dec 27 15:36:25 sw086 kernel:
Dec 27 15:36:25 sw086 kernel:
Dec 27 15:36:25 sw086 kernel: Code: Bad RIP value.
Dec 27 15:36:25 sw086 kernel: RIP [<0000000000000001>]
Dec 27 15:36:25 sw086 kernel: RSP
Dec 27 15:36:25 sw086 kernel: CR2: 0000000000000001

thanks
Dotan

From ogerlitz at voltaire.com Thu Dec 28 02:10:02 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:10:02 +0200
Subject: [openib-general] No resource tracking per qp for multicast groups
In-Reply-To: <200612281009.32750.jackm@dev.mellanox.co.il>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com> <458FC332.1010801@voltaire.com> <4592D0E4.2010006@ichips.intel.com> <200612281009.32750.jackm@dev.mellanox.co.il>
Message-ID: <459397FA.4000306@voltaire.com>

Jack Morgenstein wrote:
> On Wednesday 27 December 2006 22:00, Sean Hefty wrote:
>> I do not know if this problem was solved however.
>>
> The problem was solved in ofed 1.1.
See file core/uverbs_main.c,
> procedure ib_uverbs_cleanup_ucontext():
>
>     list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
>             struct ib_qp *qp = uobj->object;
>             struct ib_uqp_object *uqp =
>                     container_of(uobj, struct ib_uqp_object, uevent.uobject);
>
>             idr_remove_uobj(&ib_uverbs_qp_idr, uobj);
> -->         ib_uverbs_detach_umcast(qp, uqp);
>             ib_destroy_qp(qp);
>             ib_uverbs_release_uevent(file, &uqp->uevent);
>             kfree(uqp);
>     }

OK, Jack, I see now that your patch fixing this was committed by Linus in Nov 2005
(http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f4e401562c11c7ca65592ebd749353cf0b19af7b)

Or.

From ogerlitz at voltaire.com Thu Dec 28 02:25:13 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:25:13 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4592817B.3030700@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com> <4592817B.3030700@mellanox.co.il>
Message-ID: <45939B89.9020305@voltaire.com>

Eitan Zahavi wrote:
> Or Gerlitz wrote:
>>> Assuming the value M represents the lowest MTU on the path
>>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>>> R represents the MTU value in the request. Similarly R-1 is one below
>>> R and R+1 is one above R.
>>>
>>> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>>> -----------------------------------------------------------------------------------------
>>> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
>>> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
>>> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
>>> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR

>> 1st maybe it's clear to everyone except me, but what do you mean by
>> /ERR in the table above, is it what opensm would return before the
>> patch you suggested?

> By ERR I mean that the path being evaluated is rejected from being
> included in the paths group of the response to the provided query.

so when you say "X if some relation holds on (Y,Z) /ERR" you mean that it
"should return X but if r(Y,Z) holds return no record", and this is how the
code is written with the patch?

>> 2nd can you post the open sm tavor quirk patch?
>>
> What do you mean? The old patch introducing the "opensm quirk" mode?
> It is GIT versions: 86077144ed956ddb32a0f8d067d5bb00fd564ac6 followed by
> 03e3b3a6fa934202c0f4270a2c69d64ac486b1ca
> or SVN: 9497 followed by 9518

OK, thanks. I guess you mean the svn trunk, or is it the ofed 1.1
branch? It would be cool if you sent a pointer to the SVN...

Or.

From ogerlitz at voltaire.com Thu Dec 28 02:26:28 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:26:28 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061227174245.GC5377@mellanox.co.il>
References: <45927D3A.9030502@voltaire.com> <20061227174245.GC5377@mellanox.co.il>
Message-ID: <45939BD4.90204@voltaire.com>

Michael S. Tsirkin wrote:
>> 3rd Eitan/Michael: what is the bigger picture here? what is the
>> dependency between these four patches
>
> In short, [2] is an independent fix to improve tavor performance.
> Other things are not directly related. Detail below.
>
>> +1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
>> +2 osm: tavor quirk
>> +3 IB/rdmacm: tavor quirk
>> +4 IB/ipoib: use appropriate mtu selector for path queries
>
> In the above:
> [1] is a bug fix I think. It is not required for [2].
> [2] is a feature that improves performance for tavor without need for
> any other stack/ULP changes
> [3] is a hack that should have same effect as [2] for old SMs, but it needs
> manual tuning by user. If activated, it unfortunately triggers a bug in opensm
> that [1] fixes. So it might not be a good idea after all.
> [4] is not strictly necessary, and not related to this patch set -
> it just happens to also play with MTU selector.
> It is a strict compliance cleanup that I just happened to notice when
> I invented [2].
>

OK, Michael, thanks for the clarifications.

Or.

From eitan at mellanox.co.il Thu Dec 28 02:46:03 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 28 Dec 2006 12:46:03 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45939B89.9020305@voltaire.com>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com> <4592817B.3030700@mellanox.co.il> <45939B89.9020305@voltaire.com>
Message-ID: <4593A06B.4010706@mellanox.co.il>

Or Gerlitz wrote:
> Eitan Zahavi wrote:
>
>> Or Gerlitz wrote:
>>
>>>> Assuming the value M represents the lowest MTU on the path
>>>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>>>> R represents the MTU value in the request. Similarly R-1 is one below
>>>> R and R+1 is one above R.
>>>>
>>>> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>>>> -----------------------------------------------------------------------------------------
>>>> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
>>>> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
>>>> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
>>>> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
>>>>
>
>
>>> 1st maybe it's clear to everyone except me, but what do you mean by
>>> /ERR in the table above, is it what opensm would return before the
>>> patch you suggested?
>>>
>
>
>> By ERR I mean that the path being evaluated is rejected from being
>> included in the paths group of the response to the provided query.
>>
>
> so when you say
>
> "X if some relation holds on (Y,Z) /ERR"
>
> you mean that it "should return X but if r(Y,Z) holds return no record"
> and this is how the code is written with the patch?
>
>
No:
R if M>=R /ERR means:
Return R if M is greater than or equal to R, or else this path does not match
the request.

R+1 if M>R /ERR means:
Return R+1 if M is greater than R, or else this path does not match the
request.

If no paths match the request, the response depends on the query method:
For Get(PathRecord) you will get an error.
For GetTable(PathRecord) you will get zero returned records.
For GetMulti(MultiPathRecord) you should get zero returned records.

EZ
>>> 2nd can you post the open sm tavor quirk patch?
>>>
>>>
>> What do you mean? The old patch introducing the "opensm quirk" mode?
>> It is GIT versions: 86077144ed956ddb32a0f8d067d5bb00fd564ac6 followed by
>> 03e3b3a6fa934202c0f4270a2c69d64ac486b1ca
>> or SVN: 9497 followed by 9518
>>
>
> OK, thanks. I guess you mean the svn trunk, or is it the ofed 1.1
> branch? It would be cool if you sent a pointer to the SVN...
>
This is trunk
> Or.
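
The selector rules from the table and the explanation above can be sketched as a small C function. This is only an illustration under stated assumptions, not OpenSM's actual code: `resolve_mtu`, the `mtu_sel` enum names, and the `tavor_quirk` flag are hypothetical, MTU values use the IB encoding 1..5 (256..4096 bytes), and -1 stands for the /ERR case (the path does not match the request). Treating a "<" request for the smallest MTU as a non-match is also an assumption, since the table does not cover it.

```c
/* IB MTU encoding: 1=256, 2=512, 3=1024, 4=2048, 5=4096 bytes. */
enum { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* Hypothetical selector names; these are not the PathRecord wire values. */
enum mtu_sel { SEL_UNDEFINED, SEL_LESS, SEL_EQUAL, SEL_GREATER };

#define NO_MATCH (-1)	/* the "/ERR" case: path is rejected */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * m:           lowest MTU on the path (M in the table)
 * r:           MTU in the request (R in the table), ignored if undefined
 * tavor_quirk: nonzero to apply the "Tavor end port" column
 */
int resolve_mtu(int m, int r, enum mtu_sel sel, int tavor_quirk)
{
	int mtu;

	switch (sel) {
	case SEL_UNDEFINED:
		mtu = m;
		break;
	case SEL_LESS:
		if (r <= MTU_256)	/* assumption: nothing below 256 */
			return NO_MATCH;
		mtu = min_int(r - 1, m);
		break;
	case SEL_EQUAL:
		/* same result with or without the quirk */
		return m >= r ? r : NO_MATCH;
	case SEL_GREATER:
		/* same result with or without the quirk */
		return m > r ? r + 1 : NO_MATCH;
	default:
		return NO_MATCH;
	}

	/* The quirk only caps the "undefined" and "<" rows at 1K. */
	if (tavor_quirk)
		mtu = min_int(mtu, MTU_1024);
	return mtu;
}
```

For example, with M=2K and an undefined selector this returns 2K normally and 1K with the quirk enabled, while an "=" request for 2K on a path whose M is 1K returns no match.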
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From ogerlitz at voltaire.com Thu Dec 28 02:50:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:50:05 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4593A06B.4010706@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com> <20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com> <458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com> <4592817B.3030700@mellanox.co.il> <45939B89.9020305@voltaire.com> <4593A06B.4010706@mellanox.co.il>
Message-ID: <4593A15D.6020608@voltaire.com>

Eitan Zahavi wrote:
> Or Gerlitz wrote:
>> so when you say
>> "X if some relation holds on (Y,Z) /ERR"
>> you mean that it "should return X but if r(Y,Z) holds return no
>> record" and this is how the code is written with the patch?
> No:
> R if M>=R /ERR means:
> Return R if M is greater than or equal to R, or else this path does not match
> the request.
>
> R+1 if M>R /ERR means:
> Return R+1 if M is greater than R, or else this path does not match the
> request.

Got it, thanks.

>> OK, thanks. I guess you mean the svn trunk, or is it the ofed 1.1
>> branch? It would be cool if you sent a pointer to the SVN...
>>
> This is trunk

OK

Or.

From ogerlitz at voltaire.com Thu Dec 28 02:57:33 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:57:33 +0200
Subject: [openib-general] SVN deprecation
In-Reply-To:
References:
Message-ID: <4593A31D.50808@voltaire.com>

Jeff Squyres wrote:
> I propose "svn rm"'ing unused trees in the SVN repository and leaving
> README files indicating that everything has moved to git (remember:
> everything is still available via the SVN history). If no one has
> any objections, I'll do this on Friday, 5 Jan 2007.

> KEEP the following:
> - /gen2/branches/1.1: by request (Tziporet)

> REMOVE the following:
> - /gen2/src: everything should now be in git (*** IS THIS RIGHT?!?!)

I guess you are referring to gen2/trunk/src. Please, no. Let's leave these
sources with a README stating they are unmaintained, along with the
gen2/branches/1.1 sources, at least for the dev/release cycle of OFED 1.2.

Or.

From halr at voltaire.com Thu Dec 28 06:27:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Dec 2006 09:27:19 -0500
Subject: [openib-general] [PATCH] diags: fix loops handling in ibnetdiscover
In-Reply-To: <20061227230915.GF11268@sashak.voltaire.com>
References: <20061227230915.GF11268@sashak.voltaire.com>
Message-ID: <1167316029.29620.142536.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 18:09, Sasha Khapyorsky wrote:
> This fixes loop cabling and loopback connections handling in
> ibnetdiscover.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From halr at voltaire.com Thu Dec 28 06:29:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Dec 2006 09:29:27 -0500
Subject: [openib-general] [PATCH] diags: eliminate __WORDSIZE ifdefs for printing
In-Reply-To: <20061227231017.GG11268@sashak.voltaire.com>
References: <20061227231017.GG11268@sashak.voltaire.com>
Message-ID: <1167316051.29620.142538.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 18:10, Sasha Khapyorsky wrote:
> Use portable PRIx64 macro in printf format strings instead of using
> '#if __WORDSIZE == 64' with printf style functions.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.
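
For reference, the PRIx64 idiom described in the quoted commit message can be sketched as below. This is a minimal illustration, not code from the patch: `format_guid` is a hypothetical helper, and it assumes the GUID is held in a `uint64_t`, which is the type `PRIx64` is defined for.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit GUID as "0x" plus 16 hex digits.
 * PRIx64 expands to the correct length modifier for uint64_t on the
 * build target, so no '#if __WORDSIZE == 64' branch is needed.
 * Returns the value of snprintf().
 */
int format_guid(char *buf, size_t len, uint64_t guid)
{
	return snprintf(buf, len, "0x%016" PRIx64, guid);
}
```

The macro is a string literal, so it is concatenated with the neighbouring pieces of the format string at compile time; that is why the format is written as `"0x%016" PRIx64` with the closing quote before the macro.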
-- Hal From halr at voltaire.com Thu Dec 28 07:06:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2006 10:06:50 -0500 Subject: [openib-general] [PATCH] OpenSM: Remove use of osm_svn_revision.h Message-ID: <1167318395.29620.144439.camel@hal.voltaire.com> OpenSM: Remove use of osm_svn_revision.h Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index 3ef246c..aed60d7 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -10,27 +10,6 @@ DBGFLAGS = -g endif if OSMV_OPENIB -BUILT_SOURCES = $(srcdir)/../include/opensm/osm_svn_revision.h -.PHONY: always -$(srcdir)/../include/opensm/osm_svn_revision.h: always - echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - if test '!' -d '$(srcdir)/.svn'; then \ - echo -n Exported revision >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - else \ - svnversion -n $(srcdir)/.. >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - fi ; \ - echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \ - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ - then \ - rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \ - else \ - mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \ - $(srcdir)/../include/opensm/osm_svn_revision.h ; \ - fi -endif - -if OSMV_OPENIB libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 else libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 diff --git a/osm/opensm/main.c b/osm/opensm/main.c index bc916ab..ee09db0 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -54,9 +54,6 @@ #include #include #include -#ifdef OSM_VENDOR_INTF_OPENIB -#include -#endif #include #include @@ -599,10 +596,6 @@ main( printf("-------------------------------------------------\n"); 
printf("%s\n", OSM_VERSION); -#if defined ( OSM_VENDOR_INTF_OPENIB ) - if (strlen(OSM_SVN_REVISION)) - printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION); -#endif osm_subn_set_default_opt(&opt); osm_subn_parse_conf_file(&opt); diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 9cac636..0061193 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -57,9 +57,6 @@ #include #include #include -#ifdef OSM_VENDOR_INTF_OPENIB -#include -#endif #include #include #include @@ -204,33 +201,12 @@ osm_opensm_init( if( status != IB_SUCCESS ) return ( status ); -#ifndef OSM_VENDOR_INTF_OPENIB /* If there is a log level defined - add the OSM_VERSION to it. */ osm_log( &p_osm->log, osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", OSM_VERSION ); /* Write the OSM_VERSION to the SYS_LOG */ osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ -#else - if (strlen(OSM_SVN_REVISION)) - { - /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */ - osm_log( &p_osm->log, - osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n", - OSM_VERSION, OSM_SVN_REVISION ); - /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */ - osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION ); /* Format Waived */ - } - else - { - /* If there is a log level defined - add the OSM_VERSION to it. 
*/ - osm_log( &p_osm->log, - osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n", - OSM_VERSION ); - /* Write the OSM_VERSION to the SYS_LOG */ - osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION ); /* Format Waived */ - } -#endif osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */ From tziporet at dev.mellanox.co.il Thu Dec 28 07:31:57 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 28 Dec 2006 17:31:57 +0200 Subject: [openib-general] SVN deprecation In-Reply-To: References: Message-ID: <4593E36D.6020001@dev.mellanox.co.il> Jeff Squyres wrote: > I propose "svn rm"'ing unused trees in the SVN repository and leaving > README files indicating that everything has moved to git (remember: > everything is still available via the SVN history). If no one has > any objections, I'll do this on Friday, 5 Jan 2007. > > ** PLEASE READ THE FOLLOWING CAREFULLY and send in your comments! > Otherwise, things may disappear from SVN that you didn't expect. > > UNKNOWN whether to keep or remove: > (i.e., they seem to have "recent" development) > ============================================== > > DEVELOPER MTIME PATH > --------- -------- ---------------------------------- > dotanb Dec 2006 /trunk/contrib/mellanox > vlad Dec 2006 /gen2/trunk/ofed > swise Oct 2006 /gen2/branches/iwarp > hnguyen Sep 2006 /trunk/contrib/ibm > amitk Sep 2006 /gen2/branches/1.0 > vlad Sep 2006 /gen2/branches/ofed_fixes > monil Sep 2006 /gen2/branches/backport > woody Sep 2006 /gen2/branches/backport-to-2.6.9 > halr May 2006 /gen2/branches/ibat > mst Jul 2006 /gen2/branches/mellanox_fixes > > KEEP the following: > =================== > > - /gen2/branches/1.1: by request (Tziporet) > > REMOVE the following: > ===================== > > In short, everything will be removed except what was listed above. > However, to be explicit, some more entries are listed below. 
> > (*) entries mean "everything except what was already listed above" > > Remove these trees based on the fact that they haven't changed in a > long time: > > MTIME PATH > --------- ------------------------------ > Apr 2006 /trunk/contrib/* > Apr 2006 /trunk/branches/* > Apr 2006 /gen2/ulps > Apr 2006 /gen2/branches/* > Mar 2006 /gen2/users > May 2005 /gen1 > Jan 2005 /gen2/trunk/arch > Dec 2004 /gen2/utils > Nov 2004 /gen2/trunk/scripts > Jul 2004 /tags > Apr 2004 /trunk/openib > > > There are some important directories under /trunk/contrib/mellanox so please don't remove them: gen1/ib_srpt - this is the srp target code Mellanox opened - Vu can you open a git tree with it instead? ibtp/ - these are tests we posted - Dotan - can you create git tree for the tests Please also save gen2/branches/1.0/ since it was used for 1.0 release thanks Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From mshefty at ichips.intel.com Thu Dec 28 09:16:02 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 28 Dec 2006 09:16:02 -0800 Subject: [openib-general] [rdma_ucm] enabling the rdma_ucm and restarting the driver several times causes kernel oops In-Reply-To: <459381DA.7030007@dev.mellanox.co.il> References: <459381DA.7030007@dev.mellanox.co.il> Message-ID: <4593FBD2.4000109@ichips.intel.com> Dotan Barak wrote: > here is the backtrace from the /var/log/messages: > Dec 27 15:36:25 sw086 kernel: Unable to handle kernel NULL pointer > dereference at 0000000000000001 RIP: > Dec 27 15:36:25 sw086 kernel: [<0000000000000001>] > Dec 27 15:36:25 sw086 kernel: PGD 11f4c3067 PUD 11fed7067 PMD 0 > Dec 27 15:36:25 sw086 kernel: Oops: 0000 [1] SMP > Dec 27 15:36:25 sw086 kernel: CPU 1 > Dec 27 15:36:25 sw086 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm > iw_cm ib_addr ib_ipoib ib_mthca ib_umad ib_ucm ib_u > verbs ib_cm ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp > parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod > 
button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core tg3 sg > ext3 jbd sd_mod Can you narrow down which module unload is causing the issue? Is anything using the rdma_ucm or ib_uverbs? Is ib_sdp the first module unloaded? - Sean From robert.j.woodruff at intel.com Thu Dec 28 11:46:57 2006 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 28 Dec 2006 11:46:57 -0800 Subject: [openib-general] SVN deprecation Message-ID: Jeff Squyres wrote: > I propose "svn rm"'ing unused trees in the SVN repository and leaving > README files indicating that everything has moved to git (remember: > everything is still available via the SVN history). If no one has > any objections, I'll do this on Friday, 5 Jan 2007. Please keep this woody Sep 2006 /gen2/branches/backport-to-2.6.9 until I find out if anyone is still using the old backport patches and RPMS. These were not moved to git and there are no plans to move them to git. woody From halr at voltaire.com Thu Dec 28 13:12:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 28 Dec 2006 16:12:18 -0500 Subject: [openib-general] [PATCH] OpenSM/osm_sa_lft_record.c: In __osm_lftr_rcv_by_comp_mask, when BlockNum component is wildcarded, fix max_block calculation Message-ID: <1167340337.29620.163416.camel@hal.voltaire.com> OpenSM/osm_sa_lft_record.c: In __osm_lftr_rcv_by_comp_mask, when BlockNum component is wildcarded, fix max_block calculation Signed-off-by: Hal Rosenstock diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c index 7d37074..46bebf2 100644 --- a/osm/opensm/osm_sa_lft_record.c +++ b/osm/opensm/osm_sa_lft_record.c @@ -226,7 +226,6 @@ __osm_lftr_rcv_by_comp_mask( osm_port_t* p_port; uint16_t min_lid_ho, max_lid_ho; uint16_t min_block, max_block, block; - uint16_t lids_per_block; const osm_physp_t* p_physp; /* In switches, the port guid is the node guid. 
*/ @@ -283,10 +282,9 @@ __osm_lftr_rcv_by_comp_mask( } else { - /* use as many blocks as possible */ + /* use as many blocks as "in use" */ min_block = 0; - lids_per_block = osm_fwd_tbl_get_lids_per_block( osm_switch_get_fwd_tbl_ptr( p_sw ) ); - max_block = (max_lid_ho + lids_per_block - 1)/lids_per_block; + max_block = osm_switch_get_max_block_id_in_use(p_sw); } /* so we can add these blocks one by one ... */ From Leonid.Grossman at neterion.com Thu Dec 28 13:24:09 2006 From: Leonid.Grossman at neterion.com (Leonid Grossman) Date: Thu, 28 Dec 2006 16:24:09 -0500 Subject: [openib-general] one vs. two drivers for an iWARP-capable Ethernet NIC Message-ID: <78C9135A3D2ECE4B8162EBDCE82CAD77010FB433@nekter> Jeff/Roland/all, What is the preferred submission driver model for an iWARP-capable Ethernet NIC - two separate drivers (Ethernet and OpenFabrics) that interact with each other, or a single driver that supports both OpenFabrics and Ethernet interfaces? For our hardware we can go either way, although in case of separate drivers the interface between the two would get somewhat artificial... Thanks, Leonid From Leonid.Grossman at neterion.com Thu Dec 28 13:31:13 2006 From: Leonid.Grossman at neterion.com (Leonid Grossman) Date: Thu, 28 Dec 2006 16:31:13 -0500 Subject: [openib-general] one vs. two drivers for an iWARP-capable Ethernet NIC Message-ID: <78C9135A3D2ECE4B8162EBDCE82CAD77010FB436@nekter> Re-sending as a plain text to reach netdev. Sorry for the extra traffic, please ignore the earlier html version of this e-mail... ------------------------------------------------------------ Jeff/Roland/all, What is the preferred submission driver model for an iWARP-capable Ethernet NIC - two separate drivers (Ethernet and OpenFabrics) that interact with each other, or a single driver that supports both OpenFabrics and Ethernet interfaces?
For our hardware we can go either way, although in case of separate drivers the interface between the two would get somewhat artificial... Thanks, Leonid From mshefty at ichips.intel.com Thu Dec 28 14:46:57 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 28 Dec 2006 14:46:57 -0800 Subject: [openib-general] rdma-dev git tree updated to 2.6.20-rc2 Message-ID: <45944961.3050402@ichips.intel.com> My git tree has been updated to help support OFED 1.2 testing. From mshefty at ichips.intel.com Thu Dec 28 15:23:01 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 28 Dec 2006 15:23:01 -0800 Subject: [openib-general] [PATCH] librdmacm Pass back the status or errno in RDMA CM events. In-Reply-To: <20061215225606.22765.18276.stgit@dell3.ogc.int> References: <20061215225606.22765.18276.stgit@dell3.ogc.int> Message-ID: <459451D5.8000705@ichips.intel.com> Steve Wise wrote: > The librdmacm code isn't passing back the errno in all events. > > For example, if a connection request times out the kernel CMA will pass > up event RDMA_CM_EVENT_UNREACHABLE with the status set to -ETIMEDOUT. > This errno isn't currently passed back to the librdmacm user in the event. Thanks - committed. - Sean From mshefty at ichips.intel.com Thu Dec 28 15:25:33 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 28 Dec 2006 15:25:33 -0800 Subject: [openib-general] [PATCH] rdma_cm iWARP connection setup timeouts reported as rejects. In-Reply-To: <20061215225017.22628.17881.stgit@dell3.ogc.int> References: <20061215225017.22628.17881.stgit@dell3.ogc.int> Message-ID: <4594526D.1000309@ichips.intel.com> Steve Wise wrote: > The IWCM should report timeouts as event RDMA_CM_EVENT_UNREACHABLE, > not event RDMA_CM_EVENT_REJECTED. > > Signed-off-by: Steve Wise Looks fine to me. Can we pull this into 2.6.20? 
Signed-off-by: Sean Hefty From bunk at stusta.de Thu Dec 28 18:10:09 2006 From: bunk at stusta.de (Adrian Bunk) Date: Fri, 29 Dec 2006 03:10:09 +0100 Subject: [openib-general] [-mm patch] infiniband/ulp/ipoib/ipoib_cm.c: make functions static In-Reply-To: <20061228024237.375a482f.akpm@osdl.org> References: <20061228024237.375a482f.akpm@osdl.org> Message-ID: <20061229021009.GN20714@stusta.de> On Thu, Dec 28, 2006 at 02:42:37AM -0800, Andrew Morton wrote: >... > Changes since 2.6.20-rc1-mm1: >... > git-infiniband.patch >... > git trees >... This patch makes some needlessly global functions static. Signed-off-by: Adrian Bunk --- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) --- linux-2.6.20-rc2-mm1/drivers/infiniband/ulp/ipoib/ipoib_cm.c.old 2006-12-29 01:40:17.000000000 +0100 +++ linux-2.6.20-rc2-mm1/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2006-12-29 01:43:22.000000000 +0100 @@ -56,7 +56,8 @@ u32 remote_mtu; }; -int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); +static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event); static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, dma_addr_t mapping[IPOIB_CM_RX_SG]) @@ -265,7 +266,8 @@ return ret; } -int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event) { struct ipoib_cm_rx *p; struct ipoib_dev_priv *priv; @@ -396,7 +398,7 @@ "for buf %d\n", wr_id); } -void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr) +static void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr) { struct net_device *dev = (struct net_device *) dev_ptr; struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -550,7 +552,7 @@ spin_unlock_irqrestore(&priv->tx_lock, flags); } -void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) +static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr) { struct 
ipoib_cm_tx *tx = tx_ptr; int n, i; @@ -768,7 +770,8 @@ return 0; } -int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, struct ib_sa_path_rec *pathrec) +static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, + struct ib_sa_path_rec *pathrec) { struct ipoib_dev_priv *priv = netdev_priv(p->dev); int ret; @@ -841,7 +844,7 @@ return ret; } -void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) +static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) { struct ipoib_dev_priv *priv = netdev_priv(p->dev); struct ipoib_tx_buf *tx_req; @@ -875,7 +878,8 @@ kfree(p); } -int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, + struct ib_cm_event *event) { struct ipoib_cm_tx *tx = cm_id->context; struct ipoib_dev_priv *priv = netdev_priv(tx->dev); @@ -960,7 +964,7 @@ } } -void ipoib_cm_tx_start(struct work_struct *work) +static void ipoib_cm_tx_start(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.start_task); @@ -1003,7 +1007,7 @@ spin_unlock_irqrestore(&priv->tx_lock, flags); } -void ipoib_cm_tx_reap(struct work_struct *work) +static void ipoib_cm_tx_reap(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.reap_task); From eitan at sw053.yok.mtl.com Thu Dec 28 21:28:56 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Fri, 29 Dec 2006 07:28:56 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-29:normal completion Message-ID: <200612290528.kBT5SuGS015171@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Thu_Dec_28_12:00:53_2006 298216 ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe Total=351 Pass=351 Fail=0 Pass: 39 Stability IS1-16.topo 39 Pkey IS1-16.topo 39 OsmTest IS1-16.topo 39 OsmStress IS1-16.topo 39 Multicast IS1-16.topo 39 LidMgr IS1-16.topo 13 Stability IS3-loop.topo 13 Stability IS3-128.topo 13 Pkey IS3-128.topo 13 OsmTest IS3-loop.topo 13 OsmTest 
IS3-128.topo 13 OsmStress IS3-128.topo 13 Multicast IS3-loop.topo 13 Multicast IS3-128.topo 13 LidMgr IS3-128.topo Failures: From mst at mellanox.co.il Thu Dec 28 21:39:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 29 Dec 2006 07:39:30 +0200 Subject: [openib-general] [-mm patch] infiniband/ulp/ipoib/ipoib_cm.c: make functions static In-Reply-To: <20061229021009.GN20714@stusta.de> References: <20061228024237.375a482f.akpm@osdl.org> <20061229021009.GN20714@stusta.de> Message-ID: <20061229053930.GA4580@mellanox.co.il> > Quoting Adrian Bunk : > Subject: [-mm patch] infiniband/ulp/ipoib/ipoib_cm.c: make functions static > > On Thu, Dec 28, 2006 at 02:42:37AM -0800, Andrew Morton wrote: > >... > > Changes since 2.6.20-rc1-mm1: > >... > > git-infiniband.patch > >... > > git trees > >... > > > This patch makes some needlessly global functions static. > > Signed-off-by: Adrian Bunk Thanks, I'll put this in my tree. -- MST From halr at voltaire.com Fri Dec 29 08:26:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 11:26:47 -0500 Subject: [openib-general] [PATCH] OpenSM/ib_types.h: Add support for SA MFTRecord Message-ID: <1167409604.29620.225320.camel@hal.voltaire.com> OpenSM/ib_types.h: Add support for SA MFTRecord Signed-off-by: Hal Rosenstock diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h index 1770f8d..738bd7f 100644 --- a/osm/include/iba/ib_types.h +++ b/osm/include/iba/ib_types.h @@ -1283,6 +1283,18 @@ ib_class_is_rmpp( #define IB_MAD_ATTR_LFT_RECORD (CL_NTOH16(0x0015)) /**********/ +/****d* IBA Base: Constants/IB_MAD_ATTR_MFT_RECORD +* NAME +* IB_MAD_ATTR_MFT_RECORD +* +* DESCRIPTION +* MulticastForwardingTableRecord attribute (15.2.5.8) +* +* SOURCE +*/ +#define IB_MAD_ATTR_MFT_RECORD (CL_NTOH16(0x0017)) +/**********/ + /****d* IBA Base: Constants/IB_MAD_ATTR_PKEYTBL_RECORD * NAME * IB_MAD_ATTR_PKEYTBL_RECORD @@ -2371,6 +2383,13 @@ typedef struct _ib_path_rec #define IB_LFTR_COMPMASK_LID 
(CL_HTON64(((uint64_t)1)<<0)) #define IB_LFTR_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<1)) +/* MFT Record Masks */ +#define IB_MFTR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) +#define IB_MFTR_COMPMASK_POSITION (CL_HTON64(((uint64_t)1)<<1)) +#define IB_MFTR_COMPMASK_RESERVED1 (CL_HTON64(((uint64_t)1)<<2)) +#define IB_MFTR_COMPMASK_BLOCK (CL_HTON64(((uint64_t)1)<<3)) +#define IB_MFTR_COMPMASK_RESERVED2 (CL_HTON64(((uint64_t)1)<<4)) + /* NodeInfo Record Masks */ #define IB_NR_COMPMASK_LID (CL_HTON64(((uint64_t)1)<<0)) #define IB_NR_COMPMASK_RESERVED1 (CL_HTON64(((uint64_t)1)<<1)) @@ -5530,6 +5549,26 @@ typedef struct _ib_lft_record #include /************/ +/****s* IBA Base: Types/ib_mft_record_t +* NAME +* ib_mft_record_t +* +* DESCRIPTION +* IBA defined MulticastForwardingTableRecord (15.2.5.8) +* +* SYNOPSIS +*/ +#include +typedef struct _ib_mft_record +{ + ib_net16_t lid; + ib_net16_t position_block_num; + uint32_t resv0; + ib_net16_t mft[IB_MCAST_BLOCK_SIZE]; +} PACK_SUFFIX ib_mft_record_t; +#include +/************/ + /****s* IBA Base: Types/ib_switch_info_t * NAME * ib_switch_info_t From halr at voltaire.com Fri Dec 29 08:34:19 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 11:34:19 -0500 Subject: [openib-general] [PATCH] osm: fat-tree documentation In-Reply-To: <1167240747.29620.77561.camel@hal.voltaire.com> References: <45929D0B.3090308@dev.mellanox.co.il> <1167240747.29620.77561.camel@hal.voltaire.com> Message-ID: <1167410047.29620.225730.camel@hal.voltaire.com> On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote: > On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote: > > Hi Hal. > > > > Added fat-tree routing details and some cosmetics in the txt files. > > > > -- > > Yevgeny > > > > Signed-off-by: Yevgeny Kliteynik > > Thanks. Applied. > > A couple of minor questions: > > Should similar text as in current-routing.txt be added to the OpenSM man > page ? 
I took care of making the man page including the fat tree routing information you put into current-routing.txt. The question below is outstanding: > Also, rather than HCA in the below, is CA better (to include TCAs as > well) ? Thanks. -- Hal > -- Hal From halr at voltaire.com Fri Dec 29 08:39:09 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 11:39:09 -0500 Subject: [openib-general] SVN deprecation In-Reply-To: References: Message-ID: <1167410068.29620.225732.camel@hal.voltaire.com> Hi Jeff, On Wed, 2006-12-27 at 12:02, Jeff Squyres wrote: > I propose "svn rm"'ing unused trees in the SVN repository and leaving > README files indicating that everything has moved to git (remember: > everything is still available via the SVN history). If no one has > any objections, I'll do this on Friday, 5 Jan 2007. > > ** PLEASE READ THE FOLLOWING CAREFULLY and send in your comments! > Otherwise, things may disappear from SVN that you didn't expect. > > UNKNOWN whether to keep or remove: > (i.e., they seem to have "recent" development) > ============================================== > > DEVELOPER MTIME PATH > --------- -------- ---------------------------------- > dotanb Dec 2006 /trunk/contrib/mellanox > vlad Dec 2006 /gen2/trunk/ofed > swise Oct 2006 /gen2/branches/iwarp > hnguyen Sep 2006 /trunk/contrib/ibm > amitk Sep 2006 /gen2/branches/1.0 > vlad Sep 2006 /gen2/branches/ofed_fixes > monil Sep 2006 /gen2/branches/backport > woody Sep 2006 /gen2/branches/backport-to-2.6.9 > halr May 2006 /gen2/branches/ibat This can be removed. -- Hal > mst Jul 2006 /gen2/branches/mellanox_fixes > > KEEP the following: > =================== > > - /gen2/branches/1.1: by request (Tziporet) > > REMOVE the following: > ===================== > > In short, everything will be removed except what was listed above. > However, to be explicit, some more entries are listed below. 
> > (*) entries mean "everything except what was already listed above" > > Remove these trees based on the fact that they haven't changed in a > long time: > > MTIME PATH > --------- ------------------------------ > Apr 2006 /trunk/contrib/* > Apr 2006 /trunk/branches/* > Apr 2006 /gen2/ulps > Apr 2006 /gen2/branches/* > Mar 2006 /gen2/users > May 2005 /gen1 > Jan 2005 /gen2/trunk/arch > Dec 2004 /gen2/utils > Nov 2004 /gen2/trunk/scripts > Jul 2004 /tags > Apr 2004 /trunk/openib > > Remove these trees for additional rationale: > > - /branches: it's empty > - /gen2/tags: replaced by OFED and git > - /gen2/src: everything should now be in git (*** IS THIS RIGHT?!?!) > > Comments? From halr at voltaire.com Fri Dec 29 09:05:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 12:05:00 -0500 Subject: [openib-general] [PATCH 0/4] OpenSM: Add optional SA MFTRecord support Message-ID: <1167411898.29620.227395.camel@hal.voltaire.com> OpenSM: Add optional SA MFTRecord support This patch series adds support for the optional SA MFTRecord. Signed-off-by: Hal Rosenstock From halr at voltaire.com Fri Dec 29 09:07:15 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 12:07:15 -0500 Subject: [openib-general] [PATCH 1/4] OpenSM/osm_switch.h: Add some missing multicast table routines Message-ID: <1167411902.29620.227397.camel@hal.voltaire.com> OpenSM/osm_switch.h: Add some missing multicast table routines Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h index 32fc547..71b3c8a 100644 --- a/osm/include/opensm/osm_switch.h +++ b/osm/include/opensm/osm_switch.h @@ -1054,6 +1054,122 @@ osm_switch_set_mft_block( * SEE ALSO *********/ +/****f* OpenSM: Switch/osm_switch_get_mft_block +* NAME +* osm_switch_get_mft_block +* +* DESCRIPTION +* Retrieve a block of multicast port masks from the multicast table. 
+* +* SYNOPSIS +*/ +static inline boolean_t +osm_switch_get_mft_block( + IN osm_switch_t* const p_sw, + IN const uint16_t block_num, + IN const uint8_t position, + OUT ib_net16_t* const p_block ) +{ + CL_ASSERT( p_sw ); + return( osm_mcast_tbl_get_block( &p_sw->mcast_tbl, + block_num, position, p_block ) ); +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to the switch object. +* +* block_num +* [in] Block number (0-511) to set. +* +* position +* [in] Port mask position (0-15) to set. +* +* p_block +* [out] Pointer to the block of port masks stored. +* +* RETURN VALUES +* Returns true if there are more blocks necessary to +* configure all the MLIDs reachable from this switch. +* FALSE otherwise. +* +* NOTES +* +* SEE ALSO +*********/ + +/****f* OpenSM: Switch/osm_switch_get_mft_max_block +* NAME +* osm_switch_get_mft_max_block +* +* DESCRIPTION +* Get the max_block from the associated multicast table. +* +* SYNOPSIS +*/ +static inline uint16_t +osm_switch_get_mft_max_block( + IN osm_switch_t* const p_sw ) +{ + CL_ASSERT( p_sw ); + return( osm_mcast_tbl_get_max_block( &p_sw->mcast_tbl ) ); +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to the switch object. +* +* RETURN VALUE +*/ + +/****f* OpenSM: Switch/osm_switch_get_mft_max_block_in_use +* NAME +* osm_switch_get_mft_max_block_in_use +* +* DESCRIPTION +* Get the max_block_in_use from the associated multicast table. +* +* SYNOPSIS +*/ +static inline uint16_t +osm_switch_get_mft_max_block_in_use( + IN osm_switch_t* const p_sw ) +{ + CL_ASSERT( p_sw ); + return( osm_mcast_tbl_get_max_block_in_use( &p_sw->mcast_tbl ) ); +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to the switch object. +* +* RETURN VALUE +*/ + +/****f* OpenSM: Switch/osm_switch_get_mft_max_position +* NAME +* osm_switch_get_mft_max_position +* +* DESCRIPTION +* Get the max_position from the associated multicast table. 
+* +* SYNOPSIS +*/ +static inline uint8_t +osm_switch_get_mft_max_position( + IN osm_switch_t* const p_sw ) +{ + CL_ASSERT( p_sw ); + return( osm_mcast_tbl_get_max_position( &p_sw->mcast_tbl ) ); +} +/* +* PARAMETERS +* p_sw +* [in] Pointer to the switch object. +* +* RETURN VALUE +*/ + /****f* OpenSM: Switch/osm_switch_recommend_path * NAME * osm_switch_recommend_path From halr at voltaire.com Fri Dec 29 09:11:16 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 12:11:16 -0500 Subject: [openib-general] [PATCH 2/4} OpenSM: Add optional SA MFTRecord support Message-ID: <1167412270.29620.227738.camel@hal.voltaire.com> OpenSM: Add optional SA MFTRecord support Signed-off-by: Hal Rosenstock diff --git a/osm/include/opensm/osm_sa_mft_record.h b/osm/include/opensm/osm_sa_mft_record.h new file mode 100644 index 0000000..f961206 --- /dev/null +++ b/osm/include/opensm/osm_sa_mft_record.h @@ -0,0 +1,280 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Declaration of osm_mftr_rcv_t. + * This object represents the MulticastForwardingTable Receiver object. + * attribute from a switch node. + * This object is part of the OpenSM family of objects. + * + * Environment: + * Linux User Mode + * + */ + +#ifndef _OSM_MFTR_H_ +#define _OSM_MFTR_H_ + +#include +#include +#include +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/Multicast Forwarding Table Receiver +* NAME +* Multicast Forwarding Table Receiver +* +* DESCRIPTION +* The Multicast Forwarding Table Receiver object encapsulates the information +* needed to receive the MulticastForwardingTable attribute from a switch node. +* +* The Multicast Forwarding Table Receiver object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ + +/****s* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_t +* NAME +* osm_mftr_rcv_t +* +* DESCRIPTION +* Multicast Forwarding Table Receiver structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. 
+* +* SYNOPSIS +*/ +typedef struct _osm_mft +{ + osm_subn_t* p_subn; + osm_stats_t* p_stats; + osm_sa_resp_t* p_resp; + osm_mad_pool_t* p_mad_pool; + osm_log_t* p_log; + cl_plock_t* p_lock; + cl_qlock_pool_t pool; +} osm_mftr_rcv_t; +/* +* FIELDS +* p_subn +* Pointer to the Subnet object for this subnet. +* +* p_stats +* Pointer to the statistics. +* +* p_resp +* Pointer to the SA responder. +* +* p_mad_pool +* Pointer to the mad pool. +* +* p_log +* Pointer to the log object. +* +* p_lock +* Pointer to the serializing lock. +* +* pool +* Pool of linkable Multicast Forwarding Table Record objects used to +* generate the query response. +* +* SEE ALSO +* Multicast Forwarding Table Receiver object +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_construct +* NAME +* osm_mftr_rcv_construct +* +* DESCRIPTION +* This function constructs a Multicast Forwarding Table Receiver object. +* +* SYNOPSIS +*/ +void osm_mftr_rcv_construct( + IN osm_mftr_rcv_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to a Multicast Forwarding Table Receiver object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_mftr_rcv_init, osm_mftr_rcv_destroy +* +* Calling osm_mftr_rcv_construct is a prerequisite to calling any other +* method except osm_mftr_rcv_init. +* +* SEE ALSO +* Multicast Forwarding Table Receiver object, osm_mftr_rcv_init, +* osm_mftr_rcv_destroy +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_destroy +* NAME +* osm_mftr_rcv_destroy +* +* DESCRIPTION +* The osm_mftr_rcv_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void osm_mftr_rcv_destroy( + IN osm_mftr_rcv_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. 
+* +* NOTES +* Performs any necessary cleanup of the specified +* Multicast Forwarding Table Receiver object. +* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_mftr_rcv_construct or osm_mftr_rcv_init. +* +* SEE ALSO +* Multicast Forwarding Table Receiver object, osm_mftr_rcv_construct, +* osm_mftr_rcv_init +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_init +* NAME +* osm_mftr_rcv_init +* +* DESCRIPTION +* The osm_mftr_rcv_init function initializes a +* Multicast Forwarding Table Receiver object for use. +* +* SYNOPSIS +*/ +ib_api_status_t osm_mftr_rcv_init( + IN osm_mftr_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ); +/* +* PARAMETERS +* p_rcv +* [in] Pointer to an osm_mftr_rcv_t object to initialize. +* +* p_req +* [in] Pointer to an osm_req_t object. +* +* p_subn +* [in] Pointer to the Subnet object for this subnet. +* +* p_log +* [in] Pointer to the log object. +* +* p_lock +* [in] Pointer to the OpenSM serializing lock. +* +* RETURN VALUES +* CL_SUCCESS if the Multicast Forwarding Table Receiver object was initialized +* successfully. +* +* NOTES +* Allows calling other Multicast Forwarding Table Receiver methods. +* +* SEE ALSO +* Multicast Forwarding Table Receiver object, osm_mftr_rcv_construct, +* osm_mftr_rcv_destroy +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_process +* NAME +* osm_mftr_rcv_process +* +* DESCRIPTION +* Process the MulticastForwardingTable attribute. +* +* SYNOPSIS +*/ +void osm_mftr_rcv_process( + IN osm_mftr_rcv_t* const p_ctrl, + IN const osm_madw_t* const p_madw ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_mftr_rcv_t object. 
+* +* p_madw +* [in] Pointer to the MAD Wrapper containing the MAD +* that contains the switch node's MulticastForwardingTable attribute. +* +* RETURN VALUES +* CL_SUCCESS if the MulticastForwardingTable processing was successful. +* +* NOTES +* This function processes a MulticastForwardingTable attribute. +* +* SEE ALSO +* Multicast Forwarding Table Receiver, Multicast Forwarding Table Response +* Controller +*********/ + +END_C_DECLS + +#endif /* _OSM_MFTR_H_ */ diff --git a/osm/include/opensm/osm_sa_mft_record_ctrl.h b/osm/include/opensm/osm_sa_mft_record_ctrl.h new file mode 100644 index 0000000..a28374d --- /dev/null +++ b/osm/include/opensm/osm_sa_mft_record_ctrl.h @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Declaration of osm_mftr_rcv_ctrl_t. + * This object represents a controller that receives the IBA + * MulticastForwardingTable attribute from a switch. + * This object is part of the OpenSM family of objects. + * + * Environment: + * Linux User Mode + * + */ + +#ifndef _OSM_MFTR_RCV_CTRL_H_ +#define _OSM_MFTR_RCV_CTRL_H_ + +#include +#include +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +/****h* OpenSM/Multicast Forwarding Table Receive Controller +* NAME +* Multicast Forwarding Table Record Receive Controller +* +* DESCRIPTION +* The Multicast Forwarding Table Receive Controller object encapsulates +* the information needed to receive the MulticastFowardingTable attribute +* from a switch node. +* +* The Multicast Forwarding Table Receive Controller object is thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Hal Rosenstock, Voltaire +* +*********/ + +/****s* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_t +* NAME +* osm_mftr_rcv_ctrl_t +* +* DESCRIPTION +* Multicast Forwarding Table Receive Controller structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. 
+* +* SYNOPSIS +*/ +typedef struct _osm_mftr_rcv_ctrl +{ + osm_mftr_rcv_t *p_rcv; + osm_log_t *p_log; + cl_dispatcher_t *p_disp; + cl_disp_reg_handle_t h_disp; +} osm_mftr_rcv_ctrl_t; +/* +* FIELDS +* p_rcv +* Pointer to the Multicast Forwarding Table Receiver object. +* +* p_log +* Pointer to the log object. +* +* p_disp +* Pointer to the Dispatcher. +* +* h_disp +* Handle returned from dispatcher registration. +* +* SEE ALSO +* Multicast Forwarding Table Receive Controller object +* Multicast Forwarding Table Receiver object +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_construct +* NAME +* osm_mftr_rcv_ctrl_construct +* +* DESCRIPTION +* This function constructs a Multicast Forwarding Table Receive +* Controller object. +* +* SYNOPSIS +*/ +void osm_mftr_rcv_ctrl_construct( + IN osm_mftr_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to a Multicast Forwarding Table Receive Controller +* object to construct. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Allows calling osm_mftr_rcv_ctrl_init, osm_mftr_rcv_ctrl_destroy +* +* Calling osm_mftr_rcv_ctrl_construct is a prerequisite to calling any other +* method except osm_mftr_rcv_ctrl_init. +* +* SEE ALSO +* Multicast Forwarding Table Receive Controller object, osm_mftr_rcv_ctrl_init, +* osm_mftr_rcv_ctrl_destroy +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_destroy +* NAME +* osm_mftr_rcv_ctrl_destroy +* +* DESCRIPTION +* The osm_mftr_rcv_ctrl_destroy function destroys the object, releasing +* all resources. +* +* SYNOPSIS +*/ +void osm_mftr_rcv_ctrl_destroy( + IN osm_mftr_rcv_ctrl_t* const p_ctrl ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Performs any necessary cleanup of the specified +* Multicast Forwarding Table Receive Controller object. 
+* Further operations should not be attempted on the destroyed object. +* This function should only be called after a call to +* osm_mftr_rcv_ctrl_construct or osm_mftr_rcv_ctrl_init. +* +* SEE ALSO +* Multicast Forwarding Table Receive Controller object, osm_mftr_rcv_ctrl_construct, +* osm_mftr_rcv_ctrl_init +*********/ + +/****f* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_init +* NAME +* osm_mftr_rcv_ctrl_init +* +* DESCRIPTION +* The osm_mftr_rcv_ctrl_init function initializes a +* Multicast Forwarding Table Receive Controller object for use. +* +* SYNOPSIS +*/ +ib_api_status_t osm_mftr_rcv_ctrl_init( + IN osm_mftr_rcv_ctrl_t* const p_ctrl, + IN osm_mftr_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ); +/* +* PARAMETERS +* p_ctrl +* [in] Pointer to an osm_mftr_rcv_ctrl_t object to initialize. +* +* p_rcv +* [in] Pointer to an osm_mftr_t object. +* +* p_log +* [in] Pointer to the log object. +* +* p_disp +* [in] Pointer to the OpenSM central Dispatcher. +* +* RETURN VALUES +* CL_SUCCESS if the Multicast Forwarding Table Receive Controller object +* was initialized successfully. +* +* NOTES +* Allows calling other Multicast Forwarding Table Receive Controller methods. +* +* SEE ALSO +* Multicast Forwarding Table Receive Controller object, +* osm_mftr_rcv_ctrl_construct, osm_mftr_rcv_ctrl_destroy +*********/ + +END_C_DECLS + +#endif /* _OSM_MFTR_RCV_CTRL_H_ */ diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c new file mode 100644 index 0000000..a415fb9 --- /dev/null +++ b/osm/opensm/osm_sa_mft_record.c @@ -0,0 +1,540 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of osm_mftr_rcv_t. + * This object represents the MulticastForwardingTable Receiver object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define OSM_MFTR_RCV_POOL_MIN_SIZE 32 +#define OSM_MFTR_RCV_POOL_GROW_SIZE 32 + +typedef struct _osm_mftr_item +{ + cl_pool_item_t pool_item; + ib_mft_record_t rec; +} osm_mftr_item_t; + +typedef struct _osm_mftr_search_ctxt +{ + const ib_mft_record_t* p_rcvd_rec; + ib_net64_t comp_mask; + cl_qlist_t* p_list; + osm_mftr_rcv_t* p_rcv; + const osm_physp_t* p_req_physp; +} osm_mftr_search_ctxt_t; + +/********************************************************************** + **********************************************************************/ +void +osm_mftr_rcv_construct( + IN osm_mftr_rcv_t* const p_rcv ) +{ + memset( p_rcv, 0, sizeof(*p_rcv) ); + cl_qlock_pool_construct( &p_rcv->pool ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_mftr_rcv_destroy( + IN osm_mftr_rcv_t* const p_rcv ) +{ + OSM_LOG_ENTER( p_rcv->p_log, osm_mftr_rcv_destroy ); + cl_qlock_pool_destroy( &p_rcv->pool ); + OSM_LOG_EXIT( p_rcv->p_log ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_mftr_rcv_init( + IN osm_mftr_rcv_t* const p_rcv, + IN osm_sa_resp_t* const p_resp, + IN osm_mad_pool_t* const p_mad_pool, + IN osm_subn_t* const p_subn, + IN osm_log_t* const p_log, + IN cl_plock_t* const p_lock ) +{ + ib_api_status_t status; + + OSM_LOG_ENTER( p_log, osm_mftr_rcv_init ); + + osm_mftr_rcv_construct( p_rcv ); + + p_rcv->p_log = p_log; + p_rcv->p_subn = p_subn; + p_rcv->p_lock = p_lock; + p_rcv->p_resp = p_resp; + p_rcv->p_mad_pool = p_mad_pool; + + status = cl_qlock_pool_init( &p_rcv->pool, + OSM_MFTR_RCV_POOL_MIN_SIZE, + 0, + OSM_MFTR_RCV_POOL_GROW_SIZE, 
+ sizeof(osm_mftr_item_t), + NULL, NULL, NULL ); + + OSM_LOG_EXIT( p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static ib_api_status_t +__osm_mftr_rcv_new_mftr( + IN osm_mftr_rcv_t* const p_rcv, + IN osm_switch_t* const p_sw, + IN cl_qlist_t* const p_list, + IN ib_net16_t const lid, + IN uint16_t const block, + IN uint8_t const position ) +{ + osm_mftr_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + uint16_t position_block_num; + + OSM_LOG_ENTER( p_rcv->p_log, __osm_mftr_rcv_new_mftr ); + + p_rec_item = (osm_mftr_item_t*)cl_qlock_pool_get( &p_rcv->pool ); + if( p_rec_item == NULL ) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mftr_rcv_new_mftr: ERR 4A02: " + "cl_qlock_pool_get failed\n" ); + status = IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mftr_rcv_new_mftr: " + "New MulticastForwardingTable: sw 0x%016" PRIx64 + "\n\t\t\t\tblock %u position %u lid 0x%02X\n", + cl_ntoh64( osm_node_get_node_guid( p_sw->p_node ) ), + block, position, cl_ntoh16( lid ) + ); + } + + position_block_num = ((uint16_t)position << 12) | + (block & IB_MCAST_BLOCK_ID_MASK_HO); + + memset( &p_rec_item->rec, 0, sizeof(ib_mft_record_t) ); + + p_rec_item->rec.lid = lid; + p_rec_item->rec.position_block_num = cl_hton16( position_block_num ); + + /* copy the mft block */ + osm_switch_get_mft_block( p_sw, block, position, p_rec_item->rec.mft ); + + cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item ); + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); + return( status ); +} + +/********************************************************************** + **********************************************************************/ +static osm_port_t* +__osm_mftr_get_port_by_guid( + IN osm_mftr_rcv_t* const p_rcv, + IN uint64_t 
port_guid ) +{ + osm_port_t* p_port; + + CL_PLOCK_ACQUIRE(p_rcv->p_lock); + + p_port = (osm_port_t *)cl_qmap_get(&p_rcv->p_subn->port_guid_tbl, + port_guid); + if (p_port == (osm_port_t *)cl_qmap_end(&p_rcv->p_subn->port_guid_tbl)) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mftr_get_port_by_guid ERR 4A04: " + "Invalid port GUID 0x%016" PRIx64 "\n", + port_guid ); + p_port = NULL; + } + + CL_PLOCK_RELEASE(p_rcv->p_lock); + return p_port; +} + +/********************************************************************** + **********************************************************************/ +static void +__osm_mftr_rcv_by_comp_mask( + IN cl_map_item_t* const p_map_item, + IN void* context ) +{ + const osm_mftr_search_ctxt_t* const p_ctxt = + (osm_mftr_search_ctxt_t *)context; + osm_switch_t* const p_sw = (osm_switch_t*)p_map_item; + const ib_mft_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec; + osm_mftr_rcv_t* const p_rcv = p_ctxt->p_rcv; + ib_net64_t const comp_mask = p_ctxt->comp_mask; + const osm_physp_t* const p_req_physp = p_ctxt->p_req_physp; + osm_port_t* p_port; + uint16_t min_lid_ho, max_lid_ho; + uint16_t position_block_num_ho; + uint16_t min_block, max_block, block; + const osm_physp_t* p_physp; + uint8_t min_position, max_position, position; + + /* In switches, the port guid is the node guid. */ + p_port = + __osm_mftr_get_port_by_guid( p_rcv, p_sw->p_node->node_info.port_guid ); + if (! p_port) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mftr_rcv_by_comp_mask: ERR 4A05: " + "Failed to find Port by Node Guid:0x%016" PRIx64 + "\n", + cl_ntoh64( p_sw->p_node->node_info.node_guid ) + ); + return; + } + + /* check that the requester physp and the current physp are under + the same partition. */ + p_physp = osm_port_get_default_phys_ptr( p_port ); + if (! 
p_physp) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "__osm_mftr_rcv_by_comp_mask: ERR 4A06: " + "Failed to find default physical Port by Node Guid:0x%016" PRIx64 + "\n", + cl_ntoh64( p_sw->p_node->node_info.node_guid ) + ); + return; + } + if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_physp )) + return; + + /* get the port 0 of the switch */ + osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho ); + + /* compare the lids - if required */ + if( comp_mask & IB_MFTR_COMPMASK_LID ) + { + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "__osm_mftr_rcv_by_comp_mask: " + "Comparing lid:0x%02X to port lid range: 0x%02X .. 0x%02X\n", + cl_ntoh16( p_rcvd_rec->lid ), min_lid_ho, max_lid_ho + ); + /* ok we are ready for range check */ + if (min_lid_ho > cl_ntoh16(p_rcvd_rec->lid) || + max_lid_ho < cl_ntoh16(p_rcvd_rec->lid)) + return; + } + + position_block_num_ho = cl_ntoh16(p_rcvd_rec->position_block_num); + + /* now we need to decide which blocks to output */ + if( comp_mask & IB_MFTR_COMPMASK_BLOCK ) + { + max_block = min_block = position_block_num_ho & IB_MCAST_BLOCK_ID_MASK_HO; + if (max_block > osm_switch_get_mft_max_block_in_use( p_sw ) ) + return; + } + else + { + /* use as many blocks as needed */ + min_block = 0; + max_block = osm_switch_get_mft_max_block_in_use( p_sw ); + } + + /* need to decide which positions to output */ + if ( comp_mask & IB_MFTR_COMPMASK_POSITION ) + { + min_position = max_position = (position_block_num_ho & 0xF000) >> 12; + if (max_position > osm_switch_get_mft_max_position( p_sw ) ) + return; + } + else + { + /* use as many positions as needed */ + min_position = 0; + max_position = osm_switch_get_mft_max_position( p_sw ); + } + + /* so we can add these one by one ... 
*/ + for (block = min_block; block <= max_block; block++) + for (position = min_position; position <= max_position; position++) + __osm_mftr_rcv_new_mftr( p_rcv, p_sw, p_ctxt->p_list, + osm_port_get_base_lid(p_port), + block, position ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_mftr_rcv_process( + IN osm_mftr_rcv_t* const p_rcv, + IN const osm_madw_t* const p_madw ) +{ + const ib_sa_mad_t* p_rcvd_mad; + const ib_mft_record_t* p_rcvd_rec; + ib_mft_record_t* p_resp_rec; + cl_qlist_t rec_list; + osm_madw_t* p_resp_madw; + ib_sa_mad_t* p_resp_sa_mad; + uint32_t num_rec, pre_trim_num_rec; +#ifndef VENDOR_RMPP_SUPPORT + uint32_t trim_num_rec; +#endif + uint32_t i; + osm_mftr_search_ctxt_t context; + osm_mftr_item_t* p_rec_item; + ib_api_status_t status = IB_SUCCESS; + osm_physp_t* p_req_physp; + + CL_ASSERT( p_rcv ); + + OSM_LOG_ENTER( p_rcv->p_log, osm_mftr_rcv_process ); + + CL_ASSERT( p_madw ); + + p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw ); + p_rcvd_rec = (ib_mft_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad ); + + CL_ASSERT( p_rcvd_mad->attr_id == IB_MAD_ATTR_MFT_RECORD ); + + /* we only support SubnAdmGet and SubnAdmGetTable methods */ + if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) && + (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_mftr_rcv_process: ERR 4A08: " + "Unsupported Method (%s)\n", + ib_get_sa_method_str( p_rcvd_mad->method ) ); + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR ); + goto Exit; + } + + /* update the requester physical port. 
*/ + p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log, + p_rcv->p_subn, + osm_madw_get_mad_addr_ptr(p_madw) ); + if (p_req_physp == NULL) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_mftr_rcv_process: ERR 4A07: " + "Cannot find requester physical port\n" ); + goto Exit; + } + + cl_qlist_init( &rec_list ); + + context.p_rcvd_rec = p_rcvd_rec; + context.p_list = &rec_list; + context.comp_mask = p_rcvd_mad->comp_mask; + context.p_rcv = p_rcv; + context.p_req_physp = p_req_physp; + + cl_plock_acquire( p_rcv->p_lock ); + + /* Go over all switches */ + cl_qmap_apply_func( &p_rcv->p_subn->sw_guid_tbl, + __osm_mftr_rcv_by_comp_mask, + &context ); + + cl_plock_release( p_rcv->p_lock ); + + num_rec = cl_qlist_count( &rec_list ); + + /* + * C15-0.1.30: + * If we do a SubnAdmGet and got more than one record it is an error ! + */ + if (p_rcvd_mad->method == IB_MAD_METHOD_GET) + { + if (num_rec == 0) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + if (num_rec > 1) + { + osm_log( p_rcv->p_log, OSM_LOG_ERROR, + "osm_mftr_rcv_process: ERR 4A09: " + "Got more than one record for SubnAdmGet (%u)\n", + num_rec ); + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_TOO_MANY_RECORDS); + + /* need to set the mem free ... 
*/ + p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list ); + while( p_rec_item != (osm_mftr_item_t*)cl_qlist_end( &rec_list ) ) + { + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list ); + } + + goto Exit; + } + } + + pre_trim_num_rec = num_rec; +#ifndef VENDOR_RMPP_SUPPORT + /* we limit the number of records to a single packet */ + trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_mft_record_t); + if (trim_num_rec < num_rec) + { + osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, + "osm_mftr_rcv_process: " + "Number of records:%u trimmed to:%u to fit in one MAD\n", + num_rec, trim_num_rec ); + num_rec = trim_num_rec; + } +#endif + + osm_log( p_rcv->p_log, OSM_LOG_DEBUG, + "osm_mftr_rcv_process: " + "Returning %u records\n", num_rec ); + + if ((p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) && + (num_rec == 0)) + { + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RECORDS ); + goto Exit; + } + + /* + * Get a MAD to reply. Address of Mad is in the received mad_wrapper + */ + p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool, + p_madw->h_bind, + num_rec * sizeof(ib_mft_record_t) + IB_SA_MAD_HDR_SIZE, + &p_madw->mad_addr ); + + if( !p_resp_madw ) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_mftr_rcv_process: ERR 4A10: " + "osm_mad_pool_get failed\n" ); + + for( i = 0; i < num_rec; i++ ) + { + p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list ); + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + } + + osm_sa_send_error( p_rcv->p_resp, p_madw, + IB_SA_MAD_STATUS_NO_RESOURCES ); + + goto Exit; + } + + p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw ); + + /* + Copy the MAD header back into the response mad. + Set the 'R' bit and the payload length, + Then copy all records from the list into the response payload. 
+ */ + + memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE ); + p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK; + /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */ + p_resp_sa_mad->sm_key = 0; + /* Fill in the offset (paylen will be done by the rmpp SAR) */ + p_resp_sa_mad->attr_offset = + ib_get_attr_offset( sizeof(ib_mft_record_t) ); + + p_resp_rec = (ib_mft_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad ); + +#ifndef VENDOR_RMPP_SUPPORT + /* we support only one packet RMPP - so we will set the first and + last flags for gettable */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + { + p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA; + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE; + } +#else + /* forcefully define the packet as RMPP one */ + if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) + p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE; +#endif + + for( i = 0; i < pre_trim_num_rec; i++ ) + { + p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list ); + /* copy only if not trimmed */ + if (i < num_rec) + { + *p_resp_rec = p_rec_item->rec; + } + cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item ); + p_resp_rec++; + } + + CL_ASSERT( cl_is_qlist_empty( &rec_list ) ); + + status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE ); + if (status != IB_SUCCESS) + { + osm_log(p_rcv->p_log, OSM_LOG_ERROR, + "osm_mftr_rcv_process: ERR 4A11: " + "osm_vendor_send status = %s\n", + ib_get_err_str(status)); + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_rcv->p_log ); +} diff --git a/osm/opensm/osm_sa_mft_record_ctrl.c b/osm/opensm/osm_sa_mft_record_ctrl.c new file mode 100644 index 0000000..cf433a9 --- /dev/null +++ b/osm/opensm/osm_sa_mft_record_ctrl.c @@ -0,0 +1,123 @@ +/* + * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. + * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 1996-2003 Intel Corporation. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of osm_mftr_rcv_ctrl_t. + * This object represents the MulticastForwardingTable request controller object. + * This object is part of the opensm family of objects. 
+ * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +/********************************************************************** + **********************************************************************/ +void +__osm_mftr_rcv_ctrl_disp_callback( + IN void *context, + IN void *p_data ) +{ + /* ignore return status when invoked via the dispatcher */ + osm_mftr_rcv_process( ((osm_mftr_rcv_ctrl_t*)context)->p_rcv, + (osm_madw_t*)p_data ); +} + +/********************************************************************** + **********************************************************************/ +void +osm_mftr_rcv_ctrl_construct( + IN osm_mftr_rcv_ctrl_t* const p_ctrl ) +{ + memset( p_ctrl, 0, sizeof(*p_ctrl) ); + p_ctrl->h_disp = CL_DISP_INVALID_HANDLE; +} + +/********************************************************************** + **********************************************************************/ +void +osm_mftr_rcv_ctrl_destroy( + IN osm_mftr_rcv_ctrl_t* const p_ctrl ) +{ + CL_ASSERT( p_ctrl ); + cl_disp_unregister( p_ctrl->h_disp ); +} + +/********************************************************************** + **********************************************************************/ +ib_api_status_t +osm_mftr_rcv_ctrl_init( + IN osm_mftr_rcv_ctrl_t* const p_ctrl, + IN osm_mftr_rcv_t* const p_rcv, + IN osm_log_t* const p_log, + IN cl_dispatcher_t* const p_disp ) +{ + ib_api_status_t status = IB_SUCCESS; + + OSM_LOG_ENTER( p_log, osm_mftr_rcv_ctrl_init ); + + osm_mftr_rcv_ctrl_construct( p_ctrl ); + p_ctrl->p_log = p_log; + p_ctrl->p_rcv = p_rcv; + p_ctrl->p_disp = p_disp; + + p_ctrl->h_disp = cl_disp_register( + p_disp, + OSM_MSG_MAD_MFT_RECORD, + __osm_mftr_rcv_ctrl_disp_callback, + p_ctrl ); + + if( p_ctrl->h_disp == CL_DISP_INVALID_HANDLE ) + { + osm_log( p_log, OSM_LOG_ERROR, + "osm_mftr_rcv_ctrl_init: ERR 4A01: " + "Dispatcher registration failed\n" ); + status = 
IB_INSUFFICIENT_RESOURCES; + goto Exit; + } + + Exit: + OSM_LOG_EXIT( p_log ); + return( status ); +} From halr at voltaire.com Fri Dec 29 09:12:22 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 12:12:22 -0500 Subject: [openib-general] [PATCH 3/4] OpenSM: Other changes to incorporate optional SA MFTRecord support Message-ID: <1167412341.29620.227807.camel@hal.voltaire.com> OpenSM: Other changes to incorporate optional SA MFTRecord support Signed-off-by: Hal Rosenstock diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am index d051b9a..ea8ab10 100644 --- a/osm/include/Makefile.am +++ b/osm/include/Makefile.am @@ -28,6 +28,7 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sa_service_record_ctrl.h \ $(srcdir)/opensm/osm_pkey_rcv_ctrl.h \ $(srcdir)/opensm/osm_sa_lft_record.h \ + $(srcdir)/opensm/osm_sa_mft_record.h \ $(srcdir)/opensm/osm_resp.h \ $(srcdir)/opensm/osm_partition.h \ $(srcdir)/opensm/osm_slvl_map_rcv_ctrl.h \ @@ -47,6 +48,7 @@ EXTRA_DIST = \ $(srcdir)/opensm/osm_sminfo_rcv_ctrl.h \ $(srcdir)/opensm/osm_sa_pkey_record.h \ $(srcdir)/opensm/osm_sa_lft_record_ctrl.h \ + $(srcdir)/opensm/osm_sa_mft_record_ctrl.h \ $(srcdir)/opensm/osm_inform.h \ $(srcdir)/opensm/osm_path.h \ $(srcdir)/opensm/osm_lin_fwd_rcv.h \ diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h index 3611025..87c943f 100644 --- a/osm/include/opensm/osm_msgdef.h +++ b/osm/include/opensm/osm_msgdef.h @@ -196,6 +196,7 @@ enum OSM_MSG_MAD_GUIDINFO_RECORD, OSM_MSG_MAD_INFORM_INFO_RECORD, OSM_MSG_MAD_SWITCH_INFO_RECORD, + OSM_MSG_MAD_MFT_RECORD, #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) OSM_MSG_MAD_MULTIPATH_RECORD, #endif diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h index ae8d5ac..1508f44 100644 --- a/osm/include/opensm/osm_sa.h +++ b/osm/include/opensm/osm_sa.h @@ -77,6 +77,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -195,6 +196,10 @@ typedef 
struct _osm_sa /* SwitchInfo Query */ osm_sir_rcv_t sir_rcv; osm_sir_rcv_ctrl_t sir_rcv_ctrl; + + /* MulticastForwardingTable Query */ + osm_mftr_rcv_t mftr_rcv; + osm_mftr_rcv_ctrl_t mftr_rcv_ctrl; } osm_sa_t; /* * FIELDS diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am index aed60d7..8f42387 100644 --- a/osm/opensm/Makefile.am +++ b/osm/opensm/Makefile.am @@ -43,7 +43,8 @@ opensm_SOURCES = main.c osm_console.c os osm_resp.c osm_sa.c osm_sa_class_port_info.c \ osm_sa_class_port_info_ctrl.c osm_sa_informinfo.c \ osm_sa_informinfo_ctrl.c osm_sa_lft_record.c \ - osm_sa_lft_record_ctrl.c osm_sa_link_record.c \ + osm_sa_lft_record_ctrl.c osm_sa_mft_record.c \ + osm_sa_mft_record_ctrl.c osm_sa_link_record.c \ osm_sa_link_record_ctrl.c osm_sa_mad_ctrl.c \ osm_sa_mcmember_record.c osm_sa_mcmember_record_ctrl.c \ osm_sa_node_record.c osm_sa_node_record_ctrl.c \ diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c index 983d5e5..7a993f1 100644 --- a/osm/opensm/osm_sa.c +++ b/osm/opensm/osm_sa.c @@ -131,6 +131,9 @@ osm_sa_construct( osm_sir_rcv_construct( &p_sa->sir_rcv ); osm_sir_rcv_ctrl_construct( &p_sa->sir_rcv_ctrl ); + + osm_mftr_rcv_construct( &p_sa->mftr_rcv ); + osm_mftr_rcv_ctrl_construct( &p_sa->mftr_rcv_ctrl ); } /********************************************************************** @@ -163,6 +166,7 @@ osm_sa_shutdown( osm_pkey_rec_rcv_ctrl_destroy( &p_sa->pkey_rec_rcv_ctrl ); osm_lftr_rcv_ctrl_destroy( &p_sa->lftr_rcv_ctrl ); osm_sir_rcv_ctrl_destroy( &p_sa->sir_rcv_ctrl ); + osm_mftr_rcv_ctrl_destroy( &p_sa->mftr_rcv_ctrl ); osm_sa_mad_ctrl_destroy( &p_sa->mad_ctrl ); OSM_LOG_EXIT( p_sa->p_log ); @@ -195,6 +199,7 @@ osm_sa_destroy( osm_pkey_rec_rcv_destroy( &p_sa->pkey_rec_rcv ); osm_lftr_rcv_destroy( &p_sa->lftr_rcv ); osm_sir_rcv_destroy( &p_sa->sir_rcv ); + osm_mftr_rcv_destroy( &p_sa->mftr_rcv ); osm_sa_resp_destroy( &p_sa->resp ); OSM_LOG_EXIT( p_sa->p_log ); @@ -537,6 +542,24 @@ osm_sa_init( if( status != IB_SUCCESS ) goto Exit; + status = 
osm_mftr_rcv_init( + &p_sa->mftr_rcv, + &p_sa->resp, + p_sa->p_mad_pool, + p_subn, + p_log, + p_lock); + if( status != IB_SUCCESS ) + goto Exit; + + status = osm_mftr_rcv_ctrl_init( + &p_sa->mftr_rcv_ctrl, + &p_sa->mftr_rcv, + p_log, + p_disp ); + if( status != IB_SUCCESS ) + goto Exit; + Exit: OSM_LOG_EXIT( p_log ); return( status ); diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c index 4d7bcbb..84fa016 100644 --- a/osm/opensm/osm_sa_class_port_info.c +++ b/osm/opensm/osm_sa_class_port_info.c @@ -195,7 +195,6 @@ __osm_cpi_rcv_respond( /* we do not support the following optional records: OSM_CAP_IS_SUBN_OPT_RECS_SUP : RandomForwardingTableRecord, - MulticastForwardingTableRecord, ServiceAssociationRecord other optional records supported "under the table" diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c index 90c732d..85d0b2a 100644 --- a/osm/opensm/osm_sa_mad_ctrl.c +++ b/osm/opensm/osm_sa_mad_ctrl.c @@ -216,6 +216,10 @@ __osm_sa_mad_ctrl_process( msg_id = OSM_MSG_MAD_SWITCH_INFO_RECORD; break; + case IB_MAD_ATTR_MFT_RECORD: + msg_id = OSM_MSG_MAD_MFT_RECORD; + break; + #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP) case IB_MAD_ATTR_MULTIPATH_RECORD: msg_id = OSM_MSG_MAD_MULTIPATH_RECORD; From halr at voltaire.com Fri Dec 29 09:12:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 29 Dec 2006 12:12:30 -0500 Subject: [openib-general] [PATCH 4/4] osmtest/osmtest.c: Add SA MFTRecord tests Message-ID: <1167412348.29620.227809.camel@hal.voltaire.com> osmtest/osmtest.c: Add SA MFTRecord tests Signed-off-by: Hal Rosenstock diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 3dd229c..ba42fc6 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -4854,6 +4854,93 @@ osmtest_get_lft_rec_by_lid( IN osmtest_t } /********************************************************************** + * Get MFT record by LID + 
**********************************************************************/ +ib_api_status_t +osmtest_get_mft_rec_by_lid( IN osmtest_t * const p_osmt, + IN ib_net16_t const lid, + IN OUT osmtest_req_context_t * const p_context ) +{ + ib_api_status_t status = IB_SUCCESS; + osmv_user_query_t user; + osmv_query_req_t req; + ib_mft_record_t record; + ib_mad_t *p_mad; + + OSM_LOG_ENTER( &p_osmt->log, osmtest_get_mft_rec_by_lid ); + + if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) ) + { + osm_log( &p_osmt->log, OSM_LOG_VERBOSE, + "osmtest_get_mft_rec_by_lid: " + "Getting MFT record for LID 0x%02X\n", + cl_ntoh16( lid ) ); + } + + /* + * Do a blocking query for this record in the subnet. + * The result is returned in the result field of the caller's + * context structure. + * + * The query structures are locals. + */ + memset( &req, 0, sizeof( req ) ); + memset( &user, 0, sizeof( user ) ); + memset( &record, 0, sizeof( record ) ); + + record.lid = lid; + p_context->p_osmt = p_osmt; + if (lid) + user.comp_mask = IB_MFTR_COMPMASK_LID; + user.attr_id = IB_MAD_ATTR_MFT_RECORD; + user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) ); + user.p_attr = &record; + + req.query_type = OSMV_QUERY_USER_DEFINED; + req.timeout_ms = p_osmt->opt.transaction_timeout; + req.retry_cnt = p_osmt->opt.retry_count; + + req.flags = OSM_SA_FLAGS_SYNC; + req.query_context = p_context; + req.pfn_query_cb = osmtest_query_res_cb; + req.p_query_input = &user; + req.sm_key = 0; + + status = osmv_query_sa( p_osmt->h_bind, &req ); + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_mft_rec_by_lid: ERR 009B: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + goto Exit; + } + + status = p_context->result.status; + + if( status != IB_SUCCESS ) + { + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_mft_rec_by_lid: ERR 009C: " + "ib_query failed (%s)\n", ib_get_err_str( status ) ); + if( status == IB_REMOTE_ERROR ) + { + p_mad = 
osm_madw_get_mad_ptr( p_context->result.p_result_madw ); + osm_log( &p_osmt->log, OSM_LOG_ERROR, + "osmtest_get_mft_rec_by_lid: " + "Remote error = %s\n", + ib_get_mad_status_str( p_mad )); + + status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK ); + } + goto Exit; + } + + Exit: + OSM_LOG_EXIT( &p_osmt->log ); + return ( status ); +} + +/********************************************************************** **********************************************************************/ static ib_api_status_t osmtest_sminfo_record_request( @@ -5933,6 +6020,17 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* MFT Record tests */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_mft_rec_by_lid( p_osmt, 0, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_mft_rec_by_lid( p_osmt, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + /* Some LinkRecord tests */ /* FromLID */ memset( &context, 0, sizeof( context ) ); @@ -6288,6 +6386,12 @@ osmtest_validate_against_db( IN osmtest_ if ( status != IB_SUCCESS ) goto Exit; + /* Another MFT Record test */ + memset( &context, 0, sizeof( context ) ); + status = osmtest_get_mft_rec_by_lid( p_osmt, test_lid, &context ); + if ( status != IB_SUCCESS ) + goto Exit; + /* More LinkRecord tests */ /* FromLID */ memset( &context, 0, sizeof( context ) ); From sean.hefty at intel.com Fri Dec 29 14:21:28 2006 From: sean.hefty at intel.com (Hefty, Sean) Date: Fri, 29 Dec 2006 14:21:28 -0800 Subject: [openib-general] [PATCH] rdma_cm: avoid port reuse after close Message-ID: Randomize the starting port number, and avoid re-using port values immediately after they are closed. Instead, track the last port value used and increment it every time a new port number is assigned. 
These changes are in response to Michael's comments from this (old) thread:
http://openib.org/pipermail/openib-general/2006-September/025996.html

Signed-off-by: Sean Hefty
---
I'm not sure if this is still needed, but I had it on my list of things to
someday try to do. This should apply to 2.6.20-rc2.

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 533193d..23fdc45 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -71,6 +71,7 @@ static struct workqueue_struct *cma_wq;
 static DEFINE_IDR(sdp_ps);
 static DEFINE_IDR(tcp_ps);
 static DEFINE_IDR(udp_ps);
+static int next_port;
 
 struct cma_device {
 	struct list_head	list;
@@ -1711,33 +1712,74 @@ static int cma_alloc_port(struct idr *ps
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, start, ret;
+	int port, ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	start = snum ? snum : sysctl_local_port_range[0];
+	do {
+		ret = idr_get_new_above(ps, bind_list, snum, &port);
+	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
+
+	if (ret)
+		goto err1;
+
+	if (port != snum) {
+		ret = -EADDRNOTAVAIL;
+		goto err2;
+	}
+
+	bind_list->ps = ps;
+	bind_list->port = (unsigned short) port;
+	cma_bind_port(bind_list, id_priv);
+	return 0;
+err2:
+	idr_remove(ps, port);
+err1:
+	kfree(bind_list);
+	return ret;
+}
 
+static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
+{
+	struct rdma_bind_list *bind_list;
+	int port, ret;
+
+	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
+	if (!bind_list)
+		return -ENOMEM;
+
+retry:
 	do {
-		ret = idr_get_new_above(ps, bind_list, start, &port);
+		ret = idr_get_new_above(ps, bind_list, next_port, &port);
 	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
 
 	if (ret)
-		goto err;
+		goto err1;
 
-	if ((snum && port != snum) ||
-	    (!snum && port > sysctl_local_port_range[1])) {
-		idr_remove(ps, port);
+	if (port > sysctl_local_port_range[1]) {
+		if (next_port != sysctl_local_port_range[0]) {
+			idr_remove(ps, port);
+			next_port = sysctl_local_port_range[0];
+			goto retry;
+		}
 		ret = -EADDRNOTAVAIL;
-		goto err;
+		goto err2;
 	}
 
+	if (port == sysctl_local_port_range[1])
+		next_port = sysctl_local_port_range[0];
+	else
+		next_port = port + 1;
+
 	bind_list->ps = ps;
 	bind_list->port = (unsigned short) port;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err:
+err2:
+	idr_remove(ps, port);
+err1:
 	kfree(bind_list);
 	return ret;
 }
@@ -1800,7 +1842,7 @@ static int cma_get_port(struct rdma_id_p
 
 	mutex_lock(&lock);
 	if (cma_any_port(&id_priv->id.route.addr.src_addr))
-		ret = cma_alloc_port(ps, id_priv, 0);
+		ret = cma_alloc_any_port(ps, id_priv);
 	else
 		ret = cma_use_port(ps, id_priv);
 	mutex_unlock(&lock);
@@ -2437,6 +2479,10 @@ static int cma_init(void)
 {
 	int ret;
 
+	get_random_bytes(&next_port, sizeof next_port);
+	next_port = (next_port % (sysctl_local_port_range[1] -
+				  sysctl_local_port_range[0])) +
+		    sysctl_local_port_range[0];
 	cma_wq = create_singlethread_workqueue("rdma_cm_wq");
 	if (!cma_wq)
 		return -ENOMEM;

From eitan at sw053.yok.mtl.com Fri Dec 29 21:25:38 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Sat, 30 Dec 2006 07:25:38 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion
Message-ID: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>

OSM Simulation Regression Summary

OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe

Total=405 Pass=330 Fail=75

Pass:
45 Stability IS1-16.topo
45 Pkey IS1-16.topo
45 OsmStress IS1-16.topo
45 Multicast IS1-16.topo
45 LidMgr IS1-16.topo
15 Stability IS3-loop.topo
15 Stability IS3-128.topo
15 Pkey IS3-128.topo
15 OsmStress IS3-128.topo
15 Multicast IS3-loop.topo
15 Multicast IS3-128.topo
15 LidMgr IS3-128.topo

Failures:
45 OsmTest IS1-16.topo
15 OsmTest IS3-loop.topo
15 OsmTest IS3-128.topo

From halr at voltaire.com Sat Dec 30 04:09:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Dec 2006 07:09:18
-0500 Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion In-Reply-To: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com> References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com> Message-ID: <1167480536.29620.286425.camel@hal.voltaire.com> Hi Eitan, On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote: > OSM Simulation Regression Summary > OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 > ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe > Total=405 Pass=330 Fail=75 > > Pass: > 45 Stability IS1-16.topo > 45 Pkey IS1-16.topo > 45 OsmStress IS1-16.topo > 45 Multicast IS1-16.topo > 45 LidMgr IS1-16.topo > 15 Stability IS3-loop.topo > 15 Stability IS3-128.topo > 15 Pkey IS3-128.topo > 15 OsmStress IS3-128.topo > 15 Multicast IS3-loop.topo > 15 Multicast IS3-128.topo > 15 LidMgr IS3-128.topo > > Failures: > 45 OsmTest IS1-16.topo > 15 OsmTest IS3-loop.topo > 15 OsmTest IS3-128.topo Any idea on these osmtest failures ? I did add SA MFTRecord yesterday and made a change to SA LFTRecord and SwitchInfoRecord the day before as well as additional osmtests for MFTRecord and LFTRecord. Also, why are osmtest failures allowed for "normal completion" ? -- Hal From eitan at mellanox.co.il Sat Dec 30 13:03:25 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 30 Dec 2006 23:03:25 +0200 Subject: [openib-general] [PATCH] osm: fat-tree documentation In-Reply-To: <1167410047.29620.225730.camel@hal.voltaire.com> References: <45929D0B.3090308@dev.mellanox.co.il> <1167240747.29620.77561.camel@hal.voltaire.com> <1167410047.29620.225730.camel@hal.voltaire.com> Message-ID: <4596D41D.3080607@mellanox.co.il> Hal Rosenstock wrote: > On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote: > >> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote: >> >>> Hi Hal. >>> >>> Added fat-tree routing details and some cosmetics in the txt files. >>> >>> -- >>> Yevgeny >>> >>> Signed-off-by: Yevgeny Kliteynik >>> >> Thanks. Applied. 
>> >> A couple of minor questions: >> >> Should similar text as in current-routing.txt be added to the OpenSM man >> page ? >> > > I took care of making the man page including the fat tree routing > information you put into current-routing.txt. > > The question below is outstanding: > > >> Also, rather than HCA in the below, is CA better (to include TCAs as >> well) ? >> > > I agree CA is better then HCA. Hal, can you take it or want a patch? > Thanks. > > -- Hal > > >> -- Hal >> > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From kliteyn at dev.mellanox.co.il Sat Dec 30 13:07:10 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 30 Dec 2006 23:07:10 +0200 Subject: [openib-general] [PATCH] osm: fat-tree documentation In-Reply-To: <1167410047.29620.225730.camel@hal.voltaire.com> References: <45929D0B.3090308@dev.mellanox.co.il> <1167240747.29620.77561.camel@hal.voltaire.com> <1167410047.29620.225730.camel@hal.voltaire.com> Message-ID: <4596D4FE.4000307@dev.mellanox.co.il> Hal Rosenstock wrote: > On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote: >> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote: >>> Hi Hal. >>> >>> Added fat-tree routing details and some cosmetics in the txt files. >>> >>> -- >>> Yevgeny >>> >>> Signed-off-by: Yevgeny Kliteynik >> Thanks. Applied. >> >> A couple of minor questions: >> >> Should similar text as in current-routing.txt be added to the OpenSM man >> page ? > > I took care of making the man page including the fat tree routing > information you put into current-routing.txt. Thanks. > The question below is outstanding: > >> Also, rather than HCA in the below, is CA better (to include TCAs as >> well) ? Right, CA is better. -- Yevgeny > Thanks. 
> > -- Hal > >> -- Hal > From eitan at mellanox.co.il Sat Dec 30 13:12:01 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 30 Dec 2006 23:12:01 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion In-Reply-To: <1167480536.29620.286425.camel@hal.voltaire.com> References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com> <1167480536.29620.286425.camel@hal.voltaire.com> Message-ID: <4596D621.80207@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote: > >> OSM Simulation Regression Summary >> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 >> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe >> Total=405 Pass=330 Fail=75 >> >> Pass: >> 45 Stability IS1-16.topo >> 45 Pkey IS1-16.topo >> 45 OsmStress IS1-16.topo >> 45 Multicast IS1-16.topo >> 45 LidMgr IS1-16.topo >> 15 Stability IS3-loop.topo >> 15 Stability IS3-128.topo >> 15 Pkey IS3-128.topo >> 15 OsmStress IS3-128.topo >> 15 Multicast IS3-loop.topo >> 15 Multicast IS3-128.topo >> 15 LidMgr IS3-128.topo >> >> Failures: >> 45 OsmTest IS1-16.topo >> 15 OsmTest IS3-loop.topo >> 15 OsmTest IS3-128.topo >> > > Any idea on these osmtest failures ? I did add SA MFTRecord yesterday > and made a change to SA LFTRecord and SwitchInfoRecord the day before as > well as additional osmtests for MFTRecord and LFTRecord. > I get Dec 30 07:13:20 163508 [B7F1F8E0] -> osmtest_get_sw_info_rec_by_lid: Getting SwitchInfo record for LID 0x01 Dec 30 07:13:20 165737 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting LFT record for LID 0x00 Dec 30 07:13:20 169968 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting LFT record for LID 0x01 Dec 30 07:13:20 172573 [B7F1F8E0] -> osmtest_get_mft_rec_by_lid: Getting MFT record for LID 0x00 Dec 30 07:13:50 182807 [B7F1EBB0] -> __osmv_txn_timeout_cb: ERR 6702: The transaction request (tid=0x26) timed out (after 4 retrie s). Invoking the error callback. 
Dec 30 07:13:50 182964 [B7F1EBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT) I wonder where the LID=0 comes from. Might be a simulation issue but not sure.I will double check tomorrow. > Also, why are osmtest failures allowed for "normal completion" ? > "Normal completion" means completion without resource issues. Unlike "disk full". > -- Hal > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Sat Dec 30 13:21:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2006 16:21:17 -0500 Subject: [openib-general] [PATCH] osm: fat-tree documentation In-Reply-To: <4596D41D.3080607@mellanox.co.il> References: <45929D0B.3090308@dev.mellanox.co.il> <1167240747.29620.77561.camel@hal.voltaire.com> <1167410047.29620.225730.camel@hal.voltaire.com> <4596D41D.3080607@mellanox.co.il> Message-ID: <1167513670.29620.315478.camel@hal.voltaire.com> On Sat, 2006-12-30 at 16:03, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote: > > > >> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote: > >> > >>> Hi Hal. > >>> > >>> Added fat-tree routing details and some cosmetics in the txt files. > >>> > >>> -- > >>> Yevgeny > >>> > >>> Signed-off-by: Yevgeny Kliteynik > >>> > >> Thanks. Applied. > >> > >> A couple of minor questions: > >> > >> Should similar text as in current-routing.txt be added to the OpenSM man > >> page ? > >> > > > > I took care of making the man page including the fat tree routing > > information you put into current-routing.txt. > > > > The question below is outstanding: > > > > > >> Also, rather than HCA in the below, is CA better (to include TCAs as > >> well) ? > >> > > > > > I agree CA is better then HCA. > Hal, can you take it or want a patch? I'll change this. 
-- Hal > > Thanks. > > > > -- Hal > > > > > >> -- Hal > >> > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From halr at voltaire.com Sat Dec 30 13:24:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2006 16:24:29 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion In-Reply-To: <4596D621.80207@mellanox.co.il> References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com> <1167480536.29620.286425.camel@hal.voltaire.com> <4596D621.80207@mellanox.co.il> Message-ID: <1167513866.29620.315610.camel@hal.voltaire.com> On Sat, 2006-12-30 at 16:12, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote: > > > >> OSM Simulation Regression Summary > >> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 > >> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe > >> Total=405 Pass=330 Fail=75 > >> > >> Pass: > >> 45 Stability IS1-16.topo > >> 45 Pkey IS1-16.topo > >> 45 OsmStress IS1-16.topo > >> 45 Multicast IS1-16.topo > >> 45 LidMgr IS1-16.topo > >> 15 Stability IS3-loop.topo > >> 15 Stability IS3-128.topo > >> 15 Pkey IS3-128.topo > >> 15 OsmStress IS3-128.topo > >> 15 Multicast IS3-loop.topo > >> 15 Multicast IS3-128.topo > >> 15 LidMgr IS3-128.topo > >> > >> Failures: > >> 45 OsmTest IS1-16.topo > >> 15 OsmTest IS3-loop.topo > >> 15 OsmTest IS3-128.topo > >> > > > > Any idea on these osmtest failures ? I did add SA MFTRecord yesterday > > and made a change to SA LFTRecord and SwitchInfoRecord the day before as > > well as additional osmtests for MFTRecord and LFTRecord. 
> > > I get > Dec 30 07:13:20 163508 [B7F1F8E0] -> osmtest_get_sw_info_rec_by_lid: > Getting SwitchInfo record for LID 0x01 > Dec 30 07:13:20 165737 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting > LFT record for LID 0x00 > Dec 30 07:13:20 169968 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting > LFT record for LID 0x01 > Dec 30 07:13:20 172573 [B7F1F8E0] -> osmtest_get_mft_rec_by_lid: Getting > MFT record for LID 0x00 > Dec 30 07:13:50 182807 [B7F1EBB0] -> __osmv_txn_timeout_cb: ERR 6702: > The transaction request (tid=0x26) timed out (after 4 retrie > s). Invoking the error callback. > Dec 30 07:13:50 182964 [B7F1EBB0] -> osmtest_query_res_cb: ERR 0003: > Error on query (IB_TIMEOUT) > > I wonder where the LID=0 comes from. This is currently by "design". It is used to wildcard rather than an additional parameter to set the component mask: /* LFT Record tests */ memset( &context, 0, sizeof( context ) ); status = osmtest_get_lft_rec_by_lid( p_osmt, 0, &context ); ... /* MFT Record tests */ memset( &context, 0, sizeof( context ) ); status = osmtest_get_mft_rec_by_lid( p_osmt, 0, &context ); It seems like you might not have rebuilt OpenSM properly though to add the SA MFTRecord handler. > Might be a simulation issue but not > sure.I will double check tomorrow. > > > Also, why are osmtest failures allowed for "normal completion" ? > > > "Normal completion" means completion without resource issues. Unlike > "disk full". OK; I thought normal completion indicated something about success or failure. 
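The LID 0 wildcard convention mentioned above can be sketched roughly as follows. This is only an illustration of the idea, not the osmtest code: `mft_query`, `make_mft_query` and `MFTR_COMPMASK_LID` are hypothetical names, and the bit position is an assumption.

```c
#include <stdint.h>

/* Hypothetical component-mask bit selecting the LID field of the record. */
#define MFTR_COMPMASK_LID (1ULL << 0)

/* Simplified stand-in for the query handed to the SA client. */
struct mft_query {
	uint16_t lid;
	uint64_t comp_mask;
};

/*
 * LID 0 is treated as "match any LID": the LID bit is left out of the
 * component mask, so no extra parameter is needed to ask for a wildcard.
 */
static struct mft_query make_mft_query(uint16_t lid)
{
	struct mft_query q = { .lid = lid, .comp_mask = 0 };

	if (lid != 0)
		q.comp_mask |= MFTR_COMPMASK_LID;
	return q;
}
```

With this shape, a call with LID 0 requests all records and a call with a non-zero LID requests only the records for that LID.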
-- Hal > > -- Hal > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From eitan at mellanox.co.il Sat Dec 30 13:33:57 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 30 Dec 2006 23:33:57 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion In-Reply-To: <1167480536.29620.286425.camel@hal.voltaire.com> References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com> <1167480536.29620.286425.camel@hal.voltaire.com> Message-ID: <4596DB45.5070108@mellanox.co.il> Hal Rosenstock wrote: > Hi Eitan, > > On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote: > >> OSM Simulation Regression Summary >> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 >> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe >> Total=405 Pass=330 Fail=75 >> >> Pass: >> 45 Stability IS1-16.topo >> 45 Pkey IS1-16.topo >> 45 OsmStress IS1-16.topo >> 45 Multicast IS1-16.topo >> 45 LidMgr IS1-16.topo >> 15 Stability IS3-loop.topo >> 15 Stability IS3-128.topo >> 15 Pkey IS3-128.topo >> 15 OsmStress IS3-128.topo >> 15 Multicast IS3-loop.topo >> 15 Multicast IS3-128.topo >> 15 LidMgr IS3-128.topo >> >> Failures: >> 45 OsmTest IS1-16.topo >> 15 OsmTest IS3-loop.topo >> 15 OsmTest IS3-128.topo >> > > Any idea on these osmtest failures ? I did add SA MFTRecord yesterday > and made a change to SA LFTRecord and SwitchInfoRecord the day before as > well as additional osmtests for MFTRecord and LFTRecord. 
> Actually I get a core dump: #0 0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, block_num=-32575, position=0 '\0', p_block=0xb19e4d2c) at osm_mcast_tbl.c:299 299 p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position]; (gdb) p i $1 = 2 (gdb) p mlid_start_ho $2 = 6176 (gdb) p position $3 = 0 '\0' (gdb) where #0 0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, block_num=-32575, position=0 '\0', p_block=0xb19e4d2c) at osm_mcast_tbl.c:299 #1 0x08073d29 in osm_switch_get_mft_block (p_sw=0x8f6eed8, block_num=32961, position=0 '\0', p_block=0xb19e4d2c) at ./../include/opensm/osm_switch.h:1074 #2 0x08073b8c in __osm_mftr_rcv_new_mftr (p_rcv=0x80e9a6c, p_sw=0x8f6eed8, p_list=0xb61c0370, lid=512, block=32961, position=0 '\0') at osm_sa_mft_record.c:181 #3 0x08074273 in __osm_mftr_rcv_by_comp_mask (p_map_item=0x8f6eed8, context=0xb61c0330) at osm_sa_mft_record.c:317 #4 0x00cd9747 in cl_qmap_apply_func (p_map=0x80e8584, pfn_func=0x8073f98 <__osm_mftr_rcv_by_comp_mask>, context=0xb61c0330) at cl_map.c:287 #5 0x08074653 in osm_mftr_rcv_process (p_rcv=0x80e9a6c, p_madw=0x8f29f0c) at osm_sa_mft_record.c:390 #6 0x08074ef2 in __osm_mftr_rcv_ctrl_disp_callback (context=0x80e9afc, p_data=0x8f29f0c) at osm_sa_mft_record_ctrl.c:63 #7 0x00cd3d4f in __cl_disp_worker (context=0x80e9d18) at cl_dispatcher.c:102 #8 0x00ce1297 in __cl_thread_pool_routine (context=0x80e9d5c) at cl_threadpool.c:74 #9 0x00ce0f61 in __cl_thread_wrapper (arg=0x8f1c690) at cl_thread.c:58 #10 0x00361371 in start_thread () from /lib/tls/libpthread.so.0 #11 0x001eaffe in clone () from /lib/tls/libc.so.6 > Also, why are osmtest failures allowed for "normal completion" ? 
> > -- Hal > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Sat Dec 30 14:25:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 30 Dec 2006 17:25:00 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion In-Reply-To: <4596DB45.5070108@mellanox.co.il> References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com> <1167480536.29620.286425.camel@hal.voltaire.com> <4596DB45.5070108@mellanox.co.il> Message-ID: <1167517497.29620.318774.camel@hal.voltaire.com> On Sat, 2006-12-30 at 16:33, Eitan Zahavi wrote: > Hal Rosenstock wrote: > > Hi Eitan, > > > > On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote: > > > >> OSM Simulation Regression Summary > >> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 > >> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe > >> Total=405 Pass=330 Fail=75 > >> > >> Pass: > >> 45 Stability IS1-16.topo > >> 45 Pkey IS1-16.topo > >> 45 OsmStress IS1-16.topo > >> 45 Multicast IS1-16.topo > >> 45 LidMgr IS1-16.topo > >> 15 Stability IS3-loop.topo > >> 15 Stability IS3-128.topo > >> 15 Pkey IS3-128.topo > >> 15 OsmStress IS3-128.topo > >> 15 Multicast IS3-loop.topo > >> 15 Multicast IS3-128.topo > >> 15 LidMgr IS3-128.topo > >> > >> Failures: > >> 45 OsmTest IS1-16.topo > >> 15 OsmTest IS3-loop.topo > >> 15 OsmTest IS3-128.topo > >> > > > > Any idea on these osmtest failures ? I did add SA MFTRecord yesterday > > and made a change to SA LFTRecord and SwitchInfoRecord the day before as > > well as additional osmtests for MFTRecord and LFTRecord. > > > Actually I get a core dump: Thanks for providing this! 
> #0 0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, > block_num=-32575, position=0 '\0', p_block=0xb19e4d2c) > at osm_mcast_tbl.c:299 > 299 p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position]; > > (gdb) p i > $1 = 2 > (gdb) p mlid_start_ho > $2 = 6176 > (gdb) p position > $3 = 0 '\0' > (gdb) where > #0 0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, > block_num=-32575, position=0 '\0', p_block=0xb19e4d2c) > at osm_mcast_tbl.c:299 > #1 0x08073d29 in osm_switch_get_mft_block (p_sw=0x8f6eed8, > block_num=32961, position=0 '\0', p_block=0xb19e4d2c) > at ./../include/opensm/osm_switch.h:1074 > #2 0x08073b8c in __osm_mftr_rcv_new_mftr (p_rcv=0x80e9a6c, > p_sw=0x8f6eed8, p_list=0xb61c0370, lid=512, block=32961, ^^^^^ max block number is 511 so this is what caused the core dump. I just checked in a patch for this which should work. -- Hal > position=0 '\0') at osm_sa_mft_record.c:181 > #3 0x08074273 in __osm_mftr_rcv_by_comp_mask (p_map_item=0x8f6eed8, > context=0xb61c0330) at osm_sa_mft_record.c:317 > #4 0x00cd9747 in cl_qmap_apply_func (p_map=0x80e8584, > pfn_func=0x8073f98 <__osm_mftr_rcv_by_comp_mask>, context=0xb61c0330) > at cl_map.c:287 > #5 0x08074653 in osm_mftr_rcv_process (p_rcv=0x80e9a6c, > p_madw=0x8f29f0c) at osm_sa_mft_record.c:390 > #6 0x08074ef2 in __osm_mftr_rcv_ctrl_disp_callback (context=0x80e9afc, > p_data=0x8f29f0c) at osm_sa_mft_record_ctrl.c:63 > #7 0x00cd3d4f in __cl_disp_worker (context=0x80e9d18) at > cl_dispatcher.c:102 > #8 0x00ce1297 in __cl_thread_pool_routine (context=0x80e9d5c) at > cl_threadpool.c:74 > #9 0x00ce0f61 in __cl_thread_wrapper (arg=0x8f1c690) at cl_thread.c:58 > #10 0x00361371 in start_thread () from /lib/tls/libpthread.so.0 > #11 0x001eaffe in clone () from /lib/tls/libc.so.6 > > > > Also, why are osmtest failures allowed for "normal completion" ? 
> > > > -- Hal > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From eitan at sw053.yok.mtl.com Sat Dec 30 21:01:23 2006 From: eitan at sw053.yok.mtl.com (Eitan Zahavi) Date: Sun, 31 Dec 2006 07:01:23 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-31:normal completion Message-ID: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com> OSM Simulation Regression Summary OpenSM rev = Fri_Dec_29_16:01:04_2006 0ccdf3 ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe Total=378 Pass=308 Fail=70 Pass: 42 Stability IS1-16.topo 42 Pkey IS1-16.topo 42 OsmStress IS1-16.topo 42 Multicast IS1-16.topo 42 LidMgr IS1-16.topo 14 Stability IS3-loop.topo 14 Stability IS3-128.topo 14 Pkey IS3-128.topo 14 OsmStress IS3-128.topo 14 Multicast IS3-loop.topo 14 Multicast IS3-128.topo 14 LidMgr IS3-128.topo Failures: 42 OsmTest IS1-16.topo 14 OsmTest IS3-loop.topo 14 OsmTest IS3-128.topo From dotanb at dev.mellanox.co.il Sun Dec 31 01:50:52 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Sun, 31 Dec 2006 11:50:52 +0200 (IST) Subject: [openib-general] [rdma_ucm] enabling the rdma_ucm and restarting the driver several times causes kernel oops In-Reply-To: <4593FBD2.4000109@ichips.intel.com> References: <459381DA.7030007@dev.mellanox.co.il> <4593FBD2.4000109@ichips.intel.com> Message-ID: <1296.85.65.224.155.1167558652.squirrel@dev.mellanox.co.il> > Dotan Barak
wrote: >> here is the backtrace from the /var/log/messages: >> Dec 27 15:36:25 sw086 kernel: Unable to handle kernel NULL pointer >> dereference at 0000000000000001 RIP: >> Dec 27 15:36:25 sw086 kernel: [<0000000000000001>] >> Dec 27 15:36:25 sw086 kernel: PGD 11f4c3067 PUD 11fed7067 PMD 0 >> Dec 27 15:36:25 sw086 kernel: Oops: 0000 [1] SMP >> Dec 27 15:36:25 sw086 kernel: CPU 1 >> Dec 27 15:36:25 sw086 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm >> iw_cm ib_addr ib_ipoib ib_mthca ib_umad ib_ucm ib_uverbs >> ib_cm ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp >> parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod >> button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core tg3 sg >> ext3 jbd sd_mod > > Can you narrow down which module unload is causing the issue? Is anything > using the rdma_ucm or ib_uverbs? Is ib_sdp the first module unloaded? > > - Sean > Vlad: can you please tell us what the order of the modules loading/unloading is? As I wrote in the problem description, I only enabled the rdma_ucm module and restarted the driver (without even using this module or any other module in the driver) and I got this kernel oops. Dotan From halr at voltaire.com Sun Dec 31 04:36:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Dec 2006 07:36:18 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-31:normal completion In-Reply-To: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com> References: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com> Message-ID: <1167568561.29620.364215.camel@hal.voltaire.com> Hi Eitan, On Sun, 2006-12-31 at 00:01, Eitan Zahavi wrote: > OSM Simulation Regression Summary > OpenSM rev = Fri_Dec_29_16:01:04_2006 0ccdf3 How can you tell what is the latest git version included ? Is 0ccdf3 short for some SHA1 hash ? Was the latest opensm/osm_sa_mft_record.c change included in this ?
-- Hal > ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe > Total=378 Pass=308 Fail=70 > > Pass: > 42 Stability IS1-16.topo > 42 Pkey IS1-16.topo > 42 OsmStress IS1-16.topo > 42 Multicast IS1-16.topo > 42 LidMgr IS1-16.topo > 14 Stability IS3-loop.topo > 14 Stability IS3-128.topo > 14 Pkey IS3-128.topo > 14 OsmStress IS3-128.topo > 14 Multicast IS3-loop.topo > 14 Multicast IS3-128.topo > 14 LidMgr IS3-128.topo > > Failures: > 42 OsmTest IS1-16.topo > 14 OsmTest IS3-loop.topo > 14 OsmTest IS3-128.topo From mst at mellanox.co.il Sun Dec 31 04:46:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 31 Dec 2006 14:46:11 +0200 Subject: [openib-general] nightly osm_sim report 2006-12-31:normal completion In-Reply-To: <1167568561.29620.364215.camel@hal.voltaire.com> References: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com> <1167568561.29620.364215.camel@hal.voltaire.com> Message-ID: <20061231124610.GC6083@mellanox.co.il> > How can you tell what is is the latest git version included ? Is 0ccdf3 > short for some SHA1 hash ? Try git show 0ccdf3 -- MST From halr at voltaire.com Sun Dec 31 04:54:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 31 Dec 2006 07:54:59 -0500 Subject: [openib-general] nightly osm_sim report 2006-12-31:normal completion In-Reply-To: <20061231124610.GC6083@mellanox.co.il> References: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com> <1167568561.29620.364215.camel@hal.voltaire.com> <20061231124610.GC6083@mellanox.co.il> Message-ID: <1167569697.29620.365179.camel@hal.voltaire.com> On Sun, 2006-12-31 at 07:46, Michael S. Tsirkin wrote: > > How can you tell what is is the latest git version included ? Is 0ccdf3 > > short for some SHA1 hash ? > Try git show 0ccdf3 Thanks. The potential fix is not included in the OpenSM build (as this commit was at the end of Friday rather than Saturday). When are the updates picked up into the build system ? 
-- Hal From mst at mellanox.co.il Sun Dec 31 11:09:42 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 31 Dec 2006 21:09:42 +0200 Subject: [openib-general] [PATCH RFC] return qp pointer as part of ib_wc Message-ID: <20061231190942.GB32485@mellanox.co.il> ib_wc currently only includes the local QP number: this matches the IB spec, but seems mostly useless. The following patch replaces this with the pointer to qp itself, and updates all low level drivers and all users. This has the following advantages: - Ability to get a per-qp context through wc->qp->qp_context - Existing drivers already have the qp pointer ready in poll cq, so this change actually saves a tiny bit (extra memory read) on data path - We will be able to put NULL in there if some hardware does not support reporting the qp number (it is optional in IB spec) - no such option with qpn - Users that need the QP number can still get it through wc->qp->qp_num. Use case: In IPoIB CM code, I have a common CQ shared by multiple QPs. To track connection usage, I need a way to get at some per-QP context upon the completion, and I would like to avoid allocating a context object per work request just to stick a QP pointer into it. With this code, I can just use wc->qp->qp_context. Note: I don't know whether updating the userspace API in a similar way is a good idea. We probably should wait for an actual user; and keeping an extra object pointed to by WR ID might be less of a problem there since virtual memory is cheap. Signed-off-by: Michael S. Tsirkin --- Untested. Please comment.
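The use case above might look roughly like this from a consumer's side. This is a userspace sketch under stated assumptions: the two structs are cut-down stand-ins for the kernel's struct ib_qp and struct ib_wc that keep only the fields named in the description, and `conn_state` and `account_completions` are hypothetical names, not IPoIB code.

```c
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures; only the fields used here. */
struct ib_qp {
	unsigned int qp_num;
	void *qp_context;		/* per-QP pointer owned by the consumer */
};

struct ib_wc {
	unsigned long long wr_id;
	struct ib_qp *qp;		/* replaces the old qp_num field */
};

/* Hypothetical per-connection bookkeeping hung off qp_context. */
struct conn_state {
	unsigned long completions;
};

/*
 * Walk completions polled from one CQ shared by several QPs and charge
 * each one to its connection. No per-work-request context object is
 * needed: the connection is reached through wc->qp->qp_context.
 * Returns the number of completions that could be attributed.
 */
static int account_completions(const struct ib_wc *wc, int n)
{
	int handled = 0;

	for (int i = 0; i < n; i++) {
		struct conn_state *conn;

		if (!wc[i].qp)		/* hardware did not report the QP */
			continue;
		conn = wc[i].qp->qp_context;
		if (!conn)
			continue;
		conn->completions++;
		handled++;
	}
	return handled;
}
```

Code that still wants the number can read `wc[i].qp->qp_num`, as the description notes.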
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 5ed141e..13efd41 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -642,7 +642,8 @@ static void snoop_recv(struct ib_mad_qp_info *qp_info, spin_unlock_irqrestore(&qp_info->snoop_lock, flags); } -static void build_smp_wc(u64 wr_id, u16 slid, u16 pkey_index, u8 port_num, +static void build_smp_wc(struct ib_qp *qp, + u64 wr_id, u16 slid, u16 pkey_index, u8 port_num, struct ib_wc *wc) { memset(wc, 0, sizeof *wc); @@ -652,7 +653,7 @@ static void build_smp_wc(u64 wr_id, u16 slid, u16 pkey_index, u8 port_num, wc->pkey_index = pkey_index; wc->byte_len = sizeof(struct ib_mad) + sizeof(struct ib_grh); wc->src_qp = IB_QP0; - wc->qp_num = IB_QP0; + wc->qp = qp; wc->slid = slid; wc->sl = 0; wc->dlid_path_bits = 0; @@ -713,7 +714,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv, goto out; } - build_smp_wc(send_wr->wr_id, be16_to_cpu(smp->dr_slid), + build_smp_wc(mad_agent_priv->agent.qp, + send_wr->wr_id, be16_to_cpu(smp->dr_slid), send_wr->wr.ud.pkey_index, send_wr->wr.ud.port_num, &mad_wc); @@ -2355,7 +2357,8 @@ static void local_completions(struct work_struct *work) * Defined behavior is to complete response * before request */ - build_smp_wc((unsigned long) local->mad_send_wr, + build_smp_wc(recv_mad_agent->agent.qp, + (unsigned long) local->mad_send_wr, be16_to_cpu(IB_LID_PERMISSIVE), 0, recv_mad_agent->agent.port_num, &wc); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 743247e..df1efbc 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -933,7 +933,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file, resp->wc[i].vendor_err = wc[i].vendor_err; resp->wc[i].byte_len = wc[i].byte_len; resp->wc[i].imm_data = (__u32 __force) wc[i].imm_data; - resp->wc[i].qp_num = wc[i].qp_num; + resp->wc[i].qp_num = wc[i].qp->qp_num; resp->wc[i].src_qp 
= wc[i].src_qp;
 	resp->wc[i].wc_flags = wc[i].wc_flags;
 	resp->wc[i].pkey_index = wc[i].pkey_index;
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..5175c99 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -153,7 +153,7 @@ static inline int c2_poll_one(struct c2_dev *c2dev,
 
 	entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce));
 	entry->wr_id = ce->hdr.context;
-	entry->qp_num = ce->handle;
+	entry->qp = &qp->ibqp;
 	entry->wc_flags = 0;
 	entry->slid = 0;
 	entry->sl = 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..40e39ff 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -579,7 +579,7 @@ poll_cq_one_read_cqe:
 	} else
 		wc->status = IB_WC_SUCCESS;
 
-	wc->qp_num = cqe->local_qp_number;
+	wc->qp = &qp->ibqp;
 	wc->byte_len = cqe->nr_bytes_transferred;
 	wc->pkey_index = cqe->pkey_index;
 	wc->slid = cqe->rlid;
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 46c1c89..64f07b1 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -379,7 +379,7 @@ void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err)
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
 	wc.imm_data = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	wc.pkey_index = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index ce60387..5ff20cb 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -702,7 +702,7 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc)
 	wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 	wc->vendor_err = 0;
 	wc->byte_len = 0;
-	wc->qp_num = qp->ibqp.qp_num;
+	wc->qp = &qp->ibqp;
 	wc->src_qp = qp->remote_qpn;
 	wc->pkey_index = 0;
 	wc->slid = qp->remote_ah_attr.dlid;
@@ -836,7 +836,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode)
 	wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 	wc.vendor_err = 0;
 	wc.byte_len = wqe->length;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = qp->remote_qpn;
 	wc.pkey_index = 0;
 	wc.slid = qp->remote_ah_attr.dlid;
@@ -951,7 +951,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode)
 	wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = qp->remote_qpn;
 	wc.pkey_index = 0;
 	wc.slid = qp->remote_ah_attr.dlid;
@@ -1511,7 +1511,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 	wc.status = IB_WC_SUCCESS;
 	wc.opcode = IB_WC_RECV;
 	wc.vendor_err = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = qp->remote_qpn;
 	wc.pkey_index = 0;
 	wc.slid = qp->remote_ah_attr.dlid;
diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
index f753051..e86cb17 100644
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
@@ -137,7 +137,7 @@ bad_lkey:
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
 	wc.imm_data = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	wc.pkey_index = 0;
@@ -336,7 +336,7 @@ again:
 	wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
-	wc.qp_num = sqp->ibqp.qp_num;
+	wc.qp = &sqp->ibqp;
 	wc.src_qp = sqp->remote_qpn;
 	wc.pkey_index = 0;
 	wc.slid = sqp->remote_ah_attr.dlid;
@@ -426,7 +426,7 @@ again:
 	wc.status = IB_WC_SUCCESS;
 	wc.vendor_err = 0;
 	wc.byte_len = wqe->length;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = qp->remote_qpn;
 	/* XXX do we know which pkey matched? Only needed for GSI. */
 	wc.pkey_index = 0;
@@ -447,7 +447,7 @@ send_comp:
 	wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 	wc.vendor_err = 0;
 	wc.byte_len = wqe->length;
-	wc.qp_num = sqp->ibqp.qp_num;
+	wc.qp = &sqp->ibqp;
 	wc.src_qp = 0;
 	wc.pkey_index = 0;
 	wc.slid = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c
index e636cfd..325d663 100644
--- a/drivers/infiniband/hw/ipath/ipath_uc.c
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c
@@ -49,7 +49,7 @@ static void complete_last_send(struct ipath_qp *qp, struct ipath_swqe *wqe,
 	wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 	wc->vendor_err = 0;
 	wc->byte_len = wqe->length;
-	wc->qp_num = qp->ibqp.qp_num;
+	wc->qp = &qp->ibqp;
 	wc->src_qp = qp->remote_qpn;
 	wc->pkey_index = 0;
 	wc->slid = qp->remote_ah_attr.dlid;
@@ -411,7 +411,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 	wc.status = IB_WC_SUCCESS;
 	wc.opcode = IB_WC_RECV;
 	wc.vendor_err = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = qp->remote_qpn;
 	wc.pkey_index = 0;
 	wc.slid = qp->remote_ah_attr.dlid;
diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c
index 49f1102..9a3e546 100644
--- a/drivers/infiniband/hw/ipath/ipath_ud.c
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c
@@ -66,7 +66,7 @@ bad_lkey:
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
 	wc.imm_data = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	wc.pkey_index = 0;
@@ -255,7 +255,7 @@ static void ipath_ud_loopback(struct ipath_qp *sqp,
 	wc->status = IB_WC_SUCCESS;
 	wc->opcode = IB_WC_RECV;
 	wc->vendor_err = 0;
-	wc->qp_num = qp->ibqp.qp_num;
+	wc->qp = &qp->ibqp;
 	wc->src_qp = sqp->ibqp.qp_num;
 	/* XXX do we know which pkey matched? Only needed for GSI. */
 	wc->pkey_index = 0;
@@ -474,7 +474,7 @@ done:
 	wc.vendor_err = 0;
 	wc.opcode = IB_WC_SEND;
 	wc.byte_len = len;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	/* XXX initialize other fields? */
@@ -651,7 +651,7 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 	wc.status = IB_WC_SUCCESS;
 	wc.opcode = IB_WC_RECV;
 	wc.vendor_err = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = src_qp;
 	/* XXX do we know which pkey matched? Only needed for GSI. */
 	wc.pkey_index = 0;
diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 768df72..968d151 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -1854,7 +1854,7 @@ int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int ignore_bkey,
 
 	memset(inbox + 256, 0, 256);
 
-	MTHCA_PUT(inbox, in_wc->qp_num, MAD_IFC_MY_QPN_OFFSET);
+	MTHCA_PUT(inbox, in_wc->qp->qp_num, MAD_IFC_MY_QPN_OFFSET);
 	MTHCA_PUT(inbox, in_wc->src_qp, MAD_IFC_RQPN_OFFSET);
 
 	val = in_wc->sl << 4;
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 283d50b..5862411 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -530,7 +530,7 @@ static inline int mthca_poll_one(struct mthca_dev *dev,
 		}
 	}
 
-	entry->qp_num = (*cur_qp)->qpn;
+	entry->qp = &(*cur_qp)->ibqp;
 
 	if (is_send) {
 		wq = &(*cur_qp)->sq;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0bfa332..54cde37 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -419,8 +419,8 @@ struct ib_wc {
 	enum ib_wc_opcode opcode;
 	u32 vendor_err;
 	u32 byte_len;
+	struct ib_qp *qp;
 	__be32 imm_data;
-	u32 qp_num;
 	u32 src_qp;
 	int wc_flags;
 	u16 pkey_index;

-- 
MST

From eitan at sw053.yok.mtl.com Sun Dec 31 21:31:14 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Mon, 1 Jan 2007 07:31:14 +0200
Subject: [openib-general] nightly osm_sim report 2007-01-01:normal completion
Message-ID: <200701010531.l015VExN007699@sw053.yok.mtl.com>

OSM Simulation Regression Summary

OpenSM rev = Sat_Dec_30_17:20:32_2006 000033
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe
Total=378 Pass=308 Fail=70

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo

Failures:
42 OsmTest IS1-16.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo

From mst at mellanox.co.il Sun Dec 31 23:03:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 1 Jan 2007 09:03:15 +0200
Subject: [openib-general] v2.6.20-rc2 merged into ofed 1.2
Message-ID: <20070101070315.GB25691@mellanox.co.il>

Upstream v2.6.20-rc2 has been merged into ofed 1.2 branch.
There have been no API changes since -rc1, so no backports need to be updated.

-- 
MST