From ogerlitz at voltaire.com Wed Nov 1 00:42:28 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 01 Nov 2006 10:42:28 +0200 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <15ddcffd0610311245q614fee15g810d0438cbf965fa@mail.gmail.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <4540CA0E.9020807@voltaire.com> <45447D71.40405@voltaire.com> <4545F8E5.2000003@voltaire.com> <45463535.2050302@ichips.intel.com> <45466770.9050107@ichips.intel.com> <45474572.8080109@voltaire.com> <454779BF.2080703@ichips.intel.com> <15ddcffd0610311245q614fee15g810d0438cbf965fa@mail.gmail.com> Message-ID: <45485DF4.5030905@voltaire.com> Or Gerlitz wrote: >> >> root at excell01 librdmacm]# /home/ogerlitz/ib1.1/bin/mckey -m 224.5.5.5 >> >> [root at excell02 src]# /home/ogerlitz/ib1.1/bin/mckey -m 224.5.5.5 -s -C >> >> 10240 -S 1024 >> You need to use the same message parameters (count and size) for both >> sender and >> receiver. I have it working now, for that i had to indeed have both sender and receiver use the same -C and -S options **and** make the post sends with IBV_SEND_FLAG plus wait for the send completions. I guess that doing that only for the last send should work as well. Or. 
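A rough sketch of the last idea above (request a completion only for the final send); it is not taken from mckey itself. It assumes the flag meant is IBV_SEND_SIGNALED and that the QP was created with sq_sig_all = 0. The helper names and the `interval` parameter are made up for illustration. Because unsignaled sends generally release their send queue slots only once a later signaled send has completed, the sketch also signals every `interval`-th send, so a long run (such as -C 10240) does not have to fit in the send queue all at once:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of mckey or libibverbs.
 * Decide whether send number i (0-based) out of count should ask for a
 * completion. In real code a true result would translate to
 *     wr.send_flags = IBV_SEND_SIGNALED;
 * before ibv_post_send(), and 0 otherwise.
 *
 * interval == 0 means "signal only the last send".
 * interval  > 0 additionally signals every interval-th send.
 */
static bool should_signal(int i, int count, int interval)
{
    assert(count > 0 && i >= 0 && i < count);

    if (i == count - 1)
        return true;                      /* always signal the final send */
    return interval > 0 && (i + 1) % interval == 0;
}

/*
 * Hypothetical helper: how many send completions the sender has to reap
 * (for example with ibv_poll_cq() on the send CQ) before it may exit.
 */
static int expected_completions(int count, int interval)
{
    int i, n = 0;

    for (i = 0; i < count; i++)
        if (should_signal(i, count, interval))
            n++;
    return n;
}
```

With interval 0 the sender waits for a single completion after the last post; with interval 64 and 10240 sends it reaps 160 completions along the way. Either way the point is the same as in the message: the sender should not exit until the completion for the last send has been polled.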
From ogerlitz at voltaire.com Wed Nov 1 02:48:12 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 01 Nov 2006 12:48:12 +0200 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <45477E81.3040205@ichips.intel.com> References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> Message-ID: <45487B6C.2070408@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: >> If yes, this seems to me as one big over-doing, assuming the consumer >> always either call XXX_destory_id() OR returns non zero from a >> callback on this ID, there must be away to avoid the race within the >> ID provider module, so at least the api can be saved... > > As long as the user can destroy a cm_id from their callback, the ib_cm > and rdma_cm have this issue. This feature ends up being fairly useful, > so I'm hesitant to remove it. The alternative is that a user must Assuming it is indeed useful and nice feature which we don't want to remove (does someone is aware to any similar example in the kernel where you can delete a resource from its associated callback? ie i am quite sure you are **not** allowed to delete a timer or destroy a socket from within their callbacks). > always call xxx_destroy_id(), but that cannot be done from within the > callback thread itself. This would require a user to schedule a thread > to call destroy, which may not always be possible. (Consider the case > where the cm creates a new id as part of a connection request. For the > user to schedule the destruction, it would need to queue the new cm_id > somewhere, which may not be possible.) what about enhancing xxx_destory_id() to sense that it was called from this id callback context so the xxx module code defers the destory_id() execution to run after the callback is over. 
This can be done by writing at the id the pid of the thread running the callback before going to the consumer and deleting it when the callback returns. Then if in_callback(id) holds, have the destory_id() call schedule itself to later stage, where it checks again etc. At the bottom line, users must call xxx_destory_id() explicitly the xxx module would be able to handle in_callback situations. Or. From mst at mellanox.co.il Wed Nov 1 03:31:37 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 1 Nov 2006 13:31:37 +0200 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <45487B6C.2070408@voltaire.com> References: <45487B6C.2070408@voltaire.com> Message-ID: <20061101113137.GA6515@mellanox.co.il> Quoting r. Or Gerlitz : > what about enhancing xxx_destory_id() to sense that it was called from > this id callback context so the xxx module code defers the destory_id() > execution to run after the callback is over. This can be done by > writing at the id the pid of the thread running the callback before > going to the consumer and deleting it when the callback returns. Then if > in_callback(id) holds, have the destory_id() call schedule itself to > later stage, where it checks again etc. And then you still have the module unloading race. We have the client registration all over the place now - ib_cm was the only one left. -- MST From michael.arndt at informatik.tu-chemnitz.de Wed Nov 1 04:35:51 2006 From: michael.arndt at informatik.tu-chemnitz.de (Michael Arndt) Date: Wed, 1 Nov 2006 13:35:51 +0100 Subject: [openib-general] OSM multiple subnets Message-ID: <000b01c6fdb2$44669f60$21606d86@one7> Hi, there is a comment in the main.c file: /* This is the global opensm object. One opensm object is required per subnet. Future versions could support multiple subnets by instantiating more than one opensm object. */ osm_opensm_t osm; Can I expect this feature to be implemented rather sooner or later? 
Thanks
Michael

From monis at voltaire.com Wed Nov 1 04:54:17 2006
From: monis at voltaire.com (Moni Shoua)
Date: Wed, 01 Nov 2006 14:54:17 +0200
Subject: [openib-general] OFED 1.1 Build Issue
In-Reply-To: <45476C55.1080300@dev.mellanox.co.il>
References: <45470D59.7020705@dev.mellanox.co.il> <45472202.9080104@voltaire.com> <454742C2.2050900@silverstorm.com> <45476C55.1080300@dev.mellanox.co.il>
Message-ID: <454898F9.1080303@voltaire.com>

Vladimir Sokolovsky wrote:
>
> Ramachandra K wrote:
>
>> Moni Shoua wrote:
>
>>
>>> We already tried to go this way and found that a local
>>> Module.symvers is not always generated (but we might have missed
>>> something though).
>>> I suggest that you check that this alternative way works under all
>>> OSs compilation (SuSE and RedHat to be precise)...
>>>
>>>
>> I think Module.symvers generation for external modules was added sometime
>> around 2.6.16, so its not generated on the older kernels (for eg 2.6.9 kernels
>> on RHEL)
>>
>> In this scenario, when there is no Module.symvers file, I guess the other
>> option is to use a single Kbuild file to build both modules,
>> as explained in section 7.3 of Documentation/kbuild/modules.txt.
>>
>> But this may not be feasible always. Come to think of it, why does the
>> OFED installation procedure not update the kernel Module.symvers file
>> when it replaces the old kernel modules present in /lib/modules/
>> with the new ones ?
>>
>>> BTW, Why not updating the kernel Module.symvers when kernel-ib-devel
>>> is installed? This will free the developer from copying it to
>>> his/hers private directory.
>>>
>>>
>> It might be a good idea to update the Module.symvers file as part of the
>> normal installation and not only kernel-ib-devel. Because if the kernel
>> modules are being replaced (or new modules are being added), shouldn't
>> the Module.symvers file also be updated ?
>> Regards,
>> Ram
>
> Agree,
> Module.symvers should be updated by kernel-ib RPM.
> So, need to implement Moni's suggestion with light changes: update
> kernel-ib RPM %post and %preun sections instead of kernel-ib-devel RPM
> %pre and %postun.
>
> Regards,
> Vladimir
>

I agree, although there is no use in an updated Module.symvers when the devel RPM is not installed.
This is part of the shell script that updates Module.symvers, which you can use if you don't find a way to generate Module.symvers in 2.4 kernels:

    # Collect the __crc_* symbols exported by every module under the current directory
    for mod in $(find . -name '*.ko') ; do
        nm -o $mod |grep __crc >> /tmp/syms
        n_mods=$((n_mods+1))
    done
    n_syms=$(wc -l /tmp/syms |cut -f1 -d" ")
    echo found $n_syms InfiniBand symbols in $n_mods InfiniBand modules
    n=1
    MOD_SYMVERS_IB=./Module.symvers.ib
    MOD_SYMVERS_PATCH=./Module.symvers.patch
    # Locate the kernel's own Module.symvers
    if [ -f /lib/modules/$K_VER/source/Module.symvers ] ; then
        MOD_SYMVERS_KERNEL=/lib/modules/$K_VER/source/Module.symvers
    elif [ -f /lib/modules/$K_VER/build/Module.symvers ] ; then
        MOD_SYMVERS_KERNEL=/lib/modules/$K_VER/build/Module.symvers
    else
        echo file Module.symvers not found
    fi
    if [ -n "$MOD_SYMVERS_KERNEL" ] ; then
        rm -f $MOD_SYMVERS_IB
        # Turn each "file:value type __crc_symbol" line into a Module.symvers entry
        while [ $n -le $n_syms ] ; do
            line=$(head -$n /tmp/syms|tail -1)
            line1=$(echo $line|cut -f1 -d:)
            line2=$(echo $line|cut -f2 -d:)
            file=$(echo $line1|cut -f6- -d/)
            file=$(echo $file|cut -f1 -d.)
            crc=$(echo $line2|cut -f1 -d" ")
            crc=${crc:8}        # drop the first 8 characters of the value
            sym=$(echo $line2|cut -f3 -d" ")
            sym=${sym:6}        # drop the "__crc_" prefix
            echo -e "0x$crc\t$sym\t$file" >> $MOD_SYMVERS_IB
            if [ -z "$allsyms" ] ; then
                allsyms=$sym
            else
                allsyms="$allsyms|$sym"
            fi
            n=$((n+1))
        done
        # Keep the kernel entries that the InfiniBand modules do not override,
        # then patch the kernel file and keep a copy of the patch as a backup
        egrep -v "$allsyms" $MOD_SYMVERS_KERNEL >> $MOD_SYMVERS_IB
        diff -u $MOD_SYMVERS_KERNEL $MOD_SYMVERS_IB > $MOD_SYMVERS_PATCH
        patch -d $(dirname $MOD_SYMVERS_KERNEL) < $MOD_SYMVERS_PATCH
        mkdir -p /usr/voltaire/backup
        cp $MOD_SYMVERS_PATCH /usr/voltaire/backup
    fi

From sashak at voltaire.com Wed Nov 1 05:25:57 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 1 Nov 2006 15:25:57 +0200
Subject: [openib-general] OSM multiple subnets
In-Reply-To: <000b01c6fdb2$44669f60$21606d86@one7>
References: <000b01c6fdb2$44669f60$21606d86@one7>
Message-ID: <20061101132557.GA22214@sashak.voltaire.com>

On 13:35 Wed 01 Nov , Michael Arndt wrote:
> Hi,
>
> there is a comment in the main.c file:
>
> /*
> This is the global opensm object.
> One opensm object is required per subnet.
> Future versions could support multiple subnets by
> instantiating more than one opensm object.
> */
> osm_opensm_t osm;
>
> Can I expect this feature to be implemented rather sooner or later?

You can run more than one opensm; bound to ports on different subnets, these instances will serve multiple subnets. Is that what you are looking for?

In this mode you may want to use the -f option and the OSM_CACHE_DIR environment variable in order to separate the log and cache files of the multiple opensm instances.
Sasha From halr at voltaire.com Wed Nov 1 05:19:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Nov 2006 08:19:30 -0500 Subject: [openib-general] OSM multiple subnets In-Reply-To: <000b01c6fdb2$44669f60$21606d86@one7> References: <000b01c6fdb2$44669f60$21606d86@one7> Message-ID: <1162387169.29957.51056.camel@hal.voltaire.com> On Wed, 2006-11-01 at 07:35, Michael Arndt wrote: > Hi, > > there is a comment in the main.c file: > > /* > This is the global opensm object. > One opensm object is required per subnet. > Future versions could support multiple subnets by > instantiating more than one opensm object. > */ > osm_opensm_t osm; > > Can I expect this feature to be implemented rather sooner or later? There is no current plan to work on this as far as I know. I'm not sure whether this would be the way to go or whether separate OpenSM's (one per subnet) is the way to go. I may be wrong but I think the latter approach might work now (using different HCA ports and ensuring the two subnets were not interconnected (by switches)). -- Hal > Thanks Michael > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Wed Nov 1 05:29:45 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 01 Nov 2006 08:29:45 -0500 Subject: [openib-general] {PATCH] OpenSM: Add option for force SDR link speed Message-ID: <1162387784.29957.51482.camel@hal.voltaire.com> OpenSM: Add option for force SDR link speed Add option to opensm.opts to force link speed. Currently, only forcing to SDR link speed is supported. 
Signed-off-by: Hal Rosenstock Index: include/opensm/osm_subnet.h =================================================================== --- include/opensm/osm_subnet.h (revision 10010) +++ include/opensm/osm_subnet.h (working copy) @@ -34,7 +34,6 @@ * $Id$ */ - /* * Abstract: * Declaration of osm_subn_t. @@ -238,9 +237,10 @@ typedef struct _osm_subn_opt uint8_t sm_priority; uint8_t lmc; boolean_t lmc_esp0; - uint8_t max_op_vls; + uint8_t max_op_vls; + uint8_t force_link_speed; boolean_t reassign_lids; - boolean_t reassign_lfts; + boolean_t reassign_lfts; boolean_t ignore_other_sm; boolean_t single_thread; boolean_t no_multicast_option; Index: opensm/osm_subnet.c =================================================================== --- opensm/osm_subnet.c (revision 10018) +++ opensm/osm_subnet.c (working copy) @@ -452,6 +452,7 @@ osm_subn_set_default_opt( p_opt->lmc = OSM_DEFAULT_LMC; p_opt->lmc_esp0 = FALSE; p_opt->max_op_vls = OSM_DEFAULT_MAX_OP_VLS; + p_opt->force_link_speed = 0; p_opt->reassign_lids = FALSE; p_opt->reassign_lfts = TRUE; p_opt->ignore_other_sm = FALSE; @@ -840,6 +841,10 @@ osm_subn_parse_conf_file( "max_op_vls", p_key, p_val, &p_opts->max_op_vls); + __osm_subn_opts_unpack_uint8( + "force_link_speed", + p_key, p_val, &p_opts->force_link_speed); + __osm_subn_opts_unpack_boolean( "reassign_lids", p_key, p_val, &p_opts->reassign_lids); @@ -1061,6 +1066,9 @@ osm_subn_write_conf_file( "leaf_head_of_queue_lifetime 0x%02x\n\n" "# Limit the maximal operational VLs\n" "max_op_vls %u\n\n" + "# Force switch links which are more than SDR capable to \n" + "# operate at SDR speed\n\n" + "force_link_speed %u\n\n" "# The subnet_timeout code that will be set for all the ports\n" "# The actual timeout is 4.096usec * 2^\n" "subnet_timeout %u\n\n" @@ -1081,6 +1089,7 @@ osm_subn_write_conf_file( p_opts->head_of_queue_lifetime, p_opts->leaf_head_of_queue_lifetime, p_opts->max_op_vls, + p_opts->force_link_speed, p_opts->subnet_timeout, p_opts->local_phy_errors_threshold, 
p_opts->overrun_errors_threshold Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 10010) +++ opensm/osm_lid_mgr.c (working copy) @@ -1152,6 +1152,14 @@ __osm_lid_mgr_set_physp_pi( sizeof(p_pi->link_width_enabled) )) send_set = TRUE; + if ( p_mgr->p_subn->opt.force_link_speed ) + ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); + else + ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); + if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, + sizeof(p_pi->link_speed) )) + send_set = TRUE; + /* M_KeyProtectBits are always zero */ p_pi->mkey_lmc = p_mgr->p_subn->opt.lmc; /* Check to see if the value we are setting is different than Index: opensm/osm_link_mgr.c =================================================================== --- opensm/osm_link_mgr.c (revision 10010) +++ opensm/osm_link_mgr.c (working copy) @@ -310,6 +310,14 @@ __osm_link_mgr_set_physp_pi( sizeof(p_pi->link_width_enabled) )) send_set = TRUE; + if ( p_mgr->p_subn->opt.force_link_speed ) + ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); + else + ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); + if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, + sizeof(p_pi->link_speed) )) + send_set = TRUE; + /* calc new op_vls and mtu */ op_vls = osm_physp_calc_link_op_vls( p_mgr->p_log, p_mgr->p_subn, p_physp ); From jsquyres at cisco.com Wed Nov 1 05:44:07 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 1 Nov 2006 08:44:07 -0500 Subject: [openib-general] Static linking with libibverbs Message-ID: <2054C355-813A-4E62-80F0-CFCB610AF5C4@cisco.com> Do you like large executables? Do you eschew system-installed libc's? Were you ever frustrated that linking against libibverbs would prevent using "-static"? Well, now you too can produce 100% statically linked applications that use libibverbs! 
If you call now, you can take part in this exclusive offer from the OpenFabrics Alliance and Open MPI project. Hurry, supplies are limited. First, you need to download and install libibverbs v1.0.4 (which was released after OFED v1.1.1, so if you have installed OFED v1.1.1 or prior, you will need to manually update your libibverbs). Prior versions of libibverbs will not work with the "-static" option to gcc (or whatever the static linking option is for your compiler). Second, read these FAQ entries on the Open MPI web site: http://www.open-mpi.org/faq/?category=mpi-apps#static-mpi-apps http://www.open-mpi.org/faq/?category=mpi-apps#static-ofa-mpi-apps Although these FAQ entries specifically describe linking MPI applications statically, the same techniques also apply to general [non-MPI] ibverbs applications. -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From umaxx at oleco.net Wed Nov 1 07:00:49 2006 From: umaxx at oleco.net (Joerg Zinke) Date: Wed, 1 Nov 2006 16:00:49 +0100 Subject: [openib-general] libsdp question Message-ID: <20061101160049.5f13a33f@marvin.local> Hi, i use libsdp without problems. LD_PRELOAD stuff works fine, great work! but i have question: Why is there this libsdp_sys on my system? Its build from socket.c For what did i need this lib? AFAIK after i set LD_PRELOAD to libsdp.so, my application will use the socket() function from port.c, why another socket() function in another lib? regards, Joerg From eitan at mellanox.co.il Wed Nov 1 07:33:56 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Wed, 01 Nov 2006 17:33:56 +0200 Subject: [openib-general] libsdp question In-Reply-To: <20061101160049.5f13a33f@marvin.local> References: <20061101160049.5f13a33f@marvin.local> Message-ID: <4548BE64.4000002@mellanox.co.il> Hi Joerg, The socket.c is not being used anymore. Should be cleaned up. It might be needed for the old SDP implementation. Joerg Zinke wrote: > Hi, > > i use libsdp without problems. 
LD_PRELOAD stuff works fine, great work! > but i have question: > Why is there this libsdp_sys on my system? Its build from socket.c > For what did i need this lib? AFAIK after i set LD_PRELOAD to libsdp.so, > my application will use the socket() function from port.c, why another > socket() function in another lib? > > regards, > > Joerg > From jsquyres at cisco.com Wed Nov 1 07:53:26 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 1 Nov 2006 10:53:26 -0500 Subject: [openib-general] Fwd: [mvapich] Announcing the release of MVAPICH2 0.9.6 with on-demand connection management, multi-core optimized shared memory communication and memory hook support References: Message-ID: Forwarding this to the mvapich-discuss list because it has gotten zero replies on the openib-general list. If someone from OSU could reply, it would be most helpful. Thanks. Begin forwarded message: > From: Jeff Squyres > Date: October 27, 2006 11:05:17 AM EDT > To: openib > Subject: Re: [mvapich] Announcing the release of MVAPICH2 0.9.6 > with on-demand connection management, multi-core optimized shared > memory communication and memory hook support > > Any response from the OSU crew? > > Can someone provide a reason why MVAPICH is still in OpenIB's > Subversion repository? Please see my original mail, below, for > more detailed questions. > > Thanks. > > > On Oct 23, 2006, at 7:36 AM, Jeff Squyres wrote: > >> On Oct 22, 2006, at 11:53 PM, Dhabaleswar Panda wrote: >> >>> A stripped down version of this release is also available at the >>> OpenIB SVN. >> >> I see this statement in every MVAPICH release notice and it >> continues to puzzle me. 
>> >> I understand that there was a use for an alternate distribution >> source before MVAPICH became open source. But now that the >> MVAPICH code bases are freely available from OSU via multiple >> mechanisms (anonymous SVN, tarball download, etc.), why is a >> "stripped down version" maintained in the OpenIB SVN? >> >> 1. What, exactly, is the difference between the MVAPICH available >> from OSU and the "stripped down version" in the OpenIB SVN? >> >> 2. Why would someone choose to download the "stripped down >> version" from the OpenIB SVN? Have any real users/customers done so? >> >> 3. What is the point of maintaining yet more flavors of MVAPICH -- >> aren't there enough already (multiple versions from OSU, more >> versions available from each IB vendor)? >> >> DK -- can you please explain? Thanks. >> >> -- >> Jeff Squyres >> Server Virtualization Business Unit >> Cisco Systems >> >> > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From johnip at sgi.com Wed Nov 1 08:27:19 2006 From: johnip at sgi.com (John Partridge) Date: Wed, 01 Nov 2006 10:27:19 -0600 Subject: [openib-general] Ordering between PCI config space writes and MMIO reads? In-Reply-To: References: <20061024214724.GS25210@parisc-linux.org> <20061024223631.GT25210@parisc-linux.org> <20061024.154347.77057163.davem@davemloft.net> <20061031195312.GD5950@mellanox.co.il> <019301c6fd2c$044d7010$0732700a@djlaptop> <20061031204717.GG26964@parisc-linux.org> Message-ID: <4548CAE7.8010300@sgi.com> Roland Dreier wrote: > > I'm beginning to think Michael Tsirkin has the only solution to this > > -- architectures need to check that their hardware blocks until the > > config write completion has occurred (and if not, simulate that it has > > in software). > > OK, I guess I'm convinced. 
The vague language in the base PCI 3.0 > spec about "dependencies" made me think that a read of a config > register had to wait until all previous writes to the same register > are done. So I'll drop this patch for now. > > John, you'll need to try and come up with a way to solve this in the > Altix implementation of pci_write_config_xxx(). > > - R. Sorry, but I find this change a bit puzzling. The problem is particular to the PPB on the HCA and not Altix. I can't see anywhere that a PCI Config Write is required to block until completion, it is the driver and the HCA ,not the Altix hardware that requires the Config Write to have completed before we leave mthca_reset() Changing pci_write_config_xxx() will change the behavior for ALL drivers and the possibility of breaking something else. The fix was very low risk in mthca_reset(), changing the PCI code to fix this is much more onerous. I know you must feel like "piggy in the middle" with this, so I don't mean to cause you any problems, but I guess I don't understand the reluctance for the driver fix. John -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From swise at opengridcomputing.com Wed Nov 1 08:44:03 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 01 Nov 2006 10:44:03 -0600 Subject: [openib-general] [PATCH] perftest: updates for latest librdmacm Message-ID: <1162399443.32739.31.camel@stevo-desktop> Here is a patch that updates the src/userspace/perftest to the librdmacm patch Sean posted for supporting kernel ABI version 3 which is under review for 2.6.20. 
Signed-off-by: Steve Wise Index: rdma_bw.c =================================================================== --- rdma_bw.c (revision 9974) +++ rdma_bw.c (working copy) @@ -211,19 +211,18 @@ pid, __func__, event->event); goto err1; } - if (!event->private_data || - (event->private_data_len < sizeof(*data->rem_dest))) { + if (!event->param.conn.private_data || + (event->param.conn.private_data_len < sizeof(*data->rem_dest))) { fprintf(stderr, "%d:%s: bad private data ptr %p len %d\n", - pid, __func__, event->private_data, - event->private_data_len); + pid, __func__, event->param.conn.private_data, + event->param.conn.private_data_len); goto err1; } data->rem_dest = malloc(sizeof *data->rem_dest); if (!data->rem_dest) goto err1; - memcpy(data->rem_dest, event->private_data, - sizeof(*data->rem_dest)); + memcpy(data->rem_dest, event->param.conn.private_data, sizeof(*data->rem_dest)); rdma_ack_cm_event(event); } else { @@ -355,10 +354,10 @@ goto err2; } - if (!event->private_data || - (event->private_data_len < sizeof(*data->rem_dest))) { + if (!event->param.conn.private_data || + (event->param.conn.private_data_len < sizeof(*data->rem_dest))) { fprintf(stderr, "%d:%s: bad private data len %d\n", pid, - __func__, event->private_data_len); + __func__, event->param.conn.private_data_len); goto err2; } @@ -366,7 +365,7 @@ if (!data->rem_dest) goto err2; - memcpy(data->rem_dest, event->private_data, sizeof(*data->rem_dest)); + memcpy(data->rem_dest, event->param.conn.private_data, sizeof(*data->rem_dest)); child_cm_id = (struct rdma_cm_id *)event->id; ctx = pp_init_ctx(child_cm_id, data); Index: rdma_lat.c =================================================================== --- rdma_lat.c (revision 9974) +++ rdma_lat.c (working copy) @@ -285,19 +285,18 @@ pid, __func__, event->event); goto err1; } - if (!event->private_data || - (event->private_data_len < sizeof(*data->rem_dest))) { + if (!event->param.conn.private_data || + (event->param.conn.private_data_len < 
sizeof(*data->rem_dest))) { fprintf(stderr, "%d:%s: bad private data ptr %p len %d\n", - pid, __func__, event->private_data, - event->private_data_len); + pid, __func__, event->param.conn.private_data, + event->param.conn.private_data_len); goto err1; } data->rem_dest = malloc(sizeof *data->rem_dest); if (!data->rem_dest) goto err1; - memcpy(data->rem_dest, event->private_data, - sizeof(*data->rem_dest)); + memcpy(data->rem_dest, event->param.conn.private_data, sizeof(*data->rem_dest)); rdma_ack_cm_event(event); } else { for (t = res; t; t = t->ai_next) { @@ -399,10 +398,10 @@ goto err2; } - if (!event->private_data || - (event->private_data_len < sizeof(*data->rem_dest))) { + if (!event->param.conn.private_data || + (event->param.conn.private_data_len < sizeof(*data->rem_dest))) { fprintf(stderr, "%d:%s: bad private data len %d\n", pid, - __func__, event->private_data_len); + __func__, event->param.conn.private_data_len); goto err2; } @@ -410,7 +409,7 @@ if (!data->rem_dest) goto err2; - memcpy(data->rem_dest, event->private_data, sizeof(*data->rem_dest)); + memcpy(data->rem_dest, event->param.conn.private_data, sizeof(*data->rem_dest)); child_cm_id = (struct rdma_cm_id *)event->id; ctx = pp_init_ctx(child_cm_id, data); From matthew at wil.cx Wed Nov 1 08:46:44 2006 From: matthew at wil.cx (Matthew Wilcox) Date: Wed, 1 Nov 2006 09:46:44 -0700 Subject: [openib-general] Ordering between PCI config space writes and MMIO reads? In-Reply-To: <4548CAE7.8010300@sgi.com> References: <20061024223631.GT25210@parisc-linux.org> <20061024.154347.77057163.davem@davemloft.net> <20061031195312.GD5950@mellanox.co.il> <019301c6fd2c$044d7010$0732700a@djlaptop> <20061031204717.GG26964@parisc-linux.org> <4548CAE7.8010300@sgi.com> Message-ID: <20061101164643.GH11399@parisc-linux.org> On Wed, Nov 01, 2006 at 10:27:19AM -0600, John Partridge wrote: > Sorry, but I find this change a bit puzzling. The problem is particular to > the PPB on the HCA and not Altix. 
That's not true; it's more likely on Altix, but it's not unique. *any* PCI-PCI bridge can reorder pci config reads and writes. Apparently the normal PCI host bridge implementation avoids this problem by blocking until the completion comes back. If you put a quad-port tulip card into an Altix, you could experience the same problem (but it would be massively unlikely. You'd probably have to bring up three interfaces, saturate them with traffic, then bring up the fourth to see it. And even then it would be rare). > I can't see anywhere that a PCI Config > Write > is required to block until completion, it is the driver and the HCA ,not the > Altix hardware that requires the Config Write to have completed before we > leave mthca_reset() There's several places in the PCI midlayer that require the config write to have completed before we do a config read. The MWI code relies on this to see if the device supports MWI. If it gets out of order, we'll think that the device doesn't support MWI when it thinks it's been told to use MWI. Data corruption could result. > Changing pci_write_config_xxx() will change the behavior > for ALL drivers and the possibility of breaking something else. The fix was > very low risk in mthca_reset(), changing the PCI code to fix this is much > more onerous. I really don't think so. At worst you'll be changing the timing. From johnip at sgi.com Wed Nov 1 09:08:08 2006 From: johnip at sgi.com (John Partridge) Date: Wed, 01 Nov 2006 11:08:08 -0600 Subject: [openib-general] Ordering between PCI config space writes and MMIO reads? 
In-Reply-To: <20061101164643.GH11399@parisc-linux.org> References: <20061024223631.GT25210@parisc-linux.org> <20061024.154347.77057163.davem@davemloft.net> <20061031195312.GD5950@mellanox.co.il> <019301c6fd2c$044d7010$0732700a@djlaptop> <20061031204717.GG26964@parisc-linux.org> <4548CAE7.8010300@sgi.com> <20061101164643.GH11399@parisc-linux.org> Message-ID: <4548D478.2080704@sgi.com> Matthew, So, if I understand correctly, you are saying because we cannot guarantee the "flush" a config write even by doing a config read of the same register (because the PPB can re-order) we have to make sure we block or spin on the config write completion at the lowest level of the config write ? Thanks John Matthew Wilcox wrote: > On Wed, Nov 01, 2006 at 10:27:19AM -0600, John Partridge wrote: > >>Sorry, but I find this change a bit puzzling. The problem is particular to >>the PPB on the HCA and not Altix. > > > That's not true; it's more likely on Altix, but it's not unique. *any* > PCI-PCI bridge can reorder pci config reads and writes. Apparently the > normal PCI host bridge implementation avoids this problem by blocking > until the completion comes back. If you put a quad-port tulip card into > an Altix, you could experience the same problem (but it would be > massively unlikely. You'd probably have to bring up three interfaces, > saturate them with traffic, then bring up the fourth to see it. And > even then it would be rare). > > >>I can't see anywhere that a PCI Config >>Write >>is required to block until completion, it is the driver and the HCA ,not the >>Altix hardware that requires the Config Write to have completed before we >>leave mthca_reset() > > > There's several places in the PCI midlayer that require the config write > to have completed before we do a config read. The MWI code relies on > this to see if the device supports MWI. If it gets out of order, we'll > think that the device doesn't support MWI when it thinks it's been told > to use MWI. 
Data corruption could result. > > >>Changing pci_write_config_xxx() will change the behavior >>for ALL drivers and the possibility of breaking something else. The fix was >>very low risk in mthca_reset(), changing the PCI code to fix this is much >>more onerous. > > > I really don't think so. At worst you'll be changing the timing. -- John Partridge Silicon Graphics Inc Tel: 651-683-3428 Vnet: 233-3428 E-Mail: johnip at sgi.com From matthew at wil.cx Wed Nov 1 09:14:44 2006 From: matthew at wil.cx (Matthew Wilcox) Date: Wed, 1 Nov 2006 10:14:44 -0700 Subject: [openib-general] Ordering between PCI config space writes and MMIO reads? In-Reply-To: <4548D478.2080704@sgi.com> References: <20061024.154347.77057163.davem@davemloft.net> <20061031195312.GD5950@mellanox.co.il> <019301c6fd2c$044d7010$0732700a@djlaptop> <20061031204717.GG26964@parisc-linux.org> <4548CAE7.8010300@sgi.com> <20061101164643.GH11399@parisc-linux.org> <4548D478.2080704@sgi.com> Message-ID: <20061101171443.GI11399@parisc-linux.org> On Wed, Nov 01, 2006 at 11:08:08AM -0600, John Partridge wrote: > So, if I understand correctly, you are saying because we cannot guarantee > the "flush" a config write even by doing a config read of the same register > (because the PPB can re-order) we have to make sure we block or spin on the > config write completion at the lowest level of the config write ? That's correct. And I'm also saying that the reason this hasn't been thought about before is that other root bridges have a mechanism (implicit on x86, explicit on parisc) for waiting for the config write completion to come back. Seems to me that Altix uses the SAL calls to access PCI config space these days, so you can hide it in your firmware rather than patching Linux. From dotanb at mellanox.co.il Wed Nov 1 10:01:29 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 1 Nov 2006 20:01:29 +0200 Subject: [openib-general] what happens if one close the device in user level without releasing the resources? 
Message-ID: <6C2C79E72C305246B504CBA17B5500C91BD7E0@mtlexch01.mtl.com> Dotan Barak Senior Software Verification Engineer Mellanox Technologies Tel: +972-4-9097200 Ext: 231 Fax: +972-4-9593245 P.O. Box 86 Yokneam 20692 Israel. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Wed Nov 1 10:45:59 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 1 Nov 2006 20:45:59 +0200 Subject: [openib-general] [PATCH v2] opensm: remove obsolete p_report_buf Message-ID: <20061101184559.GC22655@sashak.voltaire.com> This removes obsolete now shared sm->p_report_buf buffer and cleans up related code. And also introduces new log function osm_log_printf() which currently trivially sends formatted output to stdout. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_base.h | 5 -- osm/include/opensm/osm_log.h | 3 + osm/include/opensm/osm_sm.h | 2 - osm/include/opensm/osm_state_mgr.h | 8 -- osm/include/opensm/osm_ucast_mgr.h | 5 -- osm/opensm/libopensm.map | 3 +- osm/opensm/osm_log.c | 19 +++++ osm/opensm/osm_mcast_mgr.c | 11 ++-- osm/opensm/osm_sm.c | 15 +---- osm/opensm/osm_state_mgr.c | 138 ++++++++++++----------------------- osm/opensm/osm_ucast_mgr.c | 80 +++++++-------------- 11 files changed, 104 insertions(+), 185 deletions(-) diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h index 57dd4fd..20e2cc3 100644 --- a/osm/include/opensm/osm_base.h +++ b/osm/include/opensm/osm_base.h @@ -714,11 +714,6 @@ typedef enum _osm_state_mgr_mode * **********/ -#define OSM_REPORT_BUF_SIZE 0x10000 -#define OSM_REPORT_LINE_SIZE 0x256 -#define OSM_REPORT_BUF_THRESHOLD (OSM_REPORT_BUF_SIZE / OSM_REPORT_LINE_SIZE) - - /****d* OpenSM: Base/osm_sm_signal_t * NAME * osm_sm_signal_t diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h index 62f3a0c..6a1a93f 100644 --- a/osm/include/opensm/osm_log.h +++ b/osm/include/opensm/osm_log.h @@ -370,6 +370,9 @@ osm_log_is_active( * 
osm_log_destroy *********/ +extern int osm_log_printf(osm_log_t *p_log, osm_log_level_t level, + const char *fmt, ...); + void osm_log( IN osm_log_t* const p_log, diff --git a/osm/include/opensm/osm_sm.h b/osm/include/opensm/osm_sm.h index bc812f3..05b87ac 100644 --- a/osm/include/opensm/osm_sm.h +++ b/osm/include/opensm/osm_sm.h @@ -178,8 +178,6 @@ typedef struct _osm_sm osm_vla_rcv_ctrl_t vla_rcv_ctrl; osm_pkey_rcv_t pkey_rcv; osm_pkey_rcv_ctrl_t pkey_rcv_ctrl; - char* p_report_buf; - } osm_sm_t; /* * FIELDS diff --git a/osm/include/opensm/osm_state_mgr.h b/osm/include/opensm/osm_state_mgr.h index ad4afa0..7aaab58 100644 --- a/osm/include/opensm/osm_state_mgr.h +++ b/osm/include/opensm/osm_state_mgr.h @@ -121,7 +121,6 @@ typedef struct _osm_state_mgr cl_qlist_t idle_time_list; cl_plock_t *p_lock; cl_event_t *p_subnet_up_event; - char *p_report_buf; osm_sm_state_t state; osm_state_mgr_mode_t state_step_mode; osm_signal_t next_stage_signal; @@ -170,9 +169,6 @@ typedef struct _osm_state_mgr * p_subnet_up_event * Pointer to the event to set if/when the subnet comes up. * -* p_report_buf -* Pointer to the large log buffer used for user reports. -* * state * State of the SM. * @@ -380,7 +376,6 @@ osm_state_mgr_init( IN const osm_sm_mad_ctrl_t* const p_mad_ctrl, IN cl_plock_t* const p_lock, IN cl_event_t* const p_subnet_up_event, - IN char* const p_report_buf, IN osm_log_t* const p_log ); /* * PARAMETERS @@ -420,9 +415,6 @@ osm_state_mgr_init( * p_subnet_up_event * [in] Pointer to the event to set if/when the subnet comes up. * -* p_report_buf -* [in] Pointer to the large log buffer used for user reports. -* * p_log * [in] Pointer to the log object. 
* diff --git a/osm/include/opensm/osm_ucast_mgr.h b/osm/include/opensm/osm_ucast_mgr.h index 0fbfc66..1c10abb 100644 --- a/osm/include/opensm/osm_ucast_mgr.h +++ b/osm/include/opensm/osm_ucast_mgr.h @@ -105,7 +105,6 @@ typedef struct _osm_ucast_mgr osm_req_t *p_req; osm_log_t *p_log; cl_plock_t *p_lock; - char *p_report_buf; } osm_ucast_mgr_t; /* * FIELDS @@ -204,7 +203,6 @@ osm_ucast_mgr_init( IN osm_ucast_mgr_t* const p_mgr, IN osm_req_t* const p_req, IN osm_subn_t* const p_subn, - IN char* const p_report_buf, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ); /* @@ -218,9 +216,6 @@ osm_ucast_mgr_init( * p_subn * [in] Pointer to the Subnet object for this subnet. * -* p_report_buf -* [in] Pointer to the large log buffer used for user reporting. -* * p_log * [in] Pointer to the log object. * diff --git a/osm/opensm/libopensm.map b/osm/opensm/libopensm.map index 60d532f..25370b1 100644 --- a/osm/opensm/libopensm.map +++ b/osm/opensm/libopensm.map @@ -1,6 +1,7 @@ -OPENSM_1.3 { +OPENSM_1.4 { global: osm_log; + osm_log_printf; osm_is_debug; osm_log_init; osm_log_init_v2; diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c index 8ac7f8f..c6cc072 100644 --- a/osm/opensm/osm_log.c +++ b/osm/opensm/osm_log.c @@ -109,6 +109,25 @@ static void truncate_log_file(osm_log_t* } #endif /* ndef WIN32 */ +int osm_log_printf(osm_log_t *p_log, osm_log_level_t level, + const char *fmt, ...) 
+{ + va_list args; + int ret; + + if (!(p_log->level&level)) + return 0; + + va_start(args, fmt); + ret = vfprintf(stdout, fmt, args); + va_end(args); + + if (p_log->flush || level&OSM_LOG_ERROR) + fflush( stdout ); + + return ret; +} + void osm_log( IN osm_log_t* const p_log, diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c index 5a01578..82ef7c3 100644 --- a/osm/opensm/osm_mcast_mgr.c +++ b/osm/opensm/osm_mcast_mgr.c @@ -1382,14 +1382,13 @@ static void mcast_mgr_dump_sw_routes( IN const osm_mcast_mgr_t* const p_mgr, IN const osm_switch_t* const p_sw, - IN FILE *p_mcfdbFile ) + IN FILE *file ) { osm_mcast_tbl_t* p_tbl; int16_t mlid_ho = 0; int16_t mlid_start_ho; uint8_t position = 0; int16_t block_num = 0; - char line[OSM_REPORT_LINE_SIZE]; boolean_t print_lid; const osm_node_t* p_node; uint16_t i, j; @@ -1404,7 +1403,7 @@ mcast_mgr_dump_sw_routes( p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw ); - fprintf( p_mcfdbFile, "\nSwitch 0x%016" PRIx64 "\n" + fprintf( file, "\nSwitch 0x%016" PRIx64 "\n" "LID : Out Port(s)\n", cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); while ( block_num <= p_tbl->max_block_in_use ) @@ -1415,7 +1414,7 @@ mcast_mgr_dump_sw_routes( mlid_ho = mlid_start_ho + i; position = 0; print_lid = FALSE; - sprintf( line, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); + fprintf( file, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); while ( position <= p_tbl->max_position ) { mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]); @@ -1428,13 +1427,13 @@ mcast_mgr_dump_sw_routes( for (j = 0 ; j < 16 ; j++) { if ( (1 << j) & mask_entry ) - sprintf( line, "%s 0x%03X ", line, j+(position*16) ); + fprintf( file, " 0x%03X ", j+(position*16) ); } position++; } if (print_lid) { - fprintf( p_mcfdbFile, "%s\n", line ); + fprintf( file, "\n" ); } } block_num++; diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c index fef3cac..fb4f759 100644 --- a/osm/opensm/osm_sm.c +++ b/osm/opensm/osm_sm.c @@ -256,9 +256,6 @@ osm_sm_destroy( 
cl_event_destroy( &p_sm->signal ); cl_event_destroy( &p_sm->subnet_up_event ); - if( p_sm->p_report_buf != NULL ) - free( p_sm->p_report_buf ); - osm_log( p_sm->p_log, OSM_LOG_SYS, "Exiting SM\n" ); /* Format Waived */ OSM_LOG_EXIT( p_sm->p_log ); } @@ -291,15 +288,6 @@ osm_sm_init( p_sm->p_disp = p_disp; p_sm->p_lock = p_lock; - p_sm->p_report_buf = malloc( OSM_REPORT_BUF_SIZE ); - if( p_sm->p_report_buf == NULL ) - { - osm_log( p_sm->p_log, OSM_LOG_ERROR, - "osm_sm_init: ERR 2E09: " - "Can't allocate report buffer\n" ); - status = IB_INSUFFICIENT_MEMORY; - goto Exit; - } status = cl_event_init( &p_sm->signal, FALSE ); if( status != CL_SUCCESS ) goto Exit; @@ -385,7 +373,6 @@ osm_sm_init( status = osm_ucast_mgr_init( &p_sm->ucast_mgr, &p_sm->req, p_sm->p_subn, - p_sm->p_report_buf, p_sm->p_log, p_sm->p_lock ); if( status != IB_SUCCESS ) goto Exit; @@ -409,7 +396,7 @@ osm_sm_init( &p_sm->mad_ctrl, p_sm->p_lock, &p_sm->subnet_up_event, - p_sm->p_report_buf, p_sm->p_log ); + p_sm->p_log ); if( status != IB_SUCCESS ) goto Exit; diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index a2efee4..66da6fa 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -118,7 +118,6 @@ osm_state_mgr_init( IN const osm_sm_mad_ctrl_t * const p_mad_ctrl, IN cl_plock_t * const p_lock, IN cl_event_t * const p_subnet_up_event, - IN char *const p_report_buf, IN osm_log_t * const p_log ) { cl_status_t status; @@ -136,7 +135,6 @@ osm_state_mgr_init( CL_ASSERT( p_sm_state_mgr ); CL_ASSERT( p_mad_ctrl ); CL_ASSERT( p_lock ); - CL_ASSERT( p_report_buf ); osm_state_mgr_construct( p_mgr ); @@ -154,7 +152,6 @@ osm_state_mgr_init( p_mgr->state = OSM_SM_STATE_IDLE; p_mgr->p_lock = p_lock; p_mgr->p_subnet_up_event = p_subnet_up_event; - p_mgr->p_report_buf = p_report_buf; p_mgr->state_step_mode = OSM_STATE_STEP_CONTINUOUS; p_mgr->next_stage_signal = OSM_SIGNAL_NONE; @@ -1255,16 +1252,19 @@ __osm_state_mgr_report( uint8_t port_num; uint8_t start_port; uint32_t 
num_ports; - char line[OSM_REPORT_LINE_SIZE]; uint8_t node_type; - uint32_t line_num = 0; + + if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) + return; OSM_LOG_ENTER( p_mgr->p_log, __osm_state_mgr_report ); - if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) - { - goto Exit; - } + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, + "\n===================================================" + "====================================================" + "\nVendor : Ty " + ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " + " : Neighbor Port (Port #)\n" ); p_tbl = &p_mgr->p_subn->port_guid_tbl; @@ -1294,29 +1294,16 @@ __osm_state_mgr_report( num_ports = osm_port_get_num_physp( p_port ); for( port_num = start_port; port_num < num_ports; port_num++ ) { - if( line_num == 0 ) - { - strcpy( p_mgr->p_report_buf, - "\n===================================================" - "====================================================" ); - strcat( p_mgr->p_report_buf, - "\nVendor : Ty " - ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " - " : Neighbor Port (Port #)\n" ); - line_num++; - } - p_physp = osm_port_get_phys_ptr( p_port, port_num ); if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) ) continue; - sprintf( line, "%s : %s : %02X :", - osm_get_manufacturer_str( cl_ntoh64 - ( osm_node_get_node_guid - ( p_node ) ) ), - osm_get_node_type_str_fixed_width( node_type ), port_num ); - - strcat( p_mgr->p_report_buf, line ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "%s : %s : %02X :", + osm_get_manufacturer_str( cl_ntoh64 + ( osm_node_get_node_guid + ( p_node ) ) ), + osm_get_node_type_str_fixed_width( node_type ), + port_num ); p_pi = osm_physp_get_port_info_ptr( p_physp ); @@ -1324,61 +1311,40 @@ __osm_state_mgr_report( * Port state is not defined for switch port 0 */ if( port_num == 0 ) - strcat( p_mgr->p_report_buf, " :" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " :" ); else - { - sprintf( line, " %s :", - 
osm_get_port_state_str_fixed_width - ( ib_port_info_get_port_state( p_pi ) ) ); - strcat( p_mgr->p_report_buf, line ); - } + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " %s :", + osm_get_port_state_str_fixed_width + ( ib_port_info_get_port_state( p_pi ) ) ); /* * LID values are only meaningful in select cases. */ - if( ib_port_info_get_port_state( p_pi ) != IB_LINK_DOWN ) - { - if( ( ( node_type == IB_NODE_TYPE_SWITCH ) && ( port_num == 0 ) ) - || ( node_type != IB_NODE_TYPE_SWITCH ) ) - { - sprintf( line, " %04X : %01X :", - cl_ntoh16( p_pi->base_lid ), - ib_port_info_get_lmc( p_pi ) ); - - strcat( p_mgr->p_report_buf, line ); - } - else - strcat( p_mgr->p_report_buf, " : :" ); - } + if( ib_port_info_get_port_state( p_pi ) != IB_LINK_DOWN + && ( ( node_type == IB_NODE_TYPE_SWITCH && port_num == 0 ) + || node_type != IB_NODE_TYPE_SWITCH ) ) + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " %04X : %01X :", + cl_ntoh16( p_pi->base_lid ), + ib_port_info_get_lmc( p_pi ) ); else - strcat( p_mgr->p_report_buf, " : :" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " : :" ); if( port_num != 0 ) - { - sprintf( line, " %s : %s : %s ", - osm_get_mtu_str( ib_port_info_get_neighbor_mtu( p_pi ) ), - osm_get_lwa_str( p_pi->link_width_active ), - osm_get_lsa_str( ib_port_info_get_link_speed_active - ( p_pi ) ) ); - } + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " %s : %s : %s ", + osm_get_mtu_str( ib_port_info_get_neighbor_mtu( p_pi ) ), + osm_get_lwa_str( p_pi->link_width_active ), + osm_get_lsa_str( ib_port_info_get_link_speed_active + ( p_pi ) ) ); else - { - sprintf( line, " %s : %s : %s ", " ", " ", " " ); - } - strcat( p_mgr->p_report_buf, line ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " : : " ); if( osm_physp_get_port_guid( p_physp ) == p_mgr->p_subn->sm_port_guid ) - { - sprintf( line, "* %016" PRIx64 " *", - cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); - } + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "* %016" PRIx64 " *", + cl_ntoh64( 
osm_physp_get_port_guid( p_physp ) ) ); else - { - sprintf( line, ": %016" PRIx64 " :", - cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); - } - strcat( p_mgr->p_report_buf, line ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, ": %016" PRIx64 " :", + cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); if( port_num && ( ib_port_info_get_port_state( p_pi ) != IB_LINK_DOWN ) ) @@ -1386,36 +1352,26 @@ __osm_state_mgr_report( p_remote_physp = osm_physp_get_remote( p_physp ); if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) ) { - sprintf( line, " %016" PRIx64 " (%02X)", - cl_ntoh64( osm_physp_get_port_guid - ( p_remote_physp ) ), - osm_physp_get_port_num( p_remote_physp ) ); - strcat( p_mgr->p_report_buf, line ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, + " %016" PRIx64 " (%02X)", + cl_ntoh64( osm_physp_get_port_guid + ( p_remote_physp ) ), + osm_physp_get_port_num( p_remote_physp ) ); } else - strcat( p_mgr->p_report_buf, " UNKNOWN" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " UNKNOWN" ); } - strcat( p_mgr->p_report_buf, "\n" ); - - if( ++line_num >= OSM_REPORT_BUF_THRESHOLD ) - { - osm_log_raw( p_mgr->p_log, OSM_LOG_VERBOSE, p_mgr->p_report_buf ); - line_num = 0; - } + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "\n" ); } - strcat( p_mgr->p_report_buf, - "------------------------------------------------------" - "------------------------------------------------\n" ); + + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, + "------------------------------------------------------" + "------------------------------------------------\n" ); p_port = ( osm_port_t * ) cl_qmap_next( &p_port->map_item ); } CL_PLOCK_RELEASE( p_mgr->p_lock ); - - if( line_num != 0 ) - osm_log_raw( p_mgr->p_log, OSM_LOG_VERBOSE, p_mgr->p_report_buf ); - - Exit: OSM_LOG_EXIT( p_mgr->p_log ); } diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c index f1d085c..fc97094 100644 --- a/osm/opensm/osm_ucast_mgr.c +++ b/osm/opensm/osm_ucast_mgr.c @@ -103,7 
+103,6 @@ osm_ucast_mgr_init( IN osm_ucast_mgr_t* const p_mgr, IN osm_req_t* const p_req, IN osm_subn_t* const p_subn, - IN char* const p_report_buf, IN osm_log_t* const p_log, IN cl_plock_t* const p_lock ) { @@ -121,7 +120,6 @@ osm_ucast_mgr_init( p_mgr->p_subn = p_subn; p_mgr->p_lock = p_lock; p_mgr->p_req = p_req; - p_mgr->p_report_buf = p_report_buf; OSM_LOG_EXIT( p_mgr->p_log ); return( status ); @@ -184,26 +182,25 @@ __osm_ucast_mgr_dump_path_distribution( ib_net64_t remote_guid_ho; osm_switch_t* p_sw = (osm_switch_t *)p_map_item; osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; - char line[OSM_REPORT_LINE_SIZE]; OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_path_distribution ); p_node = osm_switch_get_node_ptr( p_sw ); num_ports = osm_switch_get_num_ports( p_sw ); - sprintf( p_mgr->p_report_buf, "__osm_ucast_mgr_dump_path_distribution: " - "Switch 0x%" PRIx64 "\n" - "Port : Path Count Through Port", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, + "__osm_ucast_mgr_dump_path_distribution: " + "Switch 0x%" PRIx64 "\n" + "Port : Path Count Through Port", + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); for( i = 0; i < num_ports; i++ ) { num_paths = osm_switch_path_count_get( p_sw , i ); - sprintf( line, "\n %03u : %u", i, num_paths ); - strcat( p_mgr->p_report_buf, line ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG,"\n %03u : %u", i, num_paths ); if( i == 0 ) { - strcat( p_mgr->p_report_buf, " (switch management port)" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (switch management port)" ); continue; } @@ -216,26 +213,24 @@ __osm_ucast_mgr_dump_path_distribution( switch( osm_node_get_remote_type( p_node, i ) ) { case IB_NODE_TYPE_SWITCH: - strcat( p_mgr->p_report_buf, " (link to switch" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to switch" ); break; case IB_NODE_TYPE_ROUTER: - strcat( p_mgr->p_report_buf, " (link to router" ); + osm_log_printf( 
p_mgr->p_log, OSM_LOG_DEBUG, " (link to router" ); break; case IB_NODE_TYPE_CA: - strcat( p_mgr->p_report_buf, " (link to CA" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to CA" ); break; default: - strcat( p_mgr->p_report_buf, " (link to unknown node type" ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to unknown node type" ); break; } - sprintf( line, " 0x%" PRIx64 ")", remote_guid_ho ); - strcat( p_mgr->p_report_buf, line ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " 0x%" PRIx64 ")", + remote_guid_ho ); } - strcat( p_mgr->p_report_buf, "\n" ); - - osm_log_raw( p_mgr->p_log, OSM_LOG_ROUTING, p_mgr->p_report_buf ); + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, "\n" ); OSM_LOG_EXIT( p_mgr->p_log ); } @@ -254,29 +249,24 @@ __osm_ucast_mgr_dump_ucast_routes( uint8_t best_port; uint16_t max_lid_ho; uint16_t lid_ho; - uint32_t line_num = 0; boolean_t ui_ucast_fdb_assign_func_defined; osm_switch_t* p_sw = (osm_switch_t *)p_map_item; osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; - FILE *p_fdbFile = ((struct ucast_mgr_dump_context *)cxt)->file; - char line[OSM_REPORT_LINE_SIZE]; - + FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; + OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_ucast_routes ); p_node = osm_switch_get_node_ptr( p_sw ); max_lid_ho = osm_switch_get_max_lid_ho( p_sw ); + fprintf( file, "__osm_ucast_mgr_dump_ucast_routes: " + "Switch 0x%016" PRIx64 "\n" + "LID : Port : Hops : Optimal\n", + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); for( lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++ ) { - if( line_num == 0 ) - { - sprintf( p_mgr->p_report_buf, "__osm_ucast_mgr_dump_ucast_routes: " - "Switch 0x%016" PRIx64 "\n" - "LID : Port : Hops : Optimal\n", - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); - line_num++; - } + fprintf(file, "0x%04X : ", lid_ho); port_num = osm_switch_get_port_by_lid( p_sw, lid_ho ); if( port_num == OSM_NO_PATH ) @@ -287,9 +277,7 @@ __osm_ucast_mgr_dump_ucast_routes( 
will reassign and compress the LID range. The subnet should work fine either way. */ - sprintf( line, "0x%04X : UNREACHABLE\n", lid_ho ); - strcat( p_mgr->p_report_buf, line ); - line_num++; + fprintf( file, "UNREACHABLE\n" ); continue; } /* @@ -301,19 +289,15 @@ __osm_ucast_mgr_dump_ucast_routes( num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); if( num_hops == OSM_NO_PATH ) { - sprintf( line, "0x%04X : UNREACHABLE\n", lid_ho ); - strcat( p_mgr->p_report_buf, line ); - line_num++; + fprintf( file, "UNREACHABLE\n" ); continue; } best_hops = osm_switch_get_least_hops( p_sw, lid_ho ); - sprintf( line, "0x%04X : %03u : %02u : ", - lid_ho, port_num, num_hops ); - strcat( p_mgr->p_report_buf, line ); + fprintf( file, "%03u : %02u : ", port_num, num_hops ); if( best_hops == num_hops ) - strcat( p_mgr->p_report_buf, "yes" ); + fprintf( file, "yes" ); else { if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) @@ -328,23 +312,13 @@ __osm_ucast_mgr_dump_ucast_routes( p_sw, lid_ho, TRUE, NULL, NULL, NULL, NULL, /* No LMC Optimization */ ui_ucast_fdb_assign_func_defined ); - sprintf( line, "No %u hop path possible via port %u!", + fprintf( file, "No %u hop path possible via port %u!", best_hops, best_port ); - strcat( p_mgr->p_report_buf, line ); } - strcat( p_mgr->p_report_buf, "\n" ); - - if( ++line_num >= OSM_REPORT_BUF_THRESHOLD ) - { - fprintf(p_fdbFile,"%s",p_mgr->p_report_buf ); - line_num = 0; - } + fprintf( file, "\n" ); } - if( line_num != 0 ) - fprintf(p_fdbFile,"%s\n",p_mgr->p_report_buf ); - OSM_LOG_EXIT( p_mgr->p_log ); } -- 1.4.3.2.g4bf7 From sashak at voltaire.com Wed Nov 1 10:48:17 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 1 Nov 2006 20:48:17 +0200 Subject: [openib-general] [PATCH TRIVIAL] opensm: trivial indentaion fixes Message-ID: <20061101184817.GD22655@sashak.voltaire.com> Trivial indentaion fixes. 
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_inform.h | 8 ++++---- osm/include/opensm/osm_sa.h | 33 ++++++++++++++++----------------- 2 files changed, 20 insertions(+), 21 deletions(-) diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h index a6926c3..cdce214 100644 --- a/osm/include/opensm/osm_inform.h +++ b/osm/include/opensm/osm_inform.h @@ -133,7 +133,7 @@ typedef struct _osm_infr_t */ osm_infr_t* osm_infr_new( - IN const osm_infr_t *p_infr_rec ); + IN const osm_infr_t *p_infr_rec ); /* * PARAMETERS * p_inf_rec @@ -305,9 +305,9 @@ osm_infr_remove_from_db( */ ib_api_status_t osm_report_notice( - IN osm_log_t* const p_log, - IN osm_subn_t* p_subn, - IN ib_mad_notice_attr_t* p_ntc ); + IN osm_log_t* const p_log, + IN osm_subn_t* p_subn, + IN ib_mad_notice_attr_t* p_ntc ); /* * PARAMETERS * p_rcv diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h index 0a6bc04..0d450ad 100644 --- a/osm/include/opensm/osm_sa.h +++ b/osm/include/opensm/osm_sa.h @@ -172,31 +172,30 @@ #if defined (VENDOR_RMPP_SUPPORT) && def osm_mpr_rcv_ctrl_t mpr_rcv_ctrl; #endif - /* InformInfo Receiver */ - osm_infr_rcv_t infr_rcv; - osm_infr_rcv_ctrl_t infr_rcv_ctrl; + /* InformInfo Receiver */ + osm_infr_rcv_t infr_rcv; + osm_infr_rcv_ctrl_t infr_rcv_ctrl; - /* VL Arbitrartion Query */ - osm_vlarb_rec_rcv_t vlarb_rec_rcv; - osm_vlarb_rec_rcv_ctrl_t vlarb_rec_rcv_ctrl; + /* VL Arbitrartion Query */ + osm_vlarb_rec_rcv_t vlarb_rec_rcv; + osm_vlarb_rec_rcv_ctrl_t vlarb_rec_rcv_ctrl; - /* SLtoVL Map Query */ - osm_slvl_rec_rcv_t slvl_rec_rcv; - osm_slvl_rec_rcv_ctrl_t slvl_rec_rcv_ctrl; + /* SLtoVL Map Query */ + osm_slvl_rec_rcv_t slvl_rec_rcv; + osm_slvl_rec_rcv_ctrl_t slvl_rec_rcv_ctrl; - /* P_Key table Query */ - osm_pkey_rec_rcv_t pkey_rec_rcv; - osm_pkey_rec_rcv_ctrl_t pkey_rec_rcv_ctrl; + /* P_Key table Query */ + osm_pkey_rec_rcv_t pkey_rec_rcv; + osm_pkey_rec_rcv_ctrl_t pkey_rec_rcv_ctrl; - /* LinearForwardingTable Query */ - 
osm_lftr_rcv_t lftr_rcv; - osm_lftr_rcv_ctrl_t lftr_rcv_ctrl; - + /* LinearForwardingTable Query */ + osm_lftr_rcv_t lftr_rcv; + osm_lftr_rcv_ctrl_t lftr_rcv_ctrl; } osm_sa_t; /* * FIELDS * state - State of this SA object +* State of this SA object * p_subn * Pointer to the Subnet object for this subnet. * -- 1.4.3.2.g4bf7 From mshefty at ichips.intel.com Wed Nov 1 10:48:12 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Wed, 01 Nov 2006 10:48:12 -0800 Subject: [openib-general] remote node/port going down notification In-Reply-To: <45479B39.1070404@veritas.com> References: <45479B39.1070404@veritas.com> Message-ID: <4548EBEC.50309@ichips.intel.com> somenath wrote: > is there a way to get remote node/port down notification (other part of > a connected qpair)? This would be part of event registration with the SA (InformInfo/Notice support). This is being worked on, but is a few weeks away from being ready. - Sean From mst at mellanox.co.il Wed Nov 1 10:58:31 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 1 Nov 2006 20:58:31 +0200 Subject: [openib-general] Static linking with libibverbs In-Reply-To: <2054C355-813A-4E62-80F0-CFCB610AF5C4@cisco.com> References: <2054C355-813A-4E62-80F0-CFCB610AF5C4@cisco.com> Message-ID: <20061101185831.GA9085@mellanox.co.il> Quoting r. Jeff Squyres : > Subject: Static linking with libibverbs > > Do you like large executables? > Do you eschew system-installed libc's? > Were you ever frustrated that linking against libibverbs would > prevent using "-static"? > > Well, now you too can produce 100% statically linked applications > that use libibverbs! If you call now, you can take part in this > exclusive offer from the OpenFabrics Alliance and Open MPI project. > Hurry, supplies are limited. > > First, you need to download and install libibverbs v1.0.4 (which was > released after OFED v1.1.1, so if you have installed OFED v1.1.1 or > prior, you will need to manually update your libibverbs). 
Prior
> versions of libibverbs will not work with the "-static" option to gcc
> (or whatever the static linking option is for your compiler).

static linking actually can be made to work even with older library versions.
See this HowTo (written on 02 of November, 2005).
https://openib.org/tiki/tiki-index.php?page=HowToFAQ

--
MST

From mst at mellanox.co.il Wed Nov 1 10:59:39 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 1 Nov 2006 20:59:39 +0200
Subject: [openib-general] OFED 1.1 Build Issue
In-Reply-To: <454898F9.1080303@voltaire.com>
References: <45470D59.7020705@dev.mellanox.co.il> <45472202.9080104@voltaire.com> <454742C2.2050900@silverstorm.com> <45476C55.1080300@dev.mellanox.co.il> <454898F9.1080303@voltaire.com>
Message-ID: <20061101185939.GB9085@mellanox.co.il>

Quoting r. Moni Shoua :
> This is a part of the shell script that updates Module.symvers which you
> can use if you don't find a way how to generate Module.symvers in 2.4
> kernels

I don't think OFED supports 2.4.

--
MST

From jsquyres at cisco.com Wed Nov 1 11:07:56 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 1 Nov 2006 14:07:56 -0500
Subject: [openib-general] Static linking with libibverbs
In-Reply-To: <20061101185831.GA9085@mellanox.co.il>
References: <2054C355-813A-4E62-80F0-CFCB610AF5C4@cisco.com> <20061101185831.GA9085@mellanox.co.il>
Message-ID:

On Nov 1, 2006, at 1:58 PM, Michael S. Tsirkin wrote:

> static linking actually can be made to work even with older library
> versions.
> See this HowTo (written on 02 of November, 2005).
> https://openib.org/tiki/tiki-index.php?page=HowToFAQ

Are you talking about linking with "-static" or just with libibverbs.a?

I'm talking about linking with -static, which has more requirements than just linking libibverbs.a and . I don't see mention of that in the thread that you cite on the HowToFAQ.
There are issues with deep linker voodoo that prevent -static from working properly that Roland just fixed -- he had to change the order of loading up the plugins so that you wouldn't get multiple versions of system libraries loaded into the same process, such as one statically linked in and one dynamically linked in that was pulled in by an implicit linker dependency from the DSO that was dlopen()'ed. This causes Bad Things to happen; lions, tigers, and bears.

Roland -- can you explain what you did? I think I could explain it, but better to come from the guy who did it so that the details will be right.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From halr at voltaire.com Wed Nov 1 11:29:37 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Nov 2006 14:29:37 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: trivial indentaion fixes
In-Reply-To: <20061101184817.GD22655@sashak.voltaire.com>
References: <20061101184817.GD22655@sashak.voltaire.com>
Message-ID: <1162409364.29957.65277.camel@hal.voltaire.com>

On Wed, 2006-11-01 at 13:48, Sasha Khapyorsky wrote:
> Trivial indentaion fixes.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From sashak at voltaire.com Wed Nov 1 12:39:36 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 1 Nov 2006 22:39:36 +0200
Subject: [openib-general] [PATCH] opensm: reuse PKey values for "dynamic" partitions.
Message-ID: <20061101203936.GC9985@sashak.voltaire.com>

When a partition is specified in the partition configuration file without a desired PKey value, OpenSM generates one dynamically. The problem is that when the list of such "dynamic" partitions is edited (some partitions are removed and/or some are added), the PKey values are regenerated and reassigned.

This patch fixes this undesired behavior. Now OpenSM will try to reuse PKey values for such "dynamic" partitions.
Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_partition.h | 14 +++++++-- osm/opensm/osm_prtn.c | 55 ++++++++++++++++++++++++++++------- 2 files changed, 55 insertions(+), 14 deletions(-) diff --git a/osm/include/opensm/osm_partition.h b/osm/include/opensm/osm_partition.h index 3e2c896..0a63411 100644 --- a/osm/include/opensm/osm_partition.h +++ b/osm/include/opensm/osm_partition.h @@ -118,9 +118,17 @@ typedef struct _osm_prtn * sl * The Service Level (SL) associated with this Partiton. * -* port_guid_tbl -* Container of pointers to all Port objects in the Partition, -* indexed by port GUID. +* part_guid_tbl +* Container of pointers to all Port objects in the Partition +* with limited membership, indexed by port GUID. +* +* full_guid_tbl +* Container of pointers to all Port objects in the Partition +* with full membership, indexed by port GUID. +* +* name +* Name of the Partition as specified in partition +* configuration. * * SEE ALSO * Partition diff --git a/osm/opensm/osm_prtn.c b/osm/opensm/osm_prtn.c index ae0f6e0..7719fd7 100644 --- a/osm/opensm/osm_prtn.c +++ b/osm/opensm/osm_prtn.c @@ -263,6 +263,22 @@ static uint16_t __generate_pkey(osm_subn return 0; } +static osm_prtn_t *find_prtn_by_name(osm_subn_t *p_subn, const char *name) +{ + cl_map_item_t *p_next; + osm_prtn_t *p; + + p_next = cl_qmap_head(&p_subn->prtn_pkey_tbl); + while (p_next != cl_qmap_end(&p_subn->prtn_pkey_tbl)) { + p = (osm_prtn_t *)p_next; + p_next = cl_qmap_next(&p->map_item); + if (!strncmp(p->name, name, sizeof(p->name))) + return p; + } + + return NULL; +} + osm_prtn_t *osm_prtn_make_new(osm_log_t *p_log, osm_subn_t *p_subn, const char *name, uint16_t pkey) { @@ -270,8 +286,12 @@ osm_prtn_t *osm_prtn_make_new(osm_log_t pkey &= cl_hton16((uint16_t)~0x8000); - if (pkey == 0 && !(pkey = __generate_pkey(p_subn))) - return NULL; + if (!pkey) { + if (name && (p = find_prtn_by_name(p_subn, name))) + return p; + if(!(pkey = __generate_pkey(p_subn))) + return NULL; + } p = 
osm_prtn_new(name, pkey); if (!p) { @@ -327,7 +347,8 @@ ib_api_status_t osm_prtn_make_partitions const char *file_name; boolean_t is_config = TRUE; ib_api_status_t status = IB_SUCCESS; - osm_prtn_t *p, *p_next; + cl_map_item_t *p_next; + osm_prtn_t *p; file_name = p_subn->opt.partition_config_file ? p_subn->opt.partition_config_file : @@ -335,15 +356,14 @@ ib_api_status_t osm_prtn_make_partitions if (stat(file_name, &statbuf)) is_config = FALSE; - /* cl_qmap uses self addresses so we cannot just save - qmap state and clean it later, so clean all now */ - p_next = (osm_prtn_t *)cl_qmap_head(&p_subn->prtn_pkey_tbl); - while (p_next != (osm_prtn_t *)cl_qmap_end(&p_subn->prtn_pkey_tbl)) { - p = p_next; - p_next = (osm_prtn_t *)cl_qmap_next(&p->map_item); - osm_prtn_delete(&p); + /* clean up current port maps */ + p_next = cl_qmap_head(&p_subn->prtn_pkey_tbl); + while (p_next != cl_qmap_end(&p_subn->prtn_pkey_tbl)) { + p = (osm_prtn_t *)p_next; + p_next = cl_qmap_next(&p->map_item); + cl_map_remove_all(&p->part_guid_tbl); + cl_map_remove_all(&p->full_guid_tbl); } - cl_qmap_init(&p_subn->prtn_pkey_tbl); global_pkey_counter = 0; @@ -357,6 +377,19 @@ ib_api_status_t osm_prtn_make_partitions "was not fully processed\n"); } + /* and now clean up empty partitions */ + p_next = cl_qmap_head(&p_subn->prtn_pkey_tbl); + while (p_next != cl_qmap_end(&p_subn->prtn_pkey_tbl)) { + p = (osm_prtn_t *)p_next; + p_next = cl_qmap_next(&p->map_item); + if (cl_map_count(&p->part_guid_tbl) == 0 && + cl_map_count(&p->full_guid_tbl) == 0) { + cl_qmap_remove_item(&p_subn->prtn_pkey_tbl, + (cl_map_item_t *)p); + osm_prtn_delete(&p); + } + } + _err: return status; } -- 1.4.3.2.g4bf7 From sashak at voltaire.com Wed Nov 1 13:36:52 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 1 Nov 2006 23:36:52 +0200 Subject: [openib-general] [PATCH TRIVIAL] opensm: trivial log message fix Message-ID: <20061101213652.GE9985@sashak.voltaire.com> Trivial log message fix. 
Signed-off-by: Sasha Khapyorsky
---
 osm/opensm/osm_ucast_file.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/osm/opensm/osm_ucast_file.c b/osm/opensm/osm_ucast_file.c
index 4550177..e844faa 100644
--- a/osm/opensm/osm_ucast_file.c
+++ b/osm/opensm/osm_ucast_file.c
@@ -300,7 +300,7 @@ static int do_lid_matrix_file_load(void
 	file = fopen(file_name, "r");
 	if (!file) {
 		osm_log(&p_osm->log, OSM_LOG_ERROR|OSM_LOG_SYS,
-			"do_do_lid_matrix_file_load: ERR 6305: "
+			"do_lid_matrix_file_load: ERR 6305: "
 			"cannot open lid matrix file \'%s\'; "
 			"using default lid matrix generation algorithm\n",
 			file_name);
-- 
1.4.3.2.g4bf7

From rdreier at cisco.com Wed Nov 1 13:39:07 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Nov 2006 13:39:07 -0800
Subject: [openib-general] Static linking with libibverbs
In-Reply-To: <20061101185831.GA9085@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 1 Nov 2006 20:58:31 +0200")
References: <2054C355-813A-4E62-80F0-CFCB610AF5C4@cisco.com> <20061101185831.GA9085@mellanox.co.il>
Message-ID:

 > static linking actually can be made to work even with older library versions.
 > See this HowTo (written on 02 of November, 2005).
 > https://openib.org/tiki/tiki-index.php?page=HowToFAQ

That's not really static linking. If you try to build a true static
executable, which contains static libc and in particular static libdl,
there's no way the old code can work, for multiple reasons.

For one thing, dlopen(NULL, RTLD_NOW) doesn't work on static
executables so libibverbs couldn't find a low-level driver that is
statically linked in.

Second, loading a low-level driver dynamically into a static
executable (which would be done even if the static driver was
sufficient for all the devices in the system) would bring in a dynamic
copy of libibverbs and hence a dynamic copy of libdl, which would
clash with the static copy of libdl already linked in and cause a
crash.

So static linking never really worked until libibverbs 1.0.4.
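
For readers unfamiliar with the dlopen(NULL, RTLD_NOW) lookup mentioned
above, here is a minimal sketch of the general pattern. It is not the
libibverbs source: the helper find_builtin_driver and the symbol name
example_driver_init are hypothetical and exist only for illustration.
It asks the dynamic loader for a handle on the running program and then
searches it for a driver entry point; when the handle or the symbol is
unavailable, as described for static executables, the lookup simply
yields NULL.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical driver entry point type, for illustration only. */
typedef int (*driver_init_fn)(void);

/*
 * Look for a driver entry point among the symbols already loaded into
 * the running program. Returns NULL if the program handle cannot be
 * obtained or if the symbol is not present.
 */
static driver_init_fn find_builtin_driver(const char *sym)
{
	driver_init_fn fn = NULL;
	void *self = dlopen(NULL, RTLD_NOW);	/* handle for the main program */

	if (!self)
		return NULL;

	/* POSIX idiom for storing a dlsym() result in a function pointer. */
	*(void **) (&fn) = dlsym(self, sym);
	return fn;
}
```

A caller would try find_builtin_driver("example_driver_init") first and
fall back to loading a driver from a shared object only when it returns
NULL. Depending on the C library version, linking may require -ldl.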
From vishal at endace.com Wed Nov 1 13:53:21 2006
From: vishal at endace.com (vishal)
Date: Thu, 02 Nov 2006 10:53:21 +1300
Subject: [openib-general] Error inserting ib_umad
Message-ID: <1162418001.5609.17.camel@julia.et.endace.com>

Hi,

I have installed OFED-1.1 rc7. Got the following error on trying
'modprobe ib_umad':-

FATAL: Error inserting ib_umad (/lib/modules/2.6.16.13-4-smp/kernel/drivers/infiniband/core/ib_umad.ko): Unknown symbol in module, or unknown parameter (see dmesg)

dmesg output:-

ib_umad: module not supported by Novell, setting U taint flag.
ib_umad: disagrees about version of symbol ib_unregister_client
ib_umad: Unknown symbol ib_unregister_client
ib_umad: Unknown symbol ib_get_mad_data_offset
ib_umad: disagrees about version of symbol ib_modify_port
ib_umad: Unknown symbol ib_modify_port
ib_umad: disagrees about version of symbol ib_create_ah
ib_umad: Unknown symbol ib_create_ah
ib_umad: disagrees about version of symbol ib_register_client
ib_umad: Unknown symbol ib_register_client
ib_umad: disagrees about version of symbol ib_unregister_mad_agent
ib_umad: Unknown symbol ib_unregister_mad_agent
ib_umad: Unknown symbol ib_response_mad
ib_umad: disagrees about version of symbol ib_post_send_mad
ib_umad: Unknown symbol ib_post_send_mad
ib_umad: disagrees about version of symbol ib_create_send_mad
ib_umad: Unknown symbol ib_create_send_mad
ib_umad: disagrees about version of symbol ib_set_client_data
ib_umad: Unknown symbol ib_set_client_data
ib_umad: disagrees about version of symbol ib_get_client_data
ib_umad: Unknown symbol ib_get_client_data
ib_umad: Unknown symbol ib_is_mad_class_rmpp
ib_umad: disagrees about version of symbol ib_free_send_mad
ib_umad: Unknown symbol ib_free_send_mad
ib_umad: disagrees about version of symbol ib_destroy_ah
ib_umad: Unknown symbol ib_destroy_ah
ib_umad: Unknown symbol ib_get_rmpp_segment
ib_umad: disagrees about version of symbol ib_register_mad_agent
ib_umad: Unknown symbol ib_register_mad_agent

I am using SUSE 
10.1 Enterprise x86_64.

Thanks!
Vishal

From python152 at gmail.com Wed Nov 1 13:52:49 2006
From: python152 at gmail.com (Oliver)
Date: Wed, 1 Nov 2006 16:52:49 -0500
Subject: [openib-general] question on QoS support
Message-ID: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com>

Hi, folks -

I am trying to verify and evaluate IB QoS support, running OpenSM as
the subnet manager. The perftest program has been extended to take the
SL as a command-line option instead of the default 0, and by modifying
the VL arbitration tables I expected to see traffic shaping take
place, but it did not.

More details on the configuration, in opensm.opts:

# QoS default options
qos_high_limit 255 # disable low priority table
qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 (corresponding to SL 2) a higher weight 8
qos_sl2vl 0,1,2,3,4, ... # no changes here

I think (though I have not verified it) that the Voltaire HCA we are
using can support 8 data VLs.

I don't have much more information to go on as to why QoS shaping is
not taking place. Any suggestions?

A related question: if I modify the QoS settings in the SM, do I need
to restart the SA on each host for it to see the changes? (I am hoping
not; I tried that in the test and it doesn't seem to make a
difference.)

Thanks for help.
-- 
Oliver

From rdreier at cisco.com Wed Nov 1 14:08:13 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Nov 2006 14:08:13 -0800
Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race
In-Reply-To: <45487B6C.2070408@voltaire.com> (Or Gerlitz's message of "Wed, 01 Nov 2006 12:48:12 +0200")
References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> <45487B6C.2070408@voltaire.com>
Message-ID:

 > what about enhancing xxx_destory_id() to sense that it was called from
 > this id callback context so the xxx module code defers the
 > destory_id() execution to run after the callback is over. This can be
 > done by writing at the id the pid of the thread running the callback
 > before going to the consumer and deleting it when the callback
 > returns. Then if in_callback(id) holds, have the destory_id() call
 > schedule itself to later stage, where it checks again etc.

Unfortunately I don't think this solves the module unloading race at
all: there is still a window where code in the client module callback
is running, but the callback has dropped all references etc. so the
client module will happily proceed to unload.

 > At the bottom line, users must call xxx_destory_id() explicitly the
 > xxx module would be able to handle in_callback situations.

I think this is actually a good point for the CM case at least.
Clients already have something registered with the CM (namely the CM
ID itself), so if we required all consumers to destroy their IDs
explicitly, then there's no reason to add additional client
registration. But I don't see a way to reconcile that with letting
callbacks destroy CM IDs.

 - R.
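
The deferral scheme quoted above (record which thread is running the
callback, and have destroy postpone itself when called from that
thread) can be sketched in user space roughly as follows. This is only
an illustration under stated assumptions: struct demo_id and every
demo_* function are hypothetical names, not ib_cm or rdma_cm API, and
the sketch is single-threaded with no locking. As the reply above
notes, detecting the in-callback case does not by itself close the
module unload window.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_id;
typedef int (*demo_handler)(struct demo_id *id);

/* Hypothetical stand-in for a cm_id; not a real kernel structure. */
struct demo_id {
	demo_handler handler;
	bool in_callback;	/* true while the handler is running */
	pthread_t cb_thread;	/* thread that is running the handler */
	bool destroy_pending;	/* destroy was requested from the handler */
	int *destroyed;		/* counts frees, for the demo only */
};

static void demo_free_id(struct demo_id *id)
{
	(*id->destroyed)++;
	free(id);
}

/* Destroy now, or defer if called from this id's own callback. */
static void demo_destroy_id(struct demo_id *id)
{
	if (id->in_callback && pthread_equal(id->cb_thread, pthread_self())) {
		id->destroy_pending = true;
		return;
	}
	demo_free_id(id);
}

/* Provider side: run the handler, then honor a deferred destroy. */
static void demo_deliver_event(struct demo_id *id)
{
	int ret;

	id->cb_thread = pthread_self();
	id->in_callback = true;
	ret = id->handler(id);
	id->in_callback = false;

	/* A non-zero return also means "destroy this id". */
	if (ret || id->destroy_pending)
		demo_free_id(id);
}

/* Consumer handler that destroys its own id from inside the callback. */
static int destroying_handler(struct demo_id *id)
{
	demo_destroy_id(id);
	return 0;
}

/* Returns how many times the id was freed during one event delivery. */
static int demo(void)
{
	int destroyed = 0;
	struct demo_id *id = calloc(1, sizeof *id);

	if (!id)
		return -1;
	id->handler = destroying_handler;
	id->destroyed = &destroyed;
	demo_deliver_event(id);
	return destroyed;
}
```

The point of the sketch is that the free happens once, after the
handler has returned, even though the handler itself asked for the
destroy.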
From patrick at fundum.net Wed Nov 1 14:12:25 2006
From: patrick at fundum.net (patrick at fundum.net)
Date: Wed, 1 Nov 2006 23:12:25 +0100 (CET)
Subject: [openib-general] PCI-Express card loses connectivity on quad opteron
Message-ID: <24204.62.140.137.30.1162419145.squirrel@mail.fundum.net>

Hi,

I am testing the Mellanox Technologies MT25208 InfiniHost III Ex (Tavor
compatibility mode) (rev a0) PCI-Express card on RHEL 4 upd 4. The
machine is a quad opteron running on x86_64. I'm using the OFED1.1
package.

After moving some data (a few Megs with netperf) over the IB network
the IB card loses network connectivity. Pinging the card itself still
works at that point.

Any thoughts?

From swise at opengridcomputing.com Wed Nov 1 14:13:56 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 01 Nov 2006 16:13:56 -0600
Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA
In-Reply-To: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com>
References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com>
Message-ID: <1162419236.6366.50.camel@stevo-desktop>

Sean,

This patch removes rdma_get/set_option(). Is that what you intended?

On Wed, 2006-10-25 at 13:49 -0700, Sean Hefty wrote:
> Updates the librdmacm to work with ABI version 3, which is the proposed
> kernel changes for inclusion in 2.6.20.
>
> Test programs are also updated.
> > Signed-off-by: Sean Hefty > --- > Index: include/rdma/rdma_cma_abi.h > =================================================================== > --- include/rdma/rdma_cma_abi.h (revision 9192) > +++ include/rdma/rdma_cma_abi.h (working copy) > @@ -33,14 +33,15 @@ > #ifndef RDMA_CMA_ABI_H > #define RDMA_CMA_ABI_H > > +#include > #include > > /* > * This file must be kept in sync with the kernel's version of rdma_user_cm.h > */ > > -#define RDMA_USER_CM_MIN_ABI_VERSION 1 > -#define RDMA_USER_CM_MAX_ABI_VERSION 2 > +#define RDMA_USER_CM_MIN_ABI_VERSION 3 > +#define RDMA_USER_CM_MAX_ABI_VERSION 3 > > #define RDMA_MAX_PRIVATE_DATA 256 > > @@ -60,7 +61,7 @@ enum { > UCMA_CMD_GET_EVENT, > UCMA_CMD_GET_OPTION, > UCMA_CMD_SET_OPTION, > - UCMA_CMD_GET_DST_ATTR, > + UCMA_CMD_ESTABLISH, > UCMA_CMD_JOIN_MCAST, > UCMA_CMD_LEAVE_MCAST > }; > @@ -71,11 +72,6 @@ struct ucma_abi_cmd_hdr { > __u16 out; > }; > > -struct ucma_abi_create_id_v1 { > - __u64 uid; > - __u64 response; > -}; > - > struct ucma_abi_create_id { > __u64 uid; > __u64 response; > @@ -133,7 +129,7 @@ struct ucma_abi_query_route_resp { > > struct ucma_abi_conn_param { > __u32 qp_num; > - __u32 qp_type; > + __u32 reserved; > __u8 private_data[RDMA_MAX_PRIVATE_DATA]; > __u8 private_data_len; > __u8 srq; > @@ -145,6 +141,15 @@ struct ucma_abi_conn_param { > __u8 valid; > }; > > +struct ucma_abi_ud_param { > + __u32 qp_num; > + __u32 qkey; > + struct ibv_kern_ah_attr ah_attr; > + __u8 private_data[RDMA_MAX_PRIVATE_DATA]; > + __u8 private_data_len; > + __u8 reserved[7]; > +}; > + > struct ucma_abi_connect { > struct ucma_abi_conn_param conn_param; > __u32 id; > @@ -180,25 +185,13 @@ struct ucma_abi_init_qp_attr { > __u32 qp_state; > }; > > -struct ucma_abi_join_mcast { > - __u32 id; > - struct sockaddr_in6 addr; > - __u64 uid; > -}; > - > -struct ucma_abi_leave_mcast { > +struct ucma_abi_establish { > __u32 id; > - struct sockaddr_in6 addr; > -}; > - > -struct ucma_abi_dst_attr_resp { > - __u32 remote_qpn; > - __u32 
remote_qkey; > - struct ibv_kern_ah_attr ah_attr; > }; > > -struct ucma_abi_get_dst_attr { > - __u64 response; > +struct ucma_abi_join_mcast { > + __u64 response; /* ucma_abi_create_id_resp */ > + __u64 uid; > struct sockaddr_in6 addr; > __u32 id; > }; > @@ -212,30 +205,10 @@ struct ucma_abi_event_resp { > __u32 id; > __u32 event; > __u32 status; > - __u8 private_data_len; > - __u8 reserved[3]; > - __u8 private_data[RDMA_MAX_PRIVATE_DATA]; > -}; > - > -struct ucma_abi_get_option { > - __u64 response; > - __u64 optval; > - __u32 id; > - __u32 level; > - __u32 optname; > - __u32 optlen; > -}; > - > -struct ucma_abi_get_option_resp { > - __u32 optlen; > -}; > - > -struct ucma_abi_set_option { > - __u64 optval; > - __u32 id; > - __u32 level; > - __u32 optname; > - __u32 optlen; > + union { > + struct ucma_abi_conn_param conn; > + struct ucma_abi_ud_param ud; > + } param; > }; > > #endif /* RDMA_CMA_ABI_H */ > Index: include/rdma/rdma_cma.h > =================================================================== > --- include/rdma/rdma_cma.h (revision 9272) > +++ include/rdma/rdma_cma.h (working copy) > @@ -61,11 +61,11 @@ enum rdma_port_space { > RDMA_PS_UDP = 0x0111, > }; > > -/* Protocol levels for get/set options. */ > -enum { > - RDMA_PROTO_IP = 0, > - RDMA_PROTO_IB = 1, > -}; > +/* > + * Global qkey value for all UD QPs and multicast groups created via the > + * RDMA CM. 
> + */ > +#define RDMA_UD_QKEY 0x01234567 > > struct ib_addr { > union ibv_gid sgid; > @@ -74,8 +74,12 @@ struct ib_addr { > }; > > struct rdma_addr { > - struct sockaddr_in6 src_addr; > - struct sockaddr_in6 dst_addr; > + struct sockaddr src_addr; > + uint8_t src_pad[sizeof(struct sockaddr_storage) - > + sizeof(struct sockaddr)]; > + struct sockaddr dst_addr; > + uint8_t dst_pad[sizeof(struct sockaddr_storage) - > + sizeof(struct sockaddr)]; > union { > struct ib_addr ibaddr; > } addr; > @@ -101,11 +105,25 @@ struct rdma_cm_id { > uint8_t port_num; > }; > > -struct rdma_multicast_data { > - void *context; > - struct sockaddr addr; > - uint8_t pad[sizeof(struct sockaddr_in6) - > - sizeof(struct sockaddr)]; > +struct rdma_conn_param { > + const void *private_data; > + uint8_t private_data_len; > + uint8_t responder_resources; > + uint8_t initiator_depth; > + uint8_t flow_control; > + uint8_t retry_count; /* ignored when accepting */ > + uint8_t rnr_retry_count; > + /* Fields below ignored if a QP is created on the rdma_cm_id. 
*/ > + uint8_t srq; > + uint32_t qp_num; > +}; > + > +struct rdma_ud_param { > + const void *private_data; > + uint8_t private_data_len; > + struct ibv_ah_attr ah_attr; > + uint32_t qp_num; > + uint32_t qkey; > }; > > struct rdma_cm_event { > @@ -113,8 +131,10 @@ struct rdma_cm_event { > struct rdma_cm_id *listen_id; > enum rdma_cm_event_type event; > int status; > - void *private_data; > - uint8_t private_data_len; > + union { > + struct rdma_conn_param conn; > + struct rdma_ud_param ud; > + } param; > }; > > /** > @@ -206,20 +226,6 @@ int rdma_create_qp(struct rdma_cm_id *id > */ > void rdma_destroy_qp(struct rdma_cm_id *id); > > -struct rdma_conn_param { > - const void *private_data; > - uint8_t private_data_len; > - uint8_t responder_resources; > - uint8_t initiator_depth; > - uint8_t flow_control; > - uint8_t retry_count; /* ignored when accepting */ > - uint8_t rnr_retry_count; > - /* Fields below ignored if a QP is created on the rdma_cm_id. */ > - uint8_t srq; > - uint32_t qp_num; > - enum ibv_qp_type qp_type; > -}; > - > /** > * rdma_connect - Initiate an active connection request. > * > @@ -251,6 +257,16 @@ int rdma_reject(struct rdma_cm_id *id, c > uint8_t private_data_len); > > /** > + * rdma_establish - Forces a connection state to established. > + * @id: Connection identifier to transition to established. > + * > + * This routine should be invoked by users who receive messages on a > + * QP before being notified that the connection has been established by the > + * RDMA CM. > + */ > +int rdma_establish(struct rdma_cm_id *id); > + > +/** > * rdma_disconnect - This function disconnects the associated QP and > * transitions it into the error state. > */ > @@ -298,40 +314,17 @@ int rdma_get_cm_event(struct rdma_event_ > */ > int rdma_ack_cm_event(struct rdma_cm_event *event); > > -/** > - * rdma_get_option - Retrieve options for an rdma_cm_id. > - * @id: Communication identifier to retrieve option for. 
> - * @level: Protocol level of the option to retrieve. > - * @optname: Name of the option to retrieve. > - * @optval: Buffer to receive the returned options. > - * @optlen: On input, the size of the %optval buffer. On output, the > - * size of the returned data. > - */ > -int rdma_get_option(struct rdma_cm_id *id, int level, int optname, > - void *optval, size_t *optlen); > - > -/** > - * rdma_set_option - Set options for an rdma_cm_id. > - * @id: Communication identifier to set option for. > - * @level: Protocol level of the option to set. > - * @optname: Name of the option to set. > - * @optval: Reference to the option data. > - * @optlen: The size of the %optval buffer. > - */ > -int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > - void *optval, size_t optlen); > - > static inline uint16_t rdma_get_src_port(struct rdma_cm_id *id) > { > - return id->route.addr.src_addr.sin6_family == PF_INET6 ? > - id->route.addr.src_addr.sin6_port : > + return id->route.addr.src_addr.sa_family == PF_INET6 ? > + ((struct sockaddr_in6 *) &id->route.addr.src_addr)->sin6_port : > ((struct sockaddr_in *) &id->route.addr.src_addr)->sin_port; > } > > static inline uint16_t rdma_get_dst_port(struct rdma_cm_id *id) > { > - return id->route.addr.dst_addr.sin6_family == PF_INET6 ? > - id->route.addr.dst_addr.sin6_port : > + return id->route.addr.dst_addr.sa_family == PF_INET6 ? > + ((struct sockaddr_in6 *) &id->route.addr.dst_addr)->sin6_port : > ((struct sockaddr_in *) &id->route.addr.dst_addr)->sin_port; > } > > Index: src/cma.c > =================================================================== > --- src/cma.c (revision 9696) > +++ src/cma.c (working copy) > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -54,7 +54,6 @@ > #include > #include > #include > -#include > > #define PFX "librdmacm: " > > @@ -116,6 +115,28 @@ struct cma_id_private { > pthread_cond_t cond; > pthread_mutex_t mut; > uint32_t handle; > + struct cma_multicast *mc_list; > +}; > + > +struct cma_multicast { > + struct cma_multicast *next; > + struct cma_id_private *id_priv; > + void *context; > + int events_completed; > + pthread_cond_t cond; > + uint32_t handle; > + union ibv_gid mgid; > + uint16_t mlid; > + struct sockaddr addr; > + uint8_t pad[sizeof(struct sockaddr_in6) - > + sizeof(struct sockaddr)]; > +}; > + > +struct cma_event { > + struct rdma_cm_event event; > + uint8_t private_data[RDMA_MAX_PRIVATE_DATA]; > + struct cma_id_private *id_priv; > + struct cma_multicast *mc; > }; > > static struct cma_device *cma_dev_array; > @@ -335,41 +356,6 @@ err: ucma_free_id(id_priv); > return NULL; > } > > -static int ucma_create_id_v1(struct rdma_event_channel *channel, > - struct rdma_cm_id **id, void *context, > - enum rdma_port_space ps) > -{ > - struct ucma_abi_create_id_resp *resp; > - struct ucma_abi_create_id_v1 *cmd; > - struct cma_id_private *id_priv; > - void *msg; > - int ret, size; > - > - if (ps != RDMA_PS_TCP) { > - fprintf(stderr, "librdmacm: Kernel ABI does not support " > - "requested port space.\n"); > - return -EPROTONOSUPPORT; > - } > - > - id_priv = ucma_alloc_id(channel, context, ps); > - if (!id_priv) > - return -ENOMEM; > - > - CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size); > - cmd->uid = (uintptr_t) id_priv; > - > - ret = write(channel->fd, msg, size); > - if (ret != size) > - goto err; > - > - id_priv->handle = resp->id; > - *id = &id_priv->id; > - return 0; > - > -err: ucma_free_id(id_priv); > - return ret; > -} > - > int rdma_create_id(struct rdma_event_channel *channel, > struct rdma_cm_id **id, void *context, > enum rdma_port_space ps) > @@ -384,9 +370,6 @@ int rdma_create_id(struct 
rdma_event_cha > if (ret) > return ret; > > - if (abi_ver == 1) > - return ucma_create_id_v1(channel, id, context, ps); > - > id_priv = ucma_alloc_id(channel, context, ps); > if (!id_priv) > return -ENOMEM; > @@ -492,9 +475,9 @@ static int ucma_query_route(struct rdma_ > sizeof id->route.addr.addr.ibaddr.dgid); > id->route.addr.addr.ibaddr.pkey = resp->ib_route[0].pkey; > memcpy(&id->route.addr.src_addr, &resp->src_addr, > - sizeof id->route.addr.src_addr); > + sizeof resp->src_addr); > memcpy(&id->route.addr.dst_addr, &resp->dst_addr, > - sizeof id->route.addr.dst_addr); > + sizeof resp->dst_addr); > > if (!id_priv->cma_dev && resp->node_guid) { > ret = ucma_get_device(id_priv, resp->node_guid); > @@ -696,7 +679,7 @@ static int ucma_init_ib_qp(struct cma_id > > qp_attr.port_num = id_priv->id.port_num; > qp_attr.qp_state = IBV_QPS_INIT; > - qp_attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE; > + qp_attr.qp_access_flags = 0; > return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_ACCESS_FLAGS | > IBV_QP_PKEY_INDEX | IBV_QP_PORT); > } > @@ -767,11 +750,9 @@ void rdma_destroy_qp(struct rdma_cm_id * > > static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, > struct rdma_conn_param *src, > - uint32_t qp_num, > - enum ibv_qp_type qp_type, uint8_t srq) > + uint32_t qp_num, uint8_t srq) > { > dst->qp_num = qp_num; > - dst->qp_type = qp_type; > dst->srq = srq; > dst->responder_resources = src->responder_resources; > dst->initiator_depth = src->initiator_depth; > @@ -799,12 +780,11 @@ int rdma_connect(struct rdma_cm_id *id, > cmd->id = id_priv->handle; > if (id->qp) > ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, > - id->qp->qp_num, id->qp->qp_type, > + id->qp->qp_num, > (id->qp->srq != NULL)); > else > ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, > conn_param->qp_num, > - conn_param->qp_type, > conn_param->srq); > > ret = write(id->channel->fd, msg, size); > @@ -852,12 +832,11 @@ int rdma_accept(struct rdma_cm_id *id, s > 
cmd->uid = (uintptr_t) id_priv; > if (id->qp) > ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, > - id->qp->qp_num, id->qp->qp_type, > + id->qp->qp_num, > (id->qp->srq != NULL)); > else > ucma_copy_conn_param_to_kern(&cmd->conn_param, conn_param, > conn_param->qp_num, > - conn_param->qp_type, > conn_param->srq); > > ret = write(id->channel->fd, msg, size); > @@ -894,6 +873,24 @@ int rdma_reject(struct rdma_cm_id *id, c > return 0; > } > > +int rdma_establish(struct rdma_cm_id *id) > +{ > + struct ucma_abi_establish *cmd; > + struct cma_id_private *id_priv; > + void *msg; > + int ret, size; > + > + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_ESTABLISH, size); > + > + id_priv = container_of(id, struct cma_id_private, id); > + cmd->id = id_priv->handle; > + ret = write(id->channel->fd, msg, size); > + if (ret != size) > + return (ret > 0) ? -ENODATA : ret; > + > + return 0; > +} > + > int rdma_disconnect(struct rdma_cm_id *id) > { > struct ucma_abi_disconnect *cmd; > @@ -929,74 +926,102 @@ int rdma_join_multicast(struct rdma_cm_i > void *context) > { > struct ucma_abi_join_mcast *cmd; > + struct ucma_abi_create_id_resp *resp; > struct cma_id_private *id_priv; > + struct cma_multicast *mc, **pos; > void *msg; > int ret, size, addrlen; > > + id_priv = container_of(id, struct cma_id_private, id); > addrlen = ucma_addrlen(addr); > if (!addrlen) > return -EINVAL; > > - CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_JOIN_MCAST, size); > - id_priv = container_of(id, struct cma_id_private, id); > + mc = malloc(sizeof *mc); > + if (!mc) > + return -ENOMEM; > + > + memset(mc, 0, sizeof *mc); > + mc->context = context; > + mc->id_priv = id_priv; > + memcpy(&mc->addr, addr, addrlen); > + if (pthread_cond_init(&id_priv->cond, NULL)) { > + ret = -1; > + goto err1; > + } > + > + pthread_mutex_lock(&id_priv->mut); > + mc->next = id_priv->mc_list; > + id_priv->mc_list = mc; > + pthread_mutex_unlock(&id_priv->mut); > + > + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_JOIN_MCAST, size); 
> cmd->id = id_priv->handle; > memcpy(&cmd->addr, addr, addrlen); > - cmd->uid = (uintptr_t) context; > + cmd->uid = (uintptr_t) mc; > > ret = write(id->channel->fd, msg, size); > - if (ret != size) > - return (ret > 0) ? -ENODATA : ret; > + if (ret != size) { > + ret = (ret > 0) ? -ENODATA : ret; > + goto err2; > + } > > + mc->handle = resp->id; > return 0; > +err2: > + pthread_mutex_lock(&id_priv->mut); > + for (pos = &id_priv->mc_list; *pos != mc; pos = &(*pos)->next) > + ; > + *pos = mc->next; > + pthread_mutex_unlock(&id_priv->mut); > +err1: > + free(mc); > + return ret; > } > > int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr) > { > - struct ucma_abi_leave_mcast *cmd; > + struct ucma_abi_destroy_id *cmd; > + struct ucma_abi_destroy_id_resp *resp; > struct cma_id_private *id_priv; > + struct cma_multicast *mc, **pos; > void *msg; > int ret, size, addrlen; > - struct ibv_ah_attr ah_attr; > - uint32_t qp_info; > > addrlen = ucma_addrlen(addr); > if (!addrlen) > return -EINVAL; > > - CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_LEAVE_MCAST, size); > id_priv = container_of(id, struct cma_id_private, id); > - cmd->id = id_priv->handle; > - memcpy(&cmd->addr, addr, addrlen); > + pthread_mutex_lock(&id_priv->mut); > + for (pos = &id_priv->mc_list; *pos; pos = &(*pos)->next) > + if (!memcmp(&(*pos)->addr, addr, addrlen)) > + break; > > - if (id->qp) { > - ret = rdma_get_dst_attr(id, addr, &ah_attr, &qp_info, &qp_info); > - if (ret) > - goto out; > + mc = *pos; > + if (*pos) > + *pos = mc->next; > + pthread_mutex_unlock(&id_priv->mut); > + if (!mc) > + return -EADDRNOTAVAIL; > > - ret = ibv_detach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); > - if (ret) > - goto out; > - } > + if (id->qp) > + ibv_detach_mcast(id->qp, &mc->mgid, mc->mlid); > > + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_LEAVE_MCAST, size); > + cmd->id = mc->handle; > + > ret = write(id->channel->fd, msg, size); > if (ret != size) > ret = (ret > 0) ? 
-ENODATA : ret; > -out: > - return ret; > -} > > -static void ucma_copy_event_from_kern(struct rdma_cm_event *dst, > - struct ucma_abi_event_resp *src) > -{ > - dst->event = src->event; > - dst->status = src->status; > - dst->private_data_len = src->private_data_len; > - if (src->private_data_len) { > - dst->private_data = dst + 1; > - memcpy(dst->private_data, src->private_data, > - src->private_data_len); > - } else > - dst->private_data = NULL; > + pthread_mutex_lock(&id_priv->mut); > + while (mc->events_completed < resp->events_reported) > + pthread_cond_wait(&mc->cond, &id_priv->mut); > + pthread_mutex_unlock(&id_priv->mut); > + > + free(mc); > + return ret; > } > > static void ucma_complete_event(struct cma_id_private *id_priv) > @@ -1007,38 +1032,49 @@ static void ucma_complete_event(struct c > pthread_mutex_unlock(&id_priv->mut); > } > > +static void ucma_complete_mc_event(struct cma_multicast *mc) > +{ > + pthread_mutex_lock(&mc->id_priv->mut); > + mc->events_completed++; > + pthread_cond_signal(&mc->cond); > + mc->id_priv->events_completed++; > + pthread_cond_signal(&mc->id_priv->cond); > + pthread_mutex_unlock(&mc->id_priv->mut); > +} > + > int rdma_ack_cm_event(struct rdma_cm_event *event) > { > - struct rdma_cm_id *id; > + struct cma_event *evt; > > if (!event) > return -EINVAL; > > - id = (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) ? 
> - event->listen_id : event->id; > + evt = container_of(event, struct cma_event, event); > > - ucma_complete_event(container_of(id, struct cma_id_private, id)); > - free(event); > + if (evt->mc) > + ucma_complete_mc_event(evt->mc); > + else > + ucma_complete_event(evt->id_priv); > + free(evt); > return 0; > } > > -static int ucma_process_conn_req(struct rdma_cm_event *event, > +static int ucma_process_conn_req(struct cma_event *evt, > uint32_t handle) > { > - struct cma_id_private *listen_id_priv, *id_priv; > + struct cma_id_private *id_priv; > int ret; > > - listen_id_priv = container_of(event->id, struct cma_id_private, id); > - id_priv = ucma_alloc_id(event->id->channel, event->id->context, > - event->id->ps); > + id_priv = ucma_alloc_id(evt->id_priv->id.channel, > + evt->id_priv->id.context, evt->id_priv->id.ps); > if (!id_priv) { > - ucma_destroy_kern_id(event->id->channel->fd, handle); > + ucma_destroy_kern_id(evt->id_priv->id.channel->fd, handle); > ret = -ENOMEM; > goto err; > } > > - event->listen_id = event->id; > - event->id = &id_priv->id; > + evt->event.listen_id = &evt->id_priv->id; > + evt->event.id = &id_priv->id; > id_priv->handle = handle; > > ret = ucma_query_route(&id_priv->id); > @@ -1049,7 +1085,7 @@ static int ucma_process_conn_req(struct > > return 0; > err: > - ucma_complete_event(listen_id_priv); > + ucma_complete_event(evt->id_priv); > return ret; > } > > @@ -1093,34 +1129,54 @@ static int ucma_process_establish(struct > return ret; > } > > -static void ucma_process_mcast(struct rdma_cm_id *id, struct rdma_cm_event *evt) > +static int ucma_process_join(struct cma_event *evt) > { > - struct ucma_abi_join_mcast kmc_data; > - struct rdma_multicast_data *mc_data; > - struct ibv_ah_attr ah_attr; > - uint32_t qp_info; > - > - kmc_data = *(struct ucma_abi_join_mcast *) evt->private_data; > - > - mc_data = evt->private_data; > - mc_data->context = (void *) (uintptr_t) kmc_data.uid; > - memcpy(&mc_data->addr, &kmc_data.addr, > - 
ucma_addrlen((struct sockaddr *) &kmc_data.addr)); > - > - if (evt->status || !id->qp) > - return; > - > - evt->status = rdma_get_dst_attr(id, &mc_data->addr, &ah_attr, > - &qp_info, &qp_info); > - if (evt->status) > - goto err; > + evt->mc->mgid = evt->event.param.ud.ah_attr.grh.dgid; > + evt->mc->mlid = evt->event.param.ud.ah_attr.dlid; > > - evt->status = ibv_attach_mcast(id->qp, &ah_attr.grh.dgid, ah_attr.dlid); > - if (evt->status) > - goto err; > - return; > -err: > - evt->event = RDMA_CM_EVENT_MULTICAST_ERROR; > + if (evt->id_priv->id.qp) > + return ibv_attach_mcast(evt->id_priv->id.qp, > + &evt->mc->mgid, evt->mc->mlid); > + else > + return 0; > +} > + > +static void ucma_copy_conn_event(struct cma_event *event, > + struct ucma_abi_conn_param *src) > +{ > + struct rdma_conn_param *dst = &event->event.param.conn; > + > + dst->private_data_len = src->private_data_len; > + if (src->private_data_len) { > + dst->private_data = &event->private_data; > + memcpy(&event->private_data, src->private_data, > + src->private_data_len); > + } > + > + dst->responder_resources = src->responder_resources; > + dst->initiator_depth = src->initiator_depth; > + dst->flow_control = src->flow_control; > + dst->retry_count = src->retry_count; > + dst->rnr_retry_count = src->rnr_retry_count; > + dst->srq = src->srq; > + dst->qp_num = src->qp_num; > +} > + > +static void ucma_copy_ud_event(struct cma_event *event, > + struct ucma_abi_ud_param *src) > +{ > + struct rdma_ud_param *dst = &event->event.param.ud; > + > + dst->private_data_len = src->private_data_len; > + if (src->private_data_len) { > + dst->private_data = &event->private_data; > + memcpy(&event->private_data, src->private_data, > + src->private_data_len); > + } > + > + ibv_copy_ah_attr_from_kern(&dst->ah_attr, &src->ah_attr); > + dst->qp_num = src->qp_num; > + dst->qkey = src->qkey; > } > > int rdma_get_cm_event(struct rdma_event_channel *channel, > @@ -1128,8 +1184,7 @@ int rdma_get_cm_event(struct rdma_event_ > { > 
struct ucma_abi_event_resp *resp; > struct ucma_abi_get_event *cmd; > - struct cma_id_private *id_priv; > - struct rdma_cm_event *evt; > + struct cma_event *evt; > void *msg; > int ret, size; > > @@ -1140,155 +1195,119 @@ int rdma_get_cm_event(struct rdma_event_ > if (!event) > return -EINVAL; > > - evt = malloc(sizeof *evt + RDMA_MAX_PRIVATE_DATA); > + evt = malloc(sizeof *evt); > if (!evt) > return -ENOMEM; > > retry: > + memset(evt, 0, sizeof *evt); > CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_EVENT, size); > ret = write(channel->fd, msg, size); > if (ret != size) { > free(evt); > return (ret > 0) ? -ENODATA : ret; > } > - > - id_priv = (void *) (uintptr_t) resp->uid; > - evt->id = &id_priv->id; > - ucma_copy_event_from_kern(evt, resp); > > - switch (evt->event) { > + evt->event.event = resp->event; > + switch (resp->event) { > case RDMA_CM_EVENT_ADDR_RESOLVED: > - evt->status = ucma_query_route(&id_priv->id); > - if (evt->status) > - evt->event = RDMA_CM_EVENT_ADDR_ERROR; > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + evt->event.id = &evt->id_priv->id; > + evt->event.status = ucma_query_route(&evt->id_priv->id); > + if (evt->event.status) > + evt->event.event = RDMA_CM_EVENT_ADDR_ERROR; > break; > case RDMA_CM_EVENT_ROUTE_RESOLVED: > - evt->status = ucma_query_route(&id_priv->id); > - if (evt->status) > - evt->event = RDMA_CM_EVENT_ROUTE_ERROR; > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + evt->event.id = &evt->id_priv->id; > + evt->event.status = ucma_query_route(&evt->id_priv->id); > + if (evt->event.status) > + evt->event.event = RDMA_CM_EVENT_ROUTE_ERROR; > break; > case RDMA_CM_EVENT_CONNECT_REQUEST: > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + if (evt->id_priv->id.ps == RDMA_PS_TCP) > + ucma_copy_conn_event(evt, &resp->param.conn); > + else > + ucma_copy_ud_event(evt, &resp->param.ud); > + > ret = ucma_process_conn_req(evt, resp->id); > if (ret) > goto retry; > break; > case RDMA_CM_EVENT_CONNECT_RESPONSE: > - 
evt->status = ucma_process_conn_resp(id_priv); > - if (!evt->status) > - evt->event = RDMA_CM_EVENT_ESTABLISHED; > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + evt->event.id = &evt->id_priv->id; > + ucma_copy_conn_event(evt, &resp->param.conn); > + evt->event.status = ucma_process_conn_resp(evt->id_priv); > + if (!evt->event.status) > + evt->event.event = RDMA_CM_EVENT_ESTABLISHED; > else { > - evt->event = RDMA_CM_EVENT_CONNECT_ERROR; > - id_priv->connect_error = 1; > + evt->event.event = RDMA_CM_EVENT_CONNECT_ERROR; > + evt->id_priv->connect_error = 1; > } > break; > case RDMA_CM_EVENT_ESTABLISHED: > - if (id_priv->id.ps == RDMA_PS_UDP) > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + evt->event.id = &evt->id_priv->id; > + if (evt->id_priv->id.ps == RDMA_PS_UDP) { > + ucma_copy_ud_event(evt, &resp->param.ud); > break; > + } > > - evt->status = ucma_process_establish(&id_priv->id); > - if (evt->status) { > - evt->event = RDMA_CM_EVENT_CONNECT_ERROR; > - id_priv->connect_error = 1; > + ucma_copy_conn_event(evt, &resp->param.conn); > + evt->event.status = ucma_process_establish(&evt->id_priv->id); > + if (evt->event.status) { > + evt->event.event = RDMA_CM_EVENT_CONNECT_ERROR; > + evt->id_priv->connect_error = 1; > } > break; > case RDMA_CM_EVENT_REJECTED: > - if (id_priv->connect_error) { > - ucma_complete_event(id_priv); > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + if (evt->id_priv->connect_error) { > + ucma_complete_event(evt->id_priv); > goto retry; > } > - ucma_modify_qp_err(evt->id); > + evt->event.id = &evt->id_priv->id; > + ucma_copy_conn_event(evt, &resp->param.conn); > + ucma_modify_qp_err(evt->event.id); > break; > case RDMA_CM_EVENT_DISCONNECTED: > - if (id_priv->connect_error) { > - ucma_complete_event(id_priv); > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + if (evt->id_priv->connect_error) { > + ucma_complete_event(evt->id_priv); > goto retry; > } > + evt->event.id = &evt->id_priv->id; > + ucma_copy_conn_event(evt, 
&resp->param.conn); > break; > case RDMA_CM_EVENT_MULTICAST_JOIN: > + evt->mc = (void *) (uintptr_t) resp->uid; > + evt->id_priv = evt->mc->id_priv; > + evt->event.id = &evt->id_priv->id; > + ucma_copy_ud_event(evt, &resp->param.ud); > + evt->event.param.ud.private_data = evt->mc->context; > + evt->event.status = ucma_process_join(evt); > + if (evt->event.status) > + evt->event.event = RDMA_CM_EVENT_MULTICAST_ERROR; > + break; > case RDMA_CM_EVENT_MULTICAST_ERROR: > - ucma_process_mcast(&id_priv->id, evt); > + evt->mc = (void *) (uintptr_t) resp->uid; > + evt->id_priv = evt->mc->id_priv; > + evt->event.id = &evt->id_priv->id; > + evt->event.status = resp->status; > + evt->event.param.ud.private_data = evt->mc->context; > break; > default: > + evt->id_priv = (void *) (uintptr_t) resp->uid; > + evt->event.id = &evt->id_priv->id; > + if (evt->id_priv->id.ps == RDMA_PS_TCP) > + ucma_copy_conn_event(evt, &resp->param.conn); > + else > + ucma_copy_ud_event(evt, &resp->param.ud); > break; > } > > - *event = evt; > - return 0; > -} > - > -int rdma_get_option(struct rdma_cm_id *id, int level, int optname, > - void *optval, size_t *optlen) > -{ > - struct ucma_abi_get_option_resp *resp; > - struct ucma_abi_get_option *cmd; > - struct cma_id_private *id_priv; > - void *msg; > - int ret, size; > - > - CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_OPTION, size); > - id_priv = container_of(id, struct cma_id_private, id); > - cmd->id = id_priv->handle; > - cmd->optval = (uintptr_t) optval; > - cmd->level = level; > - cmd->optname = optname; > - cmd->optlen = *optlen; > - > - ret = write(id->channel->fd, msg, size); > - if (ret != size) > - return (ret > 0) ? 
-ENODATA : ret; > - > - *optlen = resp->optlen; > - return 0; > -} > - > -int rdma_set_option(struct rdma_cm_id *id, int level, int optname, > - void *optval, size_t optlen) > -{ > - struct ucma_abi_set_option *cmd; > - struct cma_id_private *id_priv; > - void *msg; > - int ret, size; > - > - CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_SET_OPTION, size); > - id_priv = container_of(id, struct cma_id_private, id); > - cmd->id = id_priv->handle; > - cmd->optval = (uintptr_t) optval; > - cmd->level = level; > - cmd->optname = optname; > - cmd->optlen = optlen; > - > - ret = write(id->channel->fd, msg, size); > - if (ret != size) > - return (ret > 0) ? -ENODATA : ret; > - > - return 0; > -} > - > -int rdma_get_dst_attr(struct rdma_cm_id *id, struct sockaddr *addr, > - struct ibv_ah_attr *ah_attr, uint32_t *remote_qpn, > - uint32_t *remote_qkey) > -{ > - struct ucma_abi_dst_attr_resp *resp; > - struct ucma_abi_get_dst_attr *cmd; > - struct cma_id_private *id_priv; > - void *msg; > - int ret, size, addrlen; > - > - addrlen = ucma_addrlen(addr); > - if (!addrlen) > - return -EINVAL; > - > - CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_GET_DST_ATTR, size); > - id_priv = container_of(id, struct cma_id_private, id); > - cmd->id = id_priv->handle; > - memcpy(&cmd->addr, addr, addrlen); > - > - ret = write(id->channel->fd, msg, size); > - if (ret != size) > - return (ret > 0) ? 
-ENODATA : ret; > - > - ibv_copy_ah_attr_from_kern(ah_attr, &resp->ah_attr); > - *remote_qpn = resp->remote_qpn; > - *remote_qkey = resp->remote_qkey; > + *event = &evt->event; > return 0; > } > Index: Makefile.am > =================================================================== > --- Makefile.am (revision 9192) > +++ Makefile.am (working copy) > @@ -31,12 +31,10 @@ examples_mckey_LDADD = $(top_builddir)/s > librdmacmincludedir = $(includedir)/rdma > > librdmacminclude_HEADERS = include/rdma/rdma_cma_abi.h \ > - include/rdma/rdma_cma.h \ > - include/rdma/rdma_cma_ib.h > + include/rdma/rdma_cma.h > > EXTRA_DIST = include/rdma/rdma_cma_abi.h \ > include/rdma/rdma_cma.h \ > - include/rdma/rdma_cma_ib.h \ > src/librdmacm.map \ > librdmacm.spec.in > > Index: examples/mckey.c > =================================================================== > --- examples/mckey.c (revision 9208) > +++ examples/mckey.c (working copy) > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -42,9 +42,9 @@ > #include > #include > #include > +#include > > #include > -#include > > struct cmatest_node { > int id; > @@ -76,6 +76,8 @@ static int connections = 1; > static int message_size = 100; > static int message_count = 10; > static int is_sender; > +static char *dst_addr; > +static char *src_addr; > > static int create_message(struct cmatest_node *node) > { > @@ -239,19 +241,12 @@ err: > return ret; > } > > -static int join_handler(struct cmatest_node *node) > +static int join_handler(struct cmatest_node *node, > + struct rdma_ud_param *param) > { > - struct ibv_ah_attr ah_attr; > - int ret; > - > - ret = rdma_get_dst_attr(node->cma_id, test.dst_addr, &ah_attr, > - &node->remote_qpn, &node->remote_qkey); > - if (ret) { > - printf("mckey: failure getting destination attributes\n"); > - goto err; > - } > - > - node->ah = ibv_create_ah(node->pd, &ah_attr); > + node->remote_qpn = param->qp_num; > + node->remote_qkey = param->qkey; > + node->ah = ibv_create_ah(node->pd, &param->ah_attr); > if (!node->ah) { > printf("mckey: failure creating address handle\n"); > goto err; > @@ -262,7 +257,7 @@ static int join_handler(struct cmatest_n > return 0; > err: > connect_error(); > - return ret; > + return -1; > } > > static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) > @@ -274,7 +269,7 @@ static int cma_handler(struct rdma_cm_id > ret = addr_handler(cma_id->context); > break; > case RDMA_CM_EVENT_MULTICAST_JOIN: > - ret = join_handler(cma_id->context); > + ret = join_handler(cma_id->context, &event->param.ud); > break; > case RDMA_CM_EVENT_ADDR_ERROR: > case RDMA_CM_EVENT_ROUTE_ERROR: > @@ -411,18 +406,21 @@ out: > return ret; > } > > -static int run(char *dst, char *src) > +static int run(void) > { > int i, ret; > > - printf("mckey: starting client\n"); > - if (src) { > - ret = get_addr(src, &test.src_in); > + if (is_sender) > + printf("mckey: starting client\n"); > + else > + 
printf("mckey: starting server\n"); > + if (src_addr) { > + ret = get_addr(src_addr, &test.src_in); > if (ret) > return ret; > } > > - ret = get_addr(dst, &test.dst_in); > + ret = get_addr(dst_addr, &test.dst_in); > if (ret) > return ret; > > @@ -431,7 +429,7 @@ static int run(char *dst, char *src) > printf("mckey: joining\n"); > for (i = 0; i < connections; i++) { > ret = rdma_resolve_addr(test.nodes[i].cma_id, > - src ? test.src_addr : NULL, > + src_addr ? test.src_addr : NULL, > test.dst_addr, 2000); > if (ret) { > printf("mckey: failure getting addr: %d\n", ret); > @@ -472,14 +470,39 @@ out: > > int main(int argc, char **argv) > { > - int ret; > + int op, ret; > > - if (argc < 3 || argc > 4) { > - printf("usage: %s {s[end] | r[ecv]} mcast_addr [bind_addr]]\n", > - argv[0]); > - exit(1); > + while ((op = getopt(argc, argv, "m:sb:c:C:S:")) != -1) { > + switch (op) { > + case 'm': > + dst_addr = optarg; > + break; > + case 's': > + is_sender = 1; > + break; > + case 'b': > + src_addr = optarg; > + break; > + case 'c': > + connections = atoi(optarg); > + break; > + case 'C': > + message_count = atoi(optarg); > + break; > + case 'S': > + message_size = atoi(optarg); > + break; > + default: > + printf("usage: %s\n", argv[0]); > + printf("\t-m multicast_address\n"); > + printf("\t[-s(ender)]\n"); > + printf("\t[-b bind_address]\n"); > + printf("\t[-c connections]\n"); > + printf("\t[-C message_count]\n"); > + printf("\t[-S message_size]\n"); > + exit(1); > + } > } > - is_sender = (argv[1][0] == 's'); > > test.dst_addr = (struct sockaddr *) &test.dst_in; > test.src_addr = (struct sockaddr *) &test.src_in; > @@ -494,7 +517,7 @@ int main(int argc, char **argv) > if (alloc_nodes()) > exit(1); > > - ret = run(argv[2], (argc == 4) ? 
argv[3] : NULL); > + ret = run(); > > printf("test complete\n"); > destroy_nodes(); > Index: examples/udaddy.c > =================================================================== > --- examples/udaddy.c (revision 9208) > +++ examples/udaddy.c (working copy) > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -41,15 +41,9 @@ > #include > #include > #include > +#include > > #include > -#include > - > -/* > - * To execute: > - * Server: udaddy > - * Client: udaddy [server_addr [src_addr]] > - */ > > struct cmatest_node { > int id; > @@ -80,7 +74,8 @@ static struct cmatest test; > static int connections = 1; > static int message_size = 100; > static int message_count = 10; > -static int is_server; > +static char *dst_addr; > +static char *src_addr; > > static int create_message(struct cmatest_node *node) > { > @@ -246,7 +241,6 @@ static int route_handler(struct cmatest_ > > memset(&conn_param, 0, sizeof conn_param); > conn_param.qp_num = node->cma_id->qp->qp_num; > - conn_param.qp_type = node->cma_id->qp->qp_type; > conn_param.retry_count = 5; > ret = rdma_connect(node->cma_id, &conn_param); > if (ret) { > @@ -284,7 +278,6 @@ static int connect_handler(struct rdma_c > > memset(&conn_param, 0, sizeof conn_param); > conn_param.qp_num = node->cma_id->qp->qp_num; > - conn_param.qp_type = node->cma_id->qp->qp_type; > ret = rdma_accept(node->cma_id, &conn_param); > if (ret) { > printf("udaddy: failure accepting: %d\n", ret); > @@ -303,19 +296,12 @@ err1: > return ret; > } > > -static int resolved_handler(struct cmatest_node *node) > +static int resolved_handler(struct cmatest_node *node, > + struct rdma_cm_event *event) > { > - struct ibv_ah_attr ah_attr; > - int ret; > - > - ret = rdma_get_dst_attr(node->cma_id, 
test.dst_addr, &ah_attr, > - &node->remote_qpn, &node->remote_qkey); > - if (ret) { > - printf("udaddy: failure getting destination attributes\n"); > - goto err; > - } > - > - node->ah = ibv_create_ah(node->pd, &ah_attr); > + node->remote_qpn = event->param.ud.qp_num; > + node->remote_qkey = event->param.ud.qkey; > + node->ah = ibv_create_ah(node->pd, &event->param.ud.ah_attr); > if (!node->ah) { > printf("udaddy: failure creating address handle\n"); > goto err; > @@ -326,7 +312,7 @@ static int resolved_handler(struct cmate > return 0; > err: > connect_error(); > - return ret; > + return -1; > } > > static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) > @@ -344,7 +330,7 @@ static int cma_handler(struct rdma_cm_id > ret = connect_handler(cma_id); > break; > case RDMA_CM_EVENT_ESTABLISHED: > - ret = resolved_handler(cma_id->context); > + ret = resolved_handler(cma_id->context, event); > break; > case RDMA_CM_EVENT_ADDR_ERROR: > case RDMA_CM_EVENT_ROUTE_ERROR: > @@ -404,7 +390,7 @@ static int alloc_nodes(void) > > for (i = 0; i < connections; i++) { > test.nodes[i].id = i; > - if (!is_server) { > + if (dst_addr) { > ret = rdma_create_id(test.channel, > &test.nodes[i].cma_id, > &test.nodes[i], RDMA_PS_UDP); > @@ -475,6 +461,28 @@ static int connect_events(void) > return ret; > } > > +static int get_addr(char *dst, struct sockaddr_in *addr) > +{ > + struct addrinfo *res; > + int ret; > + > + ret = getaddrinfo(dst, NULL, NULL, &res); > + if (ret) { > + printf("getaddrinfo failed - invalid hostname or IP address\n"); > + return ret; > + } > + > + if (res->ai_family != PF_INET) { > + ret = -1; > + goto out; > + } > + > + *addr = *(struct sockaddr_in *) res->ai_addr; > +out: > + freeaddrinfo(res); > + return ret; > +} > + > static int run_server(void) > { > struct rdma_cm_id *listen_id; > @@ -487,7 +495,13 @@ static int run_server(void) > return ret; > } > > - test.src_in.sin_family = PF_INET; > + if (src_addr) { > + ret = get_addr(src_addr, 
&test.src_in); > + if (ret) > + goto out; > + } else > + test.src_in.sin_family = PF_INET; > + > test.src_in.sin_port = 7174; > ret = rdma_bind_addr(listen_id, test.src_addr); > if (ret) { > @@ -526,40 +540,18 @@ out: > return ret; > } > > -static int get_addr(char *dst, struct sockaddr_in *addr) > -{ > - struct addrinfo *res; > - int ret; > - > - ret = getaddrinfo(dst, NULL, NULL, &res); > - if (ret) { > - printf("getaddrinfo failed - invalid hostname or IP address\n"); > - return ret; > - } > - > - if (res->ai_family != PF_INET) { > - ret = -1; > - goto out; > - } > - > - *addr = *(struct sockaddr_in *) res->ai_addr; > -out: > - freeaddrinfo(res); > - return ret; > -} > - > -static int run_client(char *dst, char *src) > +static int run_client(void) > { > int i, ret; > > printf("udaddy: starting client\n"); > - if (src) { > - ret = get_addr(src, &test.src_in); > + if (src_addr) { > + ret = get_addr(src_addr, &test.src_in); > if (ret) > return ret; > } > > - ret = get_addr(dst, &test.dst_in); > + ret = get_addr(dst_addr, &test.dst_in); > if (ret) > return ret; > > @@ -568,7 +560,7 @@ static int run_client(char *dst, char *s > printf("udaddy: connecting\n"); > for (i = 0; i < connections; i++) { > ret = rdma_resolve_addr(test.nodes[i].cma_id, > - src ? test.src_addr : NULL, > + src_addr ? 
test.src_addr : NULL, > test.dst_addr, 2000); > if (ret) { > printf("udaddy: failure getting addr: %d\n", ret); > @@ -601,13 +593,35 @@ out: > > int main(int argc, char **argv) > { > - int ret; > + int op, ret; > > - if (argc > 3) { > - printf("usage: %s [server_addr [src_addr]]\n", argv[0]); > - exit(1); > + while ((op = getopt(argc, argv, "s:b:c:C:S:")) != -1) { > + switch (op) { > + case 's': > + dst_addr = optarg; > + break; > + case 'b': > + src_addr = optarg; > + break; > + case 'c': > + connections = atoi(optarg); > + break; > + case 'C': > + message_count = atoi(optarg); > + break; > + case 'S': > + message_size = atoi(optarg); > + break; > + default: > + printf("usage: %s\n", argv[0]); > + printf("\t[-s server_address]\n"); > + printf("\t[-b bind_address]\n"); > + printf("\t[-c connections]\n"); > + printf("\t[-C message_count]\n"); > + printf("\t[-S message_size]\n"); > + exit(1); > + } > } > - is_server = (argc == 1); > > test.dst_addr = (struct sockaddr *) &test.dst_in; > test.src_addr = (struct sockaddr *) &test.src_in; > @@ -622,10 +636,10 @@ int main(int argc, char **argv) > if (alloc_nodes()) > exit(1); > > - if (is_server) > - ret = run_server(); > + if (dst_addr) > + ret = run_client(); > else > - ret = run_client(argv[1], (argc == 3) ? argv[2] : NULL); > + ret = run_server(); > > printf("test complete\n"); > destroy_nodes(); > Index: examples/cmatose.c > =================================================================== > --- examples/cmatose.c (revision 9192) > +++ examples/cmatose.c (working copy) > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2005 Intel Corporation. All rights reserved. > + * Copyright (c) 2005-2006 Intel Corporation. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. 
You may choose to be licensed under the terms of the GNU > @@ -41,6 +41,7 @@ > #include > #include > #include > +#include > > #include > > @@ -52,12 +53,6 @@ static inline uint64_t cpu_to_be64(uint6 > static inline uint32_t cpu_to_be32(uint32_t x) { return bswap_32(x); } > #endif > > -/* > - * To execute: > - * Server: rdma_cmatose > - * Client: rdma_cmatose > - */ > - > struct cmatest_node { > int id; > struct rdma_cm_id *cma_id; > @@ -85,7 +80,8 @@ static struct cmatest test; > static int connections = 1; > static int message_size = 100; > static int message_count = 10; > -static int is_server; > +static char *dst_addr; > +static char *src_addr; > > static int create_message(struct cmatest_node *node) > { > @@ -377,7 +373,7 @@ static int alloc_nodes(void) > > for (i = 0; i < connections; i++) { > test.nodes[i].id = i; > - if (!is_server) { > + if (dst_addr) { > ret = rdma_create_id(test.channel, > &test.nodes[i].cma_id, > &test.nodes[i], RDMA_PS_TCP); > @@ -460,6 +456,28 @@ static int disconnect_events(void) > return ret; > } > > +static int get_addr(char *dst, struct sockaddr_in *addr) > +{ > + struct addrinfo *res; > + int ret; > + > + ret = getaddrinfo(dst, NULL, NULL, &res); > + if (ret) { > + printf("getaddrinfo failed - invalid hostname or IP address\n"); > + return ret; > + } > + > + if (res->ai_family != PF_INET) { > + ret = -1; > + goto out; > + } > + > + *addr = *(struct sockaddr_in *) res->ai_addr; > +out: > + freeaddrinfo(res); > + return ret; > +} > + > static int run_server(void) > { > struct rdma_cm_id *listen_id; > @@ -472,12 +490,18 @@ static int run_server(void) > return ret; > } > > - test.src_in.sin_family = PF_INET; > + if (src_addr) { > + ret = get_addr(src_addr, &test.src_in); > + if (ret) > + goto out; > + } else > + test.src_in.sin_family = PF_INET; > + > test.src_in.sin_port = 7471; > ret = rdma_bind_addr(listen_id, test.src_addr); > if (ret) { > printf("cmatose: bind address failed: %d\n", ret); > - return ret; > + goto out; > } > > ret 
= rdma_listen(listen_id, 0); > @@ -528,40 +552,18 @@ out: > return ret; > } > > -static int get_addr(char *dst, struct sockaddr_in *addr) > -{ > - struct addrinfo *res; > - int ret; > - > - ret = getaddrinfo(dst, NULL, NULL, &res); > - if (ret) { > - printf("getaddrinfo failed - invalid hostname or IP address\n"); > - return ret; > - } > - > - if (res->ai_family != PF_INET) { > - ret = -1; > - goto out; > - } > - > - *addr = *(struct sockaddr_in *) res->ai_addr; > -out: > - freeaddrinfo(res); > - return ret; > -} > - > -static int run_client(char *dst, char *src) > +static int run_client(void) > { > int i, ret, ret2; > > printf("cmatose: starting client\n"); > - if (src) { > - ret = get_addr(src, &test.src_in); > + if (src_addr) { > + ret = get_addr(src_addr, &test.src_in); > if (ret) > return ret; > } > > - ret = get_addr(dst, &test.dst_in); > + ret = get_addr(dst_addr, &test.dst_in); > if (ret) > return ret; > > @@ -570,7 +572,7 @@ static int run_client(char *dst, char *s > printf("cmatose: connecting\n"); > for (i = 0; i < connections; i++) { > ret = rdma_resolve_addr(test.nodes[i].cma_id, > - src ? test.src_addr : NULL, > + src_addr ? 
test.src_addr : NULL, > test.dst_addr, 2000); > if (ret) { > printf("cmatose: failure getting addr: %d\n", ret); > @@ -597,7 +599,6 @@ static int run_client(char *dst, char *s > } > > printf("data transfers complete\n"); > - > } > > ret = 0; > @@ -611,13 +612,35 @@ out: > > int main(int argc, char **argv) > { > - int ret; > + int op, ret; > > - if (argc > 3) { > - printf("usage: %s [server_addr [src_addr]]\n", argv[0]); > - exit(1); > + while ((op = getopt(argc, argv, "s:b:c:C:S:")) != -1) { > + switch (op) { > + case 's': > + dst_addr = optarg; > + break; > + case 'b': > + src_addr = optarg; > + break; > + case 'c': > + connections = atoi(optarg); > + break; > + case 'C': > + message_count = atoi(optarg); > + break; > + case 'S': > + message_size = atoi(optarg); > + break; > + default: > + printf("usage: %s\n", argv[0]); > + printf("\t[-s server_address]\n"); > + printf("\t[-b bind_address]\n"); > + printf("\t[-c connections]\n"); > + printf("\t[-C message_count]\n"); > + printf("\t[-S message_size]\n"); > + exit(1); > + } > } > - is_server = (argc == 1); > > test.dst_addr = (struct sockaddr *) &test.dst_in; > test.src_addr = (struct sockaddr *) &test.src_in; > @@ -633,10 +656,10 @@ int main(int argc, char **argv) > if (alloc_nodes()) > exit(1); > > - if (is_server) > - ret = run_server(); > + if (dst_addr) > + ret = run_client(); > else > - ret = run_client(argv[1], (argc == 3) ? 
argv[2] : NULL); > + ret = run_server(); > > printf("test complete\n"); > destroy_nodes(); > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Nov 1 14:30:10 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 00:30:10 +0200 Subject: [openib-general] question on QoS support In-Reply-To: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> Message-ID: <20061101223010.GF9985@sashak.voltaire.com> On 16:52 Wed 01 Nov , Oliver wrote: > Hi, folks - > > I am trying to verify and evaluate IB QoS support, running openSM as > subnet manager. The perftest program is extended to set SL as command > line options instead of default 0, and by modifying VL arbitration > tables, I am expecting to see the traffic shaping can actually take > place, but it did not. More details on configuration: > > in opensm.opts: > # QoS default options > qos_high_limit 255 # disable low priority table > qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 > (corresponding to SL 2) a higher weight 8 > qos_sl2vl 0,1,2,3,4, ... # no changes here > > I think (though not verified) the Voltaire HCA we are using can > support 8 data VLs. I don't have much more information to go on why > qos shaping is not taking place, any suggestions? You can verify actual port's parameters with smpquery (from diags), you will need to run to get QoS related parameters: smpquery portinfo ... smpquery vlarb ... smpquery sl2vl ... Sasha > A related question is, if I modify qos setting in SM, do I need to > restart SA on each hosts for it to see the changes? (I am hoping not, > as I tried in the test, it doesn't seem to make a difference) > > Thanks for help. 
> -- > Oliver From minich at ornl.gov Wed Nov 1 14:42:41 2006 From: minich at ornl.gov (Makia Minich) Date: Wed, 01 Nov 2006 17:42:41 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <20061101223010.GF9985@sashak.voltaire.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <20061101223010.GF9985@sashak.voltaire.com> Message-ID: <454922E1.1050002@ornl.gov> It just so happens that we've started looking at this here at ORNL as well. I had a question about the options. The manpage makes it seem that you can set these qos options (e.g. qos_high_limit) from the command line, but I haven't been overly successful. Is there an example of this being done? Or is changing the /var/cache/osm/opensm.opts file the preferred method of changing the options? Sasha Khapyorsky wrote: > On 16:52 Wed 01 Nov , Oliver wrote: >> Hi, folks - >> >> I am trying to verify and evaluate IB QoS support, running openSM as >> subnet manager. The perftest program is extended to set SL as command >> line options instead of default 0, and by modifying VL arbitration >> tables, I am expecting to see the traffic shaping can actually take >> place, but it did not. More details on configuration: >> >> in opensm.opts: >> # QoS default options >> qos_high_limit 255 # disable low priority table >> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 >> (corresponding to SL 2) a higher weight 8 >> qos_sl2vl 0,1,2,3,4, ... # no changes here >> >> I think (though not verified) the Voltaire HCA we are using can >> support 8 data VLs. I don't have much more information to go on why >> qos shaping is not taking place, any suggestions? 
> > You can verify actual port's parameters with smpquery (from diags), you > will need to run to get QoS related parameters: > > smpquery portinfo ... > smpquery vlarb ... > smpquery sl2vl ... > > Sasha > >> A related question is, if I modify qos setting in SM, do I need to >> restart SA on each hosts for it to see the changes? (I am hoping not, >> as I tried in the test, it doesn't seem to make a difference) >> >> Thanks for help. >> -- >> Oliver >> > > -- Makia Minich National Center for Computation Science Oak Ridge National Laboratory Phone: 865.574.7460 From minich at ornl.gov Wed Nov 1 14:46:26 2006 From: minich at ornl.gov (Makia Minich) Date: Wed, 01 Nov 2006 17:46:26 -0500 Subject: [openib-general] PCI-Express card loses connectivity on quad opteron In-Reply-To: <24204.62.140.137.30.1162419145.squirrel@mail.fundum.net> References: <24204.62.140.137.30.1162419145.squirrel@mail.fundum.net> Message-ID: <454923C2.305@ornl.gov> Do you see anything out of dmesg or any useful output? Also, what firmware levels are you running on the HCA? Is it all traffic that goes away (e.g. what happens if you attempt to run the ib_rdma_bw test)? And just to complete my list of questions, does the node itself go catatonic or is it still loginable? If no one else has a better idea, it might be helpful to also know the motherboard (as there might be some tricks that need to happen that are motherboard specific). 
patrick at fundum.net wrote: > Hi, > > I am testing the Mellanox Technologies MT25208 InfiniHost III Ex (Tavor > compatibility mode) (rev a0) PCI-Express card on RHEL 4 upd 4. The machine > is a quad opteron running on x86_64. I'm using the OFED1.1 package. > > After moving some data (a few Megs with netperf) over the IB network the > IB card loses network connectivity. Pinging the card itself still works at > that point. > > > Any thoughts? > > -- Makia Minich National Center for Computation Science Oak Ridge National Laboratory Phone: 865.574.7460 From ardavis at ichips.intel.com Wed Nov 1 14:55:16 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Wed, 01 Nov 2006 14:55:16 -0800 Subject: [openib-general] Verbs QP create with RQ=0? In-Reply-To: References: Message-ID: <454925D4.7050500@ichips.intel.com> Roland Dreier wrote: > >As was already suggested, you should be able to use the same CQ for >receives and for sends. If you never post any receives on the QP, you >don't have to allocate any extra space on your send CQ. And it should >work to have 0 receive work queue entries. Have you tried it? > > > > Yes, these settings work fine (recv_cq = req_cq, max_recv_wr = 0, max_recv_sge = 0). thanks, -arlin From boris at mellanox.com Wed Nov 1 14:55:45 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Wed, 1 Nov 2006 14:55:45 -0800 Subject: [openib-general] PCI-Express card loses connectivity on quad opteron Message-ID: <1E3DCD1C63492545881FACB6063A57C16E3ED8@mtiexch01.mti.com> You need to make sure the SM is up and running - it is crucial for IPoIB operation. My 2 cents. Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 
2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Makia Minich Sent: Wednesday, November 01, 2006 2:46 PM To: patrick at fundum.net Cc: openib-general at openib.org Subject: Re: [openib-general] PCI-Express card loses connectivity on quad opteron Do you see anything out of dmesg or any useful output? Also, what firmware levels are you running on the HCA? Is it all traffic that goes away (e.g. what happens if you attempt to run the ib_rdma_bw test)? And just to complete my list of questions, does the node itself go catatonic or is it still loginable? If no one else has a better idea, it might be helpful to also know the motherboard (as there might be some tricks that need to happen that are motherboard specific). patrick at fundum.net wrote: > Hi, > > I am testing the Mellanox Technologies MT25208 InfiniHost III Ex > (Tavor compatibility mode) (rev a0) PCI-Express card on RHEL 4 upd 4. > The machine is a quad opteron running on x86_64. I'm using the OFED1.1 package. > > After moving some data (a few Megs with netperf) over the IB network > the IB card loses network connectivity. Pinging the card itself still > works at that point. > > > Any thoughts? 
> > -- Makia Minich National Center for Computation Science Oak Ridge National Laboratory Phone: 865.574.7460 From rdreier at cisco.com Wed Nov 1 14:58:35 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 01 Nov 2006 14:58:35 -0800 Subject: [openib-general] [PATCH repost] IB/srp: destroy/recreate qp/cq at reconnect In-Reply-To: <20061031194357.GC5950@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 31 Oct 2006 21:43:57 +0200") References: <20061019195719.GB2674@mellanox.co.il> <20061031194357.GC5950@mellanox.co.il> Message-ID: > Roland, what do you think about this patch? > Seems like a good idea, to me. Sorry, I haven't made this a high priority. It seems a little like fiddling with the code just for the sake of fiddling -- why pick this one place to recreate a CQ? Why not ipoib, etc? From davem at davemloft.net Wed Nov 1 15:04:18 2006 From: davem at davemloft.net (David Miller) Date: Wed, 01 Nov 2006 15:04:18 -0800 (PST) Subject: [openib-general] Ordering between PCI config space writes and MMIO reads? In-Reply-To: <4548CAE7.8010300@sgi.com> References: <20061031204717.GG26964@parisc-linux.org> <4548CAE7.8010300@sgi.com> Message-ID: <20061101.150418.26278280.davem@davemloft.net> From: John Partridge Date: Wed, 01 Nov 2006 10:27:19 -0600 > Sorry, but I find this change a bit puzzling. The problem is > particular to the PPB on the HCA and not Altix. 
> I can't see anywhere
> that a PCI Config Write is required to block until completion, it is
> the driver and the HCA, not the Altix hardware that requires the
> Config Write to have completed before we leave mthca_reset()
> Changing pci_write_config_xxx() will change the behavior for ALL
> drivers and the possibility of breaking something else. The fix was
> very low risk in mthca_reset(), changing the PCI code to fix this is
> much more onerous.

The issue is that something as simple as:

	val = pci_read_config(REG);
	val |= bit;
	pci_write_config(REG, val);
	newval = pci_read_config(REG);
	BUG_ON(!(newval & bit));

is not guaranteed by PCI (apparently).

I see no valid reason why every PCI device driver should be troubled with this lunacy and the ordering should thus be ensured by the PCI layer.

It just so happens to take care of the original driver issue too :-)

From mshefty at ichips.intel.com Wed Nov 1 15:18:27 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 01 Nov 2006 15:18:27 -0800
Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race
In-Reply-To:
References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> <45487B6C.2070408@voltaire.com>
Message-ID: <45492B43.5010408@ichips.intel.com>

Roland Dreier wrote:
> I think this is actually a good point for the CM case at least.
> Clients already have something registered with the CM (namely the CM
> ID itself), so if we required all consumers to destroy their IDs
> explicitly, then there's no reason to add additional client
> registration.

The issue is more related to cm_id's that are created when a new connection request arrives. For the user to destroy the new id's, they either need to be able to queue them somewhere for later destruction, call destroy from the callback, or indicate that the id's should be destroyed when the callback returns.
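
The last of those options, where the callback's return value tells the CM to destroy the id after the callback has returned, can be sketched as a toy user-space model. Everything named toy_* below is hypothetical and exists only for illustration; it is not the ib_cm or rdma_cm API, and it assumes a single-threaded dispatcher with no locking.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: toy_id, toy_dispatch and friends are hypothetical names,
 * not part of ib_cm or rdma_cm. */
struct toy_id;
typedef int (*toy_handler)(struct toy_id *id, int event);

struct toy_id {
    toy_handler handler;
    int in_callback;        /* set while the handler is running */
};

static int toy_live_ids;    /* number of ids currently allocated */

static struct toy_id *toy_create_id(toy_handler handler)
{
    struct toy_id *id = calloc(1, sizeof(*id));

    assert(id != NULL);
    id->handler = handler;
    toy_live_ids++;
    return id;
}

static void toy_destroy_id(struct toy_id *id)
{
    /* In this model, destroying an id from inside its own handler is a bug. */
    assert(!id->in_callback);
    free(id);
    toy_live_ids--;
}

/* Deliver one event. If the handler returns nonzero, the dispatcher (not
 * the handler) destroys the id once the handler has returned.
 * Returns 1 if the id was destroyed, 0 if it is still valid. */
static int toy_dispatch(struct toy_id *id, int event)
{
    int ret;

    id->in_callback = 1;
    ret = id->handler(id, event);
    id->in_callback = 0;

    if (ret) {
        toy_destroy_id(id);
        return 1;
    }
    return 0;
}

/* Example handlers: one keeps its id, one asks for it to be destroyed. */
static int toy_keep(struct toy_id *id, int event)
{
    (void)id;
    (void)event;
    return 0;
}

static int toy_reject(struct toy_id *id, int event)
{
    (void)id;
    (void)event;
    return 1;
}
```

With this shape the consumer never has to queue a freshly created id anywhere just to get rid of it: a handler such as toy_reject returns nonzero, and the id is gone by the time toy_dispatch returns.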
- Sean

From bugzilla-daemon at openib.org Wed Nov 1 15:24:10 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Wed, 1 Nov 2006 15:24:10 -0800 (PST)
Subject: [openib-general] [Bug 266] IPoIB multicast does not work with RHEL4 U4
Message-ID: <20061101232410.293D52283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=266

------- Comment #3 from dledford at redhat.com 2006-11-01 15:24 -------
I've submitted the patch for this for internal review and possible inclusion in RHEL4 Update 5.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From sashak at voltaire.com Wed Nov 1 15:31:40 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 2 Nov 2006 01:31:40 +0200
Subject: [openib-general] question on QoS support
In-Reply-To: <454922E1.1050002@ornl.gov>
References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <20061101223010.GF9985@sashak.voltaire.com> <454922E1.1050002@ornl.gov>
Message-ID: <20061101233140.GG9985@sashak.voltaire.com>

On 17:42 Wed 01 Nov , Makia Minich wrote:
> It just so happens that we've started looking at this here at ORNL as
> well. I had a question about the options. The manpage makes it seem
> that you can set these qos options (e.g. qos_high_limit) from the
> command line,

AFAIK there is an option -Q which enables/disables QoS configuration; it does nothing with the particular qos_high_limit parameter. Configuration parameters (qos_max_vls, qos_high_limit, qos_vlarb_high, qos_vlarb_low and qos_sl2vl templates) should be specified in the opensm.opts file (or another OpenSM configuration file, which does not exist yet).

> but I haven't been overly successful. Is there an example
> of this being done? Or is changing the /var/cache/osm/opensm.opts file
> the preferred method of changing the options?

Yes, you need to specify QoS parameters in the opensm.opts file.
There is a readme file, osm/doc/qos-config.txt, which describes the details (I think the man page has a similar section too).

Ah, an important note: with OFED, QoS is disabled by default in OpenSM, so the -Q option should be used, which for OFED means --qos. OpenSM from trunk supports QoS configuration by default and the -Q option disables it (and means --no-qos); this can be confusing, I know.

Sasha

>
> Sasha Khapyorsky wrote:
> > On 16:52 Wed 01 Nov , Oliver wrote:
> >> Hi, folks -
> >>
> >> I am trying to verify and evaluate IB QoS support, running openSM as
> >> subnet manager. The perftest program is extended to set SL as command
> >> line options instead of default 0, and by modifying VL arbitration
> >> tables, I am expecting to see the traffic shaping can actually take
> >> place, but it did not. More details on configuration:
> >>
> >> in opensm.opts:
> >> # QoS default options
> >> qos_high_limit 255 # disable low priority table
> >> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2
> >> (corresponding to SL 2) a higher weight 8
> >> qos_sl2vl 0,1,2,3,4, ... # no changes here
> >>
> >> I think (though not verified) the Voltaire HCA we are using can
> >> support 8 data VLs. I don't have much more information to go on why
> >> qos shaping is not taking place, any suggestions?
> >
> > You can verify actual port's parameters with smpquery (from diags), you
> > will need to run to get QoS related parameters:
> >
> > smpquery portinfo ...
> > smpquery vlarb ...
> > smpquery sl2vl ...
> >
> > Sasha
> >
> >> A related question is, if I modify qos setting in SM, do I need to
> >> restart SA on each hosts for it to see the changes? (I am hoping not,
> >> as I tried in the test, it doesn't seem to make a difference)
> >>
> >> Thanks for help.
> >> --
> >> Oliver
> >>
> >
>
> --
> Makia Minich
> National Center for Computation Science
> Oak Ridge National Laboratory
> Phone: 865.574.7460

From mshefty at ichips.intel.com Wed Nov 1 15:29:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 01 Nov 2006 15:29:39 -0800
Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA
In-Reply-To: <1162419236.6366.50.camel@stevo-desktop>
References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <1162419236.6366.50.camel@stevo-desktop>
Message-ID: <45492DE3.1070102@ichips.intel.com>

> This patch removes rdma_get/set_option(). Is that what you intended?

Yes. I wanted to reconsider the approach here. I believe that there's a cleaner implementation for getting path records that involves a userspace SA library/daemon than going through the rdma cm. And no one was using the option to set a specific path.

For the CM timeout options, those were added to support uDAPL, but I believe that a better approach which would accomplish the higher level goal is to have the kernel rdma cm issue MRA (message received acknowledged) messages for clients which are slow to respond to requests.
- Sean

From mshefty at ichips.intel.com Wed Nov 1 15:40:46 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 01 Nov 2006 15:40:46 -0800
Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race
In-Reply-To: <45492B43.5010408@ichips.intel.com>
References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> <45487B6C.2070408@voltaire.com> <45492B43.5010408@ichips.intel.com>
Message-ID: <4549307E.9060200@ichips.intel.com>

>>I think this is actually a good point for the CM case at least.
>>Clients already have something registered with the CM (namely the CM
>>ID itself), so if we required all consumers to destroy their IDs
>>explicitly, then there's no reason to add additional client
>>registration.
>
> The issue is more related to cm_id's that are created when a new connection
> request arrives. For the user to destroy the new id's, they either need to be
> able to queue them somewhere for later destruction, call destroy from the
> callback, or indicate that the id's should be destroyed when the callback returns.

I should add that the point is taken though. If we only allow new cm_id's to be destroyed this way, then we avoid the issue. I _think_ that all users of the ib_cm and rdma_cm behave this way, but I need to verify this to be sure.

- Sean

From arlin.r.davis at intel.com Wed Nov 1 16:36:50 2006
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Wed, 1 Nov 2006 16:36:50 -0800
Subject: [openib-general] [PATCH 1/3] uDAPL cma: add support for new client register event
Message-ID: <000001c6fe16$fcaa9090$bb97070a@amr.corp.intel.com>

Added support for new ib verbs client register event. No extra processing required at the uDAPL level. Shows up if opensm bounces.
Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 9916) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -744,9 +744,16 @@ hca->async_un_ctx); break; } + case IBV_EVENT_CLIENT_REREGISTER: + /* no need to report this event this time */ + dapl_dbg_log (DAPL_DBG_TYPE_WARN, + " async_event: IBV_EVENT_CLIENT_REREGISTER\n"); + break; + default: dapl_dbg_log (DAPL_DBG_TYPE_WARN, - " async_event: UNKNOWN\n"); + " async_event: %d UNKNOWN\n", + event.event_type); break; } -------------- next part -------------- An HTML attachment was scrubbed... URL: From arlin.r.davis at intel.com Wed Nov 1 16:36:56 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 1 Nov 2006 16:36:56 -0800 Subject: [openib-general] [PATCH 2/3] uDAPL cma: fix issues with creating qp without rcv resources Message-ID: <000501c6fe17$00659cc0$bb97070a@amr.corp.intel.com> Fix some issues supporting create qp without recv cq handle or recv qp resources. IB verbs assume a recv_cq handle and uDAPL dapl_ep_create assumes there is always recv_sge resources specified. 
Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/common/dapl_ep_create.c =================================================================== --- dapl/common/dapl_ep_create.c (revision 9916) +++ dapl/common/dapl_ep_create.c (working copy) @@ -166,7 +166,7 @@ (recv_evd_handle != DAT_HANDLE_NULL && ep_attr->max_recv_dtos == 0) || (request_evd_handle == DAT_HANDLE_NULL && ep_attr->max_request_dtos != 0) || (request_evd_handle != DAT_HANDLE_NULL && ep_attr->max_request_dtos == 0) || - ep_attr->max_recv_iov == 0 || + (recv_evd_handle != DAT_HANDLE_NULL && ep_attr->max_recv_iov == 0) || ep_attr->max_request_iov == 0 || (DAT_SUCCESS != dapl_ep_check_recv_completion_flags ( ep_attr->recv_completion_flags)) )) Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 10032) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -143,13 +143,21 @@ /* Setup attributes and create qp */ dapl_os_memzero((void*)&qp_create, sizeof(qp_create)); qp_create.cap.max_send_wr = attr->max_request_dtos; - qp_create.cap.max_recv_wr = attr->max_recv_dtos; qp_create.cap.max_send_sge = attr->max_request_iov; - qp_create.cap.max_recv_sge = attr->max_recv_iov; qp_create.cap.max_inline_data = ia_ptr->hca_ptr->ib_trans.max_inline_send; qp_create.send_cq = req_cq; - qp_create.recv_cq = rcv_cq; + + /* ibv assumes rcv_cq is never NULL, set to req_cq */ + if (rcv_cq == NULL) { + qp_create.recv_cq = req_cq; + qp_create.cap.max_recv_wr = 0; + qp_create.cap.max_recv_sge = 0; + } else { + qp_create.recv_cq = rcv_cq; + qp_create.cap.max_recv_wr = attr->max_recv_dtos; + qp_create.cap.max_recv_sge = attr->max_recv_iov; + } qp_create.qp_type = IBV_QPT_RC; qp_create.qp_context = (void*)ep_ptr; -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arlin.r.davis at intel.com Wed Nov 1 16:37:39 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Wed, 1 Nov 2006 16:37:39 -0800 Subject: [openib-general] [PATCH 3/3] uDAPL cma: add support for address and route retries, call disconnect when recving dreq Message-ID: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> Fix some timeout and long disconnect delay issues discovered during scale-out testing. Added support to retry rdma_cm address and route resolution with configuration options and provide a disconnect call when receiving the disconnect request to force an immediate disconnect reply to the remote side. Here are the new options (environment variables) with the default setting DAPL_CM_ARP_TIMEOUT_MS 4000 DAPL_CM_ARP_RETRY_COUNT 15 DAPL_CM_ROUTE_TIMEOUT_MS 4000 DAPL_CM_ROUTE_RETRY_COUNT 15 Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 9916) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -58,6 +58,9 @@ #include "dapl_ib_util.h" #include #include +#include +#include +#include #include extern struct rdma_event_channel *g_cm_events; @@ -99,8 +102,8 @@ &ipaddr->src_addr)->sin_addr.s_addr), ntohl(((struct sockaddr_in *) &ipaddr->dst_addr)->sin_addr.s_addr)); - - ret = rdma_resolve_route(conn->cm_id, 2000); + + ret = rdma_resolve_route(conn->cm_id, conn->route_timeout); if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect failed: %s\n",strerror(errno)); @@ -120,6 +123,7 @@ struct rdma_addr *ipaddr = &conn->cm_id->route.addr; struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; #endif + dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: cm_id %p SRC %x DST %x PORT %d\n", conn->cm_id, @@ -381,6 +385,7 @@ break; case RDMA_CM_EVENT_DISCONNECTED: + rdma_disconnect(conn->cm_id); /* force the DREP */ /* validate EP handle */ if (!DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) 
dapl_evd_connection_callback(conn, @@ -494,6 +499,7 @@ break; case RDMA_CM_EVENT_DISCONNECTED: + rdma_disconnect(conn->cm_id); /* force the DREP */ /* validate SP handle context */ if (!DAPL_BAD_HANDLE(conn->sp, DAPL_MAGIC_PSP) || !DAPL_BAD_HANDLE(conn->sp, DAPL_MAGIC_RSP)) @@ -543,7 +549,8 @@ IN void *p_data) { struct dapl_ep *ep_ptr = ep_handle; - + struct dapl_cm_id *conn; + /* Sanity check */ if (NULL == ep_ptr) return DAT_SUCCESS; @@ -552,36 +559,38 @@ r_qual,p_data,p_size); /* rdma conn and cm_id pre-bound; reference via qp_handle */ - ep_ptr->cm_handle = ep_ptr->qp_handle; + conn = ep_ptr->cm_handle = ep_ptr->qp_handle; /* Setup QP/CM parameters and private data in cm_id */ - (void)dapl_os_memzero(&ep_ptr->cm_handle->params, - sizeof(ep_ptr->cm_handle->params)); - ep_ptr->cm_handle->params.responder_resources = IB_TARGET_MAX; - ep_ptr->cm_handle->params.initiator_depth = IB_INITIATOR_DEPTH; - ep_ptr->cm_handle->params.flow_control = 1; - ep_ptr->cm_handle->params.rnr_retry_count = IB_RNR_RETRY_COUNT; - ep_ptr->cm_handle->params.retry_count = IB_RC_RETRY_COUNT; + (void)dapl_os_memzero(&conn->params, sizeof(conn->params)); + conn->params.responder_resources = IB_TARGET_MAX; + conn->params.initiator_depth = IB_INITIATOR_DEPTH; + conn->params.flow_control = 1; + conn->params.rnr_retry_count = IB_RNR_RETRY_COUNT; + conn->params.retry_count = IB_RC_RETRY_COUNT; if (p_size) { - dapl_os_memcpy(ep_ptr->cm_handle->p_data, p_data, p_size); - ep_ptr->cm_handle->params.private_data = - ep_ptr->cm_handle->p_data; - ep_ptr->cm_handle->params.private_data_len = p_size; + dapl_os_memcpy(conn->p_data, p_data, p_size); + conn->params.private_data = conn->p_data; + conn->params.private_data_len = p_size; } + /* copy in remote address, need a copy for retry attempts */ + dapl_os_memcpy(&conn->r_addr, r_addr, sizeof(*r_addr)); + /* Resolve remote address, src already bound during QP create */ - ((struct sockaddr_in*)r_addr)->sin_port = htons(MAKE_PORT(r_qual)); - if 
(rdma_resolve_addr(ep_ptr->cm_handle->cm_id, - NULL, (struct sockaddr *)r_addr, 2000)) + ((struct sockaddr_in*)&conn->r_addr)->sin_port = htons(MAKE_PORT(r_qual)); + ((struct sockaddr_in*)&conn->r_addr)->sin_family = AF_INET; + + if (rdma_resolve_addr(conn->cm_id, NULL, + (struct sockaddr *)&conn->r_addr, + conn->arp_timeout)) return dapl_convert_errno(errno,"ib_connect"); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " connect: resolve_addr: cm_id %p SRC %x DST %x port %d\n", - ep_ptr->cm_handle->cm_id, - ntohl(((struct sockaddr_in *) - &ep_ptr->cm_handle->hca->hca_address)->sin_addr.s_addr), - ntohl(((struct sockaddr_in *)r_addr)->sin_addr.s_addr), - MAKE_PORT(r_qual) ); + " connect: resolve_addr: cm_id %p -> %s port %d\n", + conn->cm_id, + inet_ntoa(((struct sockaddr_in *)&conn->r_addr)->sin_addr), + ((struct sockaddr_in*)&conn->r_addr)->sin_port ); return DAT_SUCCESS; } @@ -1163,15 +1172,58 @@ case RDMA_CM_EVENT_ADDR_RESOLVED: dapli_addr_resolve(conn); break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: dapli_route_resolve(conn); break; + case RDMA_CM_EVENT_ADDR_ERROR: + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " CM ADDR ERROR: -> %s retry (%d)..\n", + inet_ntoa(((struct sockaddr_in *) + &conn->r_addr)->sin_addr), + conn->arp_retries); + + /* retry address resolution */ + if (--conn->arp_retries) { + int ret; + ret = rdma_resolve_addr( + conn->cm_id, NULL, + (struct sockaddr *)&conn->r_addr, + conn->arp_timeout); + if (!ret) + break; + else { + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " ERROR: rdma_resolve_addr = " + "%d %s\n", + ret,strerror(errno)); + } + } + /* retries exhausted or resolve_addr failed */ + dapl_evd_connection_callback( + conn, IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->ep); + break; + + case RDMA_CM_EVENT_ROUTE_ERROR: - dapl_evd_connection_callback(conn, - IB_CME_DESTINATION_UNREACHABLE, - NULL, conn->ep); + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " CM ROUTE ERROR: -> %s retry (%d)..\n", + inet_ntoa(((struct sockaddr_in *) + &conn->r_addr)->sin_addr), + conn->route_retries 
); + + /* retry route resolution */ + if (--conn->route_retries) + dapli_addr_resolve(conn); + else + dapl_evd_connection_callback( conn, + IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->ep); break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: dapl_evd_connection_callback(conn, IB_CME_LOCAL_FAILURE, Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 10032) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -160,6 +168,17 @@ conn->cm_id = cm_id; conn->ep = ep_ptr; conn->hca = ia_ptr->hca_ptr; + + /* setup timers for address and route resolution */ + conn->arp_timeout = dapl_os_get_env_val("DAPL_CM_ARP_TIMEOUT_MS", + IB_ARP_TIMEOUT); + conn->arp_retries = dapl_os_get_env_val("DAPL_CM_ARP_RETRY_COUNT", + IB_ARP_RETRY_COUNT); + conn->route_timeout = dapl_os_get_env_val("DAPL_CM_ROUTE_TIMEOUT_MS", + IB_ROUTE_TIMEOUT); + conn->route_retries = dapl_os_get_env_val("DAPL_CM_ROUTE_RETRY_COUNT", + IB_ROUTE_RETRY_COUNT); + ep_ptr->qp_handle = conn; ep_ptr->qp_state = IB_QP_STATE_INIT; Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 9916) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -67,8 +67,12 @@ #define IB_RC_RETRY_COUNT 7 #define IB_RNR_RETRY_COUNT 7 -#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ -#define IB_CM_RETRIES 15 +#define IB_CM_RESPONSE_TIMEOUT 23 /* 16 sec */ +#define IB_CM_RETRIES 15 /* 240 sec total default */ +#define IB_ARP_TIMEOUT 4000 /* 4 sec */ +#define IB_ARP_RETRY_COUNT 15 /* 60 sec total */ +#define IB_ROUTE_TIMEOUT 4000 /* 4 sec */ +#define IB_ROUTE_RETRY_COUNT 15 /* 60 sec total */ #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ #define IB_MAX_AT_RETRY 3 #define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ @@ -177,12 +181,17 @@ struct dapl_cm_id { DAPL_OS_LOCK lock; int destroy; + int arp_retries; + int arp_timeout; + int route_retries; + int 
route_timeout;
 	int in_callback;
 	struct rdma_cm_id *cm_id;
 	struct dapl_hca *hca;
 	struct dapl_sp *sp;
 	struct dapl_ep *ep;
 	struct rdma_conn_param params;
+	DAT_SOCK_ADDR6 r_addr;
 	int p_len;
 	unsigned char p_data[IB_MAX_DREP_PDATA_SIZE];
 };

From johnip at sgi.com Wed Nov 1 17:08:37 2006
From: johnip at sgi.com (John Partridge)
Date: Wed, 01 Nov 2006 19:08:37 -0600
Subject: [openib-general] Ordering between PCI config space writes and MMIO reads?
In-Reply-To: <20061101.150418.26278280.davem@davemloft.net>
References: <20061031204717.GG26964@parisc-linux.org> <4548CAE7.8010300@sgi.com> <20061101.150418.26278280.davem@davemloft.net>
Message-ID: <45494515.8050304@sgi.com>

David Miller wrote:
> From: John Partridge
> Date: Wed, 01 Nov 2006 10:27:19 -0600
>
>
>>Sorry, but I find this change a bit puzzling. The problem is
>>particular to the PPB on the HCA and not Altix. I can't see anywhere
>>that a PCI Config Write is required to block until completion, it is
>>the driver and the HCA, not the Altix hardware that requires the
>>Config Write to have completed before we leave mthca_reset()
>>Changing pci_write_config_xxx() will change the behavior for ALL
>>drivers and the possibility of breaking something else. The fix was
>>very low risk in mthca_reset(), changing the PCI code to fix this is
>>much more onerous.
>
>
> The issue is that something as simple as:
>
> 	val = pci_read_config(REG);
> 	val |= bit;
> 	pci_write_config(REG, val);
> 	newval = pci_read_config(REG);
> 	BUG_ON(!(newval & bit));
>
> is not guaranteed by PCI (apparently).
>
> I see no valid reason why every PCI device driver should
> be troubled with this lunacy and the ordering should thus
> be ensured by the PCI layer.
>
> It just so happens to take care of the original driver
> issue too :-)

Yeah, Matthew has convinced me of that now.
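
The argument quoted above, that the ordering belongs in the PCI layer rather than in each driver, can be illustrated with a small user-space toy. This is only a sketch under loud assumptions: struct toy_dev and the toy_* functions are hypothetical, none of this is the kernel PCI API, and the one-entry "pending write" slot is just a stand-in for a platform where the write call can return before the device has seen the value.

```c
#include <assert.h>
#include <stdint.h>

/* Toy device, user space only. A raw config write is parked in a pending
 * slot until toy_complete_writes() runs, which models a write call that
 * returns before the write has actually landed. */
struct toy_dev {
    uint32_t reg;       /* value the device currently holds */
    uint32_t pending;   /* written value not yet committed */
    int has_pending;
};

static uint32_t toy_read_config(const struct toy_dev *dev)
{
    return dev->reg;
}

/* Raw write: may return before the device has seen the value. */
static void toy_write_config_raw(struct toy_dev *dev, uint32_t val)
{
    dev->pending = val;
    dev->has_pending = 1;
}

/* Commit any parked write to the device register. */
static void toy_complete_writes(struct toy_dev *dev)
{
    if (dev->has_pending) {
        dev->reg = dev->pending;
        dev->has_pending = 0;
    }
}

/* "Layer" write: does not return until the write has completed, so every
 * caller gets the ordering without a per-driver workaround. */
static void toy_write_config(struct toy_dev *dev, uint32_t val)
{
    toy_write_config_raw(dev, val);
    toy_complete_writes(dev);
}

/* The read-modify-write-read sequence from the mail, using the raw write.
 * Returns nonzero if the read-back shows the bit set. */
static int toy_set_bit_raw(struct toy_dev *dev, uint32_t bit)
{
    uint32_t val = toy_read_config(dev);

    val |= bit;
    toy_write_config_raw(dev, val);
    return (toy_read_config(dev) & bit) != 0;
}

/* The same sequence, using the layer write that waits for completion. */
static int toy_set_bit(struct toy_dev *dev, uint32_t bit)
{
    uint32_t val = toy_read_config(dev);

    val |= bit;
    toy_write_config(dev, val);
    return (toy_read_config(dev) & bit) != 0;
}
```

In this model toy_set_bit_raw() fails to see its own write on read-back, while toy_set_bit() sees it, with the driver-side sequence unchanged in both cases; only the write primitive differs.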
Thanks

--
John Partridge

Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com

From venkatesh.babu at 3leafnetworks.com Wed Nov 1 17:51:36 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Wed, 01 Nov 2006 17:51:36 -0800
Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support
In-Reply-To: <000101c6fd4c$47cc7000$ff0da8c0@amr.corp.intel.com>
References: <000101c6fd4c$47cc7000$ff0da8c0@amr.corp.intel.com>
Message-ID: <45494F28.2060008@3leafnetworks.com>

Are these changes meant to replace the ib_cm_init_rearm_attr() interface?

I tested the following changes without using the function ib_cm_init_rearm_attr(), just setting path_mig_state to IB_MIG_REARM and calling ib_modify_qp() to rearm it. The path migration from primary to alternate succeeded, then I reloaded the alternate path. Then I removed the cable of the new primary path and it failed with IB_WC_RETRY_EXC_ERR. But I got the event IB_EVENT_PATH_MIG.

With ib_cm_init_rearm_attr() being called, failover/failback worked fine.

VBabu

Sean Hefty wrote:
>The following patch attempts to fix issues in the ib_cm regarding support
>for path migration. The fixes are mainly on feedback from Venkatesh.
>The patch has NOT been tested to verify that APM works correctly, but I did
>check that it didn't break anything. I need to develop a test program to
>verify that APM works.
>
>I'd like to get feedback to this approach. For the most part, it makes
>use of the existing interfaces where possible to limit changes to the
>userspace library. More specifically:
>
>The ib_cm_establish() call is replaced with a more generic ib_cm_notify().
>This routine is used to notify the CM that failover has occurred, so that
>future CM messages (LAP, DREQ) reach the remote CM.
>
>New alternate path information is captured when a LAP message is sent or
>received. This allows QP attributes to be initialized for the user
>when loading a new path after failover has occurred.
> >Signed-off-by: Sean Hefty >--- >Venkatesh / anyone else: it would be helpful if someone could try porting >their application to this interface, and let me know if it works. I'm >working on a test program for this, but it will take a few days to create >it. > >diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >index 1cf0d42..c4e9bb5 100644 >--- a/drivers/infiniband/core/cm.c >+++ b/drivers/infiniband/core/cm.c >@@ -152,7 +152,6 @@ struct cm_id_private { > u8 peer_to_peer; > u8 responder_resources; > u8 initiator_depth; >- u8 local_ack_timeout; > u8 retry_count; > u8 rnr_retry_count; > u8 service_timeout; >@@ -691,7 +690,7 @@ static void cm_enter_timewait(struct cm_ > * timewait before notifying the user that we've exited timewait. > */ > cm_id_priv->id.state = IB_CM_TIMEWAIT; >- wait_time = cm_convert_to_ms(cm_id_priv->local_ack_timeout); >+ wait_time = cm_convert_to_ms(cm_id_priv->av.packet_life_time + 1); > queue_delayed_work(cm.wq, &cm_id_priv->timewait_info->work.work, > msecs_to_jiffies(wait_time)); > cm_id_priv->timewait_info = NULL; >@@ -1024,8 +1023,6 @@ int ib_send_cm_req(struct ib_cm_id *cm_i > > cm_id_priv->local_qpn = cm_req_get_local_qpn(req_msg); > cm_id_priv->rq_psn = cm_req_get_starting_psn(req_msg); >- cm_id_priv->local_ack_timeout = >- cm_req_get_primary_local_ack_timeout(req_msg); > > spin_lock_irqsave(&cm_id_priv->lock, flags); > ret = ib_post_send_mad(cm_id_priv->msg, NULL); >@@ -1411,8 +1408,6 @@ static int cm_req_handler(struct cm_work > cm_id_priv->responder_resources = cm_req_get_init_depth(req_msg); > cm_id_priv->path_mtu = cm_req_get_path_mtu(req_msg); > cm_id_priv->sq_psn = cm_req_get_starting_psn(req_msg); >- cm_id_priv->local_ack_timeout = >- cm_req_get_primary_local_ack_timeout(req_msg); > cm_id_priv->retry_count = cm_req_get_retry_count(req_msg); > cm_id_priv->rnr_retry_count = cm_req_get_rnr_retry_count(req_msg); > cm_id_priv->qp_type = cm_req_get_qp_type(req_msg); >@@ -1716,7 +1711,7 @@ static int 
cm_establish_handler(struct c > unsigned long flags; > int ret; > >- /* See comment in ib_cm_establish about lookup. */ >+ /* See comment in cm_establish about lookup. */ > cm_id_priv = cm_acquire_id(work->local_id, work->remote_id); > if (!cm_id_priv) > return -EINVAL; >@@ -2402,11 +2397,16 @@ int ib_send_cm_lap(struct ib_cm_id *cm_i > cm_id_priv = container_of(cm_id, struct cm_id_private, id); > spin_lock_irqsave(&cm_id_priv->lock, flags); > if (cm_id->state != IB_CM_ESTABLISHED || >- cm_id->lap_state != IB_CM_LAP_IDLE) { >+ (cm_id->lap_state != IB_CM_LAP_UNINIT && >+ cm_id->lap_state != IB_CM_LAP_IDLE)) { > ret = -EINVAL; > goto out; > } > >+ ret = cm_init_av_by_path(alternate_path, &cm_id_priv->alt_av); >+ if (ret) >+ goto out; >+ > ret = cm_alloc_msg(cm_id_priv, &msg); > if (ret) > goto out; >@@ -2480,6 +2480,7 @@ static int cm_lap_handler(struct cm_work > goto unlock; > > switch (cm_id_priv->id.lap_state) { >+ case IB_CM_LAP_UNINIT: > case IB_CM_LAP_IDLE: > break; > case IB_CM_MRA_LAP_SENT: >@@ -2502,6 +2503,10 @@ static int cm_lap_handler(struct cm_work > > cm_id_priv->id.lap_state = IB_CM_LAP_RCVD; > cm_id_priv->tid = lap_msg->hdr.tid; >+ cm_init_av_for_response(work->port, work->mad_recv_wc->wc, >+ work->mad_recv_wc->recv_buf.grh, >+ &cm_id_priv->av); >+ cm_init_av_by_path(param->alternate_path, &cm_id_priv->alt_av); > ret = atomic_inc_and_test(&cm_id_priv->work_count); > if (!ret) > list_add_tail(&work->list, &cm_id_priv->work_list); >@@ -3040,7 +3045,7 @@ static void cm_work_handler(void *data) > cm_free_work(work); > } > >-int ib_cm_establish(struct ib_cm_id *cm_id) >+static int cm_establish(struct ib_cm_id *cm_id) > { > struct cm_id_private *cm_id_priv; > struct cm_work *work; >@@ -3088,7 +3093,43 @@ int ib_cm_establish(struct ib_cm_id *cm_ > out: > return ret; > } >-EXPORT_SYMBOL(ib_cm_establish); >+ >+static int cm_migrate(struct ib_cm_id *cm_id) >+{ >+ struct cm_id_private *cm_id_priv; >+ unsigned long flags; >+ int ret = 0; >+ >+ cm_id_priv = 
container_of(cm_id, struct cm_id_private, id); >+ spin_lock_irqsave(&cm_id_priv->lock, flags); >+ if (cm_id->state == IB_CM_ESTABLISHED && >+ (cm_id->lap_state == IB_CM_LAP_UNINIT || >+ cm_id->lap_state == IB_CM_LAP_IDLE)) >+ cm_id_priv->av = cm_id_priv->alt_av; >+ else >+ ret = -EINVAL; >+ spin_unlock_irqrestore(&cm_id_priv->lock, flags); >+ >+ return ret; >+} >+ >+int ib_cm_notify(struct ib_cm_id *cm_id, enum ib_event_type event) >+{ >+ int ret; >+ >+ switch (event) { >+ case IB_EVENT_COMM_EST: >+ ret = cm_establish(cm_id); >+ break; >+ case IB_EVENT_PATH_MIG: >+ ret = cm_migrate(cm_id); >+ break; >+ default: >+ ret = -EINVAL; >+ } >+ return ret; >+} >+EXPORT_SYMBOL(ib_cm_notify); > > static void cm_recv_handler(struct ib_mad_agent *mad_agent, > struct ib_mad_recv_wc *mad_recv_wc) >@@ -3221,6 +3262,9 @@ static int cm_init_qp_rtr_attr(struct cm > if (cm_id_priv->alt_av.ah_attr.dlid) { > *qp_attr_mask |= IB_QP_ALT_PATH; > qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; >+ qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index; >+ qp_attr->alt_timeout = >+ cm_id_priv->alt_av.packet_life_time + 1; > qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; > } > ret = 0; >@@ -3247,19 +3291,31 @@ static int cm_init_qp_rts_attr(struct cm > case IB_CM_REP_SENT: > case IB_CM_MRA_REP_RCVD: > case IB_CM_ESTABLISHED: >- *qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; >- qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn); >- if (cm_id_priv->qp_type == IB_QPT_RC) { >- *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT | >- IB_QP_RNR_RETRY | >- IB_QP_MAX_QP_RD_ATOMIC; >- qp_attr->timeout = cm_id_priv->local_ack_timeout; >- qp_attr->retry_cnt = cm_id_priv->retry_count; >- qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; >- qp_attr->max_rd_atomic = cm_id_priv->initiator_depth; >- } >- if (cm_id_priv->alt_av.ah_attr.dlid) { >- *qp_attr_mask |= IB_QP_PATH_MIG_STATE; >+ if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) { >+ *qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; >+ 
qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn); >+ if (cm_id_priv->qp_type == IB_QPT_RC) { >+ *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT | >+ IB_QP_RNR_RETRY | >+ IB_QP_MAX_QP_RD_ATOMIC; >+ qp_attr->timeout = >+ cm_id_priv->av.packet_life_time + 1; >+ qp_attr->retry_cnt = cm_id_priv->retry_count; >+ qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; >+ qp_attr->max_rd_atomic = >+ cm_id_priv->initiator_depth; >+ } >+ if (cm_id_priv->alt_av.ah_attr.dlid) { >+ *qp_attr_mask |= IB_QP_PATH_MIG_STATE; >+ qp_attr->path_mig_state = IB_MIG_REARM; >+ } >+ } else { >+ *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE; >+ qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; >+ qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index; >+ qp_attr->alt_timeout = >+ cm_id_priv->alt_av.packet_life_time + 1; >+ qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; > qp_attr->path_mig_state = IB_MIG_REARM; > } > ret = 0; >diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c >index ad4f4d5..e04f662 100644 >--- a/drivers/infiniband/core/ucm.c >+++ b/drivers/infiniband/core/ucm.c >@@ -685,11 +685,11 @@ out: > return result; > } > >-static ssize_t ib_ucm_establish(struct ib_ucm_file *file, >- const char __user *inbuf, >- int in_len, int out_len) >+static ssize_t ib_ucm_notify(struct ib_ucm_file *file, >+ const char __user *inbuf, >+ int in_len, int out_len) > { >- struct ib_ucm_establish cmd; >+ struct ib_ucm_notify cmd; > struct ib_ucm_context *ctx; > int result; > >@@ -700,7 +700,7 @@ static ssize_t ib_ucm_establish(struct i > if (IS_ERR(ctx)) > return PTR_ERR(ctx); > >- result = ib_cm_establish(ctx->cm_id); >+ result = ib_cm_notify(ctx->cm_id, (enum ib_event_type) cmd.event); > ib_ucm_ctx_put(ctx); > return result; > } >@@ -1107,7 +1107,7 @@ static ssize_t (*ucm_cmd_table[])(struct > [IB_USER_CM_CMD_DESTROY_ID] = ib_ucm_destroy_id, > [IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id, > [IB_USER_CM_CMD_LISTEN] = ib_ucm_listen, >- [IB_USER_CM_CMD_ESTABLISH] = 
ib_ucm_establish, >+ [IB_USER_CM_CMD_NOTIFY] = ib_ucm_notify, > [IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req, > [IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep, > [IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu, >diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h >index c9b4738..5c07017 100644 >--- a/include/rdma/ib_cm.h >+++ b/include/rdma/ib_cm.h >@@ -60,6 +60,7 @@ enum ib_cm_state { > }; > > enum ib_cm_lap_state { >+ IB_CM_LAP_UNINIT, > IB_CM_LAP_IDLE, > IB_CM_LAP_SENT, > IB_CM_LAP_RCVD, >@@ -443,13 +444,20 @@ int ib_send_cm_drep(struct ib_cm_id *cm_ > u8 private_data_len); > > /** >- * ib_cm_establish - Forces a connection state to established. >+ * ib_cm_notify - Notifies the CM of an event reported to the consumer. > * @cm_id: Connection identifier to transition to established. >+ * @event: Type of event. > * >- * This routine should be invoked by users who receive messages on a >- * connected QP before an RTU has been received. >+ * This routine should be invoked by users to notify the CM of relevant >+ * communication events. Events that should be reported to the CM and >+ * when to report them are: >+ * >+ * IB_EVENT_COMM_EST - Used when a message is received on a connected >+ * QP before an RTU has been received. >+ * IB_EVENT_PATH_MIG - Notifies the CM that the connection has failed over >+ * to the alternate path. 
> */ >-int ib_cm_establish(struct ib_cm_id *cm_id); >+int ib_cm_notify(struct ib_cm_id *cm_id, enum ib_event_type event); > > /** > * ib_send_cm_rej - Sends a connection rejection message to the >diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h >old mode 100644 >new mode 100755 >index 066c20b..37650af >--- a/include/rdma/ib_user_cm.h >+++ b/include/rdma/ib_user_cm.h >@@ -38,7 +38,7 @@ #define IB_USER_CM_H > > #include > >-#define IB_USER_CM_ABI_VERSION 4 >+#define IB_USER_CM_ABI_VERSION 5 > > enum { > IB_USER_CM_CMD_CREATE_ID, >@@ -46,7 +46,7 @@ enum { > IB_USER_CM_CMD_ATTR_ID, > > IB_USER_CM_CMD_LISTEN, >- IB_USER_CM_CMD_ESTABLISH, >+ IB_USER_CM_CMD_NOTIFY, > > IB_USER_CM_CMD_SEND_REQ, > IB_USER_CM_CMD_SEND_REP, >@@ -117,8 +117,9 @@ struct ib_ucm_listen { > __u32 reserved; > }; > >-struct ib_ucm_establish { >+struct ib_ucm_notify { > __u32 id; >+ __u32 event; > }; > > struct ib_ucm_private_data { > > > From sato-tomoaki at jp.fujitsu.com Wed Nov 1 17:33:17 2006 From: sato-tomoaki at jp.fujitsu.com (Tomoaki Sato) Date: Thu, 2 Nov 2006 10:33:17 +0900 Subject: [openib-general] Mellanox SRP target implementation In-Reply-To: <44E9649D.7070603@mellanox.co.il> Message-ID: <072901c6fe1e$df4f54b0$0c00460a@pato> Hi Can anybody tell me about the mellanox "SRP target" implementation code which is included in MTD2000 with NFS-RDMA server ? Is this gen2 base ? http://www.mellanox.com/news/press_releases/pr_103106.php Thanks, Tomo Sato From jeremy at sgi.com Wed Nov 1 19:05:11 2006 From: jeremy at sgi.com (Jeremy Higdon) Date: Wed, 1 Nov 2006 19:05:11 -0800 Subject: [openib-general] Ordering between PCI config space writes and MMIO reads? 
In-Reply-To: <20061024232755.GA26521@sgi.com>
References: <20061024192210.GE2043@havoc.gtf.org> <20061024214724.GS25210@parisc-linux.org> <20061024223631.GT25210@parisc-linux.org> <20061024232755.GA26521@sgi.com>
Message-ID: <20061102030511.GS150820@sgi.com>

On Tue, Oct 24, 2006 at 06:27:55PM -0500, Jack Steiner wrote:
> On Tue, Oct 24, 2006 at 04:36:32PM -0600, Matthew Wilcox wrote:
> > On Tue, Oct 24, 2006 at 02:51:30PM -0700, Roland Dreier wrote:
> > > > I think the right way to fix this is to ensure mmio write ordering in
> > > > the pci_write_config_*() implementations. Like this.
> > >
> > > I'm happy to fix this in the PCI core and not force drivers to worry
> > > about this.
> > >
> > > John, can you confirm that this patch fixes the issue for you?
> >
> > Hang on. I wasn't thinking clearly. mmiowb() only ensures the write
> > has got as far as the shub.
>
> I think mmiowb() should work on SN hardware. mmiowb() delays until shub
> reports that all previously issued PIO writes have completed.
>
> The processor "mf.a" guarantees "platform acceptance" which on SN means
> that shub has accepted the write - not that it has actually completed (or
> even forwarded anywhere by shub). That makes "mf.a" more-or-less useless
> on SN. However, shub has an additional MMR register (PIO_WRITE_COUNT) that
> counts actual outstanding PIOs. mmiowb() delays until that count goes to
> zero.
>
> I'll check if there is any additional reordering that can occur AFTER the
> PIO_WRITE_COUNT goes to zero. If so, it would be at bus level - not in
> shub or routers.

As I understand it, the mmiowb on the shub waits only for the PIO write
to be accepted by the destination node (shub or tio) that the I/O device
is attached to, thus guaranteeing that no reordering will happen within
the NL. If the PPB can reorder the write, then mmiowb is not sufficient.
You'd have to do a readback from a chip register (assuming you can trust
the PPB not to reorder reads and writes), or some other work around I
haven't thought of.

jeremy

From sean.hefty at intel.com Wed Nov 1 19:49:37 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 1 Nov 2006 19:49:37 -0800
Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support
In-Reply-To: <45494F28.2060008@3leafnetworks.com>
Message-ID: <000001c6fe31$ebc6e150$e0d8180a@amr.corp.intel.com>

>Are these changes to replace ib_cm_init_rearm_attr() interface ?

Yes - you use ib_cm_init_qp_attr() to get the qp_attr after loading a
new alternate path. The new path is loaded using ib_send_cm_lap().

So, after a path fails:

One side calls ib_send_cm_lap() to propose a new alternate path.
Second side responds by calling ib_send_cm_apr().
Both sides call ib_cm_init_qp_attr(), then ib_modify_qp() to load the new path.

This is intended to work if failover has occurred, or if the user detects
that the alternate path is down and wants to replace it.

There is an additional call, ib_cm_notify(), which is used to let the CM
know that the primary path has failed, and the alternate path should be
used when sending future CM messages. In case of failover, this needs to
be called before calling ib_send_cm_lap() to ensure that the LAP message
reaches the remote user.

>The path migration from Primary to Alternate succeeded, then reloaded
>the alternate path.

How did you reload the alternate path?

>failed with the IB_WC_RETRY_EXC_ERR. But I got the event IB_EVENT_PATH_MIG.
>
>With the ib_cm_init_rearm_attr() being called, failover/failback worked
>fine.

Were you calling ib_send_cm_lap() to load a new alternate path, or just
assuming that the old path would work after failover occurred?

- Sean

From mst at mellanox.co.il Thu Nov 2 00:14:53 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Thu, 2 Nov 2006 10:14:53 +0200
Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race
In-Reply-To: <4549307E.9060200@ichips.intel.com>
References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> <45487B6C.2070408@voltaire.com> <45492B43.5010408@ichips.intel.com> <4549307E.9060200@ichips.intel.com>
Message-ID: <20061102081453.GA7247@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: Re: [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race
>
> >>I think this is actually a good point for the CM case at least.
> >>Clients already have something registered with the CM (namely the CM
> >>ID itself), so if we required all consumers to destroy their IDs
> >>explicitly, then there's no reason to add additional client
> >>registration.
> >
> > The issue is more related to cm_id's that are created when a new connection
> > request arrives. For the user to destroy the new id's, they either need to be
> > able to queue them somewhere for later destruction, call destroy from the
> > callback, or indicate that the id's should be destroyed when the callback returns.
>
> I should add that the point is taken though. If we only allow new cm_id's to be
> destroyed this way, then we avoid the issue.
>
> I _think_ that all users of the ib_cm and rdma_cm behave this way, but I need to
> verify this to be sure.

All active side users are fine I think. But any client on the passive
side currently might destroy the new ID by returning error from the
callback, and I like this interface since it frees the resources
immediately. Since all such passive side users currently are out of
tree, I don't think it's urgent for us to do anything about the passive
side race - but please do not at least break code that uses passive
side in major ways just yet.
Once there are in-tree passive side users, I think registration at module
load/unload time would be the best approach.

-- 
MST

From mst at mellanox.co.il Thu Nov 2 00:19:41 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Nov 2006 10:19:41 +0200
Subject: [openib-general] Static linking with libibverbs
In-Reply-To:
References:
Message-ID: <20061102081941.GA7468@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [openib-general] Static linking with libibverbs
>
> > static linking actually can be made to work even with older library versions.
> > See this HowTo (written on 02 of November, 2005).
> > https://openib.org/tiki/tiki-index.php?page=HowToFAQ
>
> That's not really static linking.

OK, it's a difference of terms then :)

> If you try to build a true static
> executable, which contains static libc and in particular static libdl,
> there's no way the old code can work, for multiple reasons. For one
> thing, dlopen(NULL, RTLD_NOW) doesn't work on static executables so
> libibverbs couldn't find a low-level driver that is statically linked
> in.

Does linking in low level driver work now even with -static?

-- 
MST

From mst at mellanox.co.il Thu Nov 2 00:19:49 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Nov 2006 10:19:49 +0200
Subject: [openib-general] [PATCH repost] IB/srp: destroy/recreate qp/cq at reconnect
In-Reply-To:
References:
Message-ID: <20061102081948.GB7468@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH repost] IB/srp: destroy/recreate qp/cq at reconnect
>
> > Roland, what do you think about this patch?
> > Seems like a good idea, to me.
>
> Sorry, I haven't made this a high priority. It seems a little like
> fiddling with the code just for the sake of fiddling -- why pick this
> one place to recreate a CQ? Why not ipoib, etc?

Mainly because changing the QPN in ipoib will affect hardware address,
so it creates more problems than it solves.
-- MST From vuhuong at mellanox.com Thu Nov 2 00:28:34 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Thu, 02 Nov 2006 00:28:34 -0800 Subject: [openib-general] Mellanox SRP target implementation In-Reply-To: <072901c6fe1e$df4f54b0$0c00460a@pato> References: <072901c6fe1e$df4f54b0$0c00460a@pato> Message-ID: <4549AC32.7020003@mellanox.com> Tomoaki, > > Can anybody tell me about the mellanox "SRP target" implementation code which is included in MTD2000 with NFS-RDMA server ? > Is this gen2 base ? > *srp target* is still on gen1 code base - IBGD *nfs-rdma server* is on gen2 code base From monil at voltaire.com Thu Nov 2 01:27:14 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 2 Nov 2006 11:27:14 +0200 Subject: [openib-general] OFED 1.1 Build Issue In-Reply-To: <45476C55.1080300@dev.mellanox.co.il> References: <45470D59.7020705@dev.mellanox.co.il> <45472202.9080104@voltaire.com> <454742C2.2050900@silverstorm.com> <45476C55.1080300@dev.mellanox.co.il> Message-ID: <6a122cc00611020127l1a11f897w845ec3745dd89ea1@mail.gmail.com> Vlad, On 10/31/06, Vladimir Sokolovsky wrote: > > Ramachandra K wrote: > > Moni Shoua wrote: > > > >> We already tried to go this way and found that a local Module.symvers > >> is not always generated (but we might have missed something though). > >> I suggest that you check that this alternative way works under all > >> OSs compilation (SuSE and RedHat to be precise)... > >> > >> > > I think Module.symvers generation for external modules was added sometime > > around 2.6.16, so its not generated on the older kernels (for eg 2.6.9 > > kernels > > on RHEL) > > > > In this scenario, when there is no Module.symvers file, I guess the other > > option is to use a single Kbuild file to build both modules, > > as explained in section 7.3 of Documentation/kbuild/modules.txt. > > > > But this may not be feasible always. 
Come to think of it, why does the
> > OFED installation procedure not update the kernel Module.symvers file
> > when it replaces the old kernel modules present in /lib/modules/
> > with the new ones ?
> >
> >> BTW, Why not updating the kernel Module.symvers when kernel-ib-devel
> >> is installed? This will free the developer from copying it to
> >> his/hers private directory.
> >>
> >>
> > It might be a good idea to update the Module.symvers file as part of the
> > normal installation and not only kernel-ib-devel. Because if the kernel
> > modules are being replaced (or new modules are being added), shouldn't
> > the Module.symvers file also be updated ?
> > Regards,
> > Ram
> Agree,
> Module.symvers should be updated by kernel-ib RPM.

AFAIK Module.symvers is used in compile time only so the same logic
that is used for .h files (the devel package) seems reasonable for it.

--Moni

> So, need to implement Moni's suggestion with light changes: update
> kernel-ib RPM %post and %preun sections instead of kernel-ib-devel RPM
> %pre and %postun.
>
> Regards,
> Vladimir

From mst at mellanox.co.il Thu Nov 2 01:34:38 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Nov 2006 11:34:38 +0200
Subject: [openib-general] OFED 1.1 Build Issue
In-Reply-To: <6a122cc00611020127l1a11f897w845ec3745dd89ea1@mail.gmail.com>
References: <45470D59.7020705@dev.mellanox.co.il> <45472202.9080104@voltaire.com> <454742C2.2050900@silverstorm.com> <45476C55.1080300@dev.mellanox.co.il> <6a122cc00611020127l1a11f897w845ec3745dd89ea1@mail.gmail.com>
Message-ID: <20061102093438.GH7468@mellanox.co.il>

Quoting r.
Moni Levy : > AFAIK Module.symvers is used in compile time only so the same logic > that is used for .h files (the devel package) seems reasonable for it. I agree. It would be nice however for all devel files to go under prefix/. -- MST From sashak at voltaire.com Thu Nov 2 02:53:48 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 12:53:48 +0200 Subject: [openib-general] [PATCH] opensm: strict osm_log arguments/format check Message-ID: <20061102105348.GA16559@sashak.voltaire.com> This adds gcc attribute to osm_log() which causes the compiler to check argument types against a format string. And also there are related fixes in osm_log() usage in opensm and osmtest. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_log.h | 8 +++++++- osm/libvendor/osm_vendor_ibumad_sa.c | 2 +- osm/opensm/main.c | 3 ++- osm/opensm/osm_pkey_mgr.c | 1 + osm/opensm/osm_port_info_rcv.c | 5 +++-- osm/opensm/osm_sa_informinfo.c | 4 ++-- osm/opensm/osm_sa_link_record.c | 8 ++++---- osm/opensm/osm_sa_mad_ctrl.c | 3 ++- osm/opensm/osm_sa_response.c | 2 +- osm/opensm/osm_sm_state_mgr.c | 3 ++- osm/opensm/osm_sminfo_rcv.c | 9 +++++---- osm/opensm/osm_state_mgr.c | 8 ++++---- osm/osmtest/osmt_multicast.c | 12 +++++++----- osm/osmtest/osmt_service.c | 6 +++--- osm/osmtest/osmtest.c | 8 ++++---- 15 files changed, 48 insertions(+), 34 deletions(-) diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h index 62f3a0c..2b24886 100644 --- a/osm/include/opensm/osm_log.h +++ b/osm/include/opensm/osm_log.h @@ -60,6 +60,12 @@ #include #include +#ifdef __GNUC__ +#define STRICT_OSM_LOG_FORMAT __attribute__((format(printf, 3, 4))) +#else +#define STRICT_OSM_LOG_FORMAT +#endif + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -374,7 +380,7 @@ void osm_log( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, - IN const char *p_str, ... ); + IN const char *p_str, ... 
) STRICT_OSM_LOG_FORMAT; void osm_log_raw( diff --git a/osm/libvendor/osm_vendor_ibumad_sa.c b/osm/libvendor/osm_vendor_ibumad_sa.c index 7fd0655..7c4a2f7 100644 --- a/osm/libvendor/osm_vendor_ibumad_sa.c +++ b/osm/libvendor/osm_vendor_ibumad_sa.c @@ -853,7 +853,7 @@ osmv_query_sa( if ( p_mpr_req->sgid_count + p_mpr_req->dgid_count > IB_MULTIPATH_MAX_GIDS ) { osm_log( p_log, OSM_LOG_ERROR, - "osmv_query_sa DBG:001 MULTIPATH_REC ", + "osmv_query_sa DBG:001 MULTIPATH_REC " "SGID count %d DGID count %d max count %d\n", p_mpr_req->sgid_count, p_mpr_req->dgid_count, IB_MULTIPATH_MAX_GIDS ); diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 729702a..752b546 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -460,7 +460,8 @@ parse_ignore_guids_file(IN char *guids_f { osm_log( &p_osm->log, OSM_LOG_ERROR, "parse_ignore_guids_file: ERR 0601: " - "Unable to open ignore guids file (%s)\n" ); + "Unable to open ignore guids file (%s)\n", + guids_file_name ); status = IB_ERROR; goto Exit; } diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index f2cb221..735dc14 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -139,6 +139,7 @@ pkey_mgr_process_physical_port( "pkey_mgr_process_physical_port: ERR 0503: " "Failed to obtain P_Key 0x%04x block and index for node " "0x%016" PRIx64 " port %u\n", + ib_pkey_get_base( pkey ), cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); return; diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index 95112dc..f6d3595 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -724,8 +724,9 @@ osm_pi_rcv_process( { osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "osm_pi_rcv_process: " - "Got light sweep response from remote port of parent node GUID = 0x%" PRIx64 - " port = %u, Commencing heavy sweep\n", + "Got light sweep response from remote port of parent node " + "GUID = 0x%" PRIx64 " port = 0x%016" PRIx64 + ", 
Commencing heavy sweep\n", cl_ntoh64( node_guid ), cl_ntoh64( port_guid ) ); osm_state_mgr_process( p_rcv->p_state_mgr, diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index 69dca1d..da96d35 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -163,8 +163,8 @@ __validate_ports_access_rights( { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__validate_ports_access_rights: ERR 4301: " - "Invalid port guid: 0x%016\n", - portguid ); + "Invalid port guid: 0x%016" PRIx64 "\n", + cl_hton64(portguid) ); valid = FALSE; goto Exit; } diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 751023f..0ca9092 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -145,10 +145,10 @@ __osm_lr_rcv_build_physp_link( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_lr_rcv_build_physp_link: ERR 1801: " "Unable to acquire link record\n" - "\t\t\t\tFrom port 0x%\n" - "\t\t\t\tTo port 0x%\n" - "\t\t\t\tFrom lid 0x%\n" - "\t\t\t\tTo lid 0x%\n", + "\t\t\t\tFrom port 0x%u\n" + "\t\t\t\tTo port 0x%u\n" + "\t\t\t\tFrom lid 0x%u\n" + "\t\t\t\tTo lid 0x%u\n", from_port, to_port, cl_ntoh16(from_lid), cl_ntoh16(to_lid) ); diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c index cd896b6..208f0d2 100644 --- a/osm/opensm/osm_sa_mad_ctrl.c +++ b/osm/opensm/osm_sa_mad_ctrl.c @@ -132,7 +132,8 @@ __osm_sa_mad_ctrl_process( "__osm_sa_mad_ctrl_process: " /* "Responding BUSY status since the dispatcher is already"*/ "Dropping MAD since the dispatcher is already" - " overloaded with %u messages and queue time of:%u[msec]\n", + " overloaded with %u messages and queue time of:" + "%" PRIu64 "[msec]\n", num_messages, last_dispatched_msg_queue_time_msec ); /* send a busy response */ diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c index db36ea2..27f4e9d 100644 --- a/osm/opensm/osm_sa_response.c +++ b/osm/opensm/osm_sa_response.c @@ -117,7 +117,7 @@ 
osm_sa_send_error( if (osm_exit_flag) { osm_log( p_resp->p_log, OSM_LOG_DEBUG, - "osm_sa_send_error: ", + "osm_sa_send_error: " "Ignoring requested send after exit\n" ); goto Exit; } diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c index aadc43a..1ba5eda 100644 --- a/osm/opensm/osm_sm_state_mgr.c +++ b/osm/opensm/osm_sm_state_mgr.c @@ -247,7 +247,8 @@ __osm_sm_state_mgr_send_master_sm_info_r { osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, "__osm_sm_state_mgr_send_master_sm_info_req: ERR 3203: " - "No port object for GUID 0x%X\n", p_sm_mgr->master_guid ); + "No port object for GUID 0x%016" PRIx64 "\n", + cl_hton64(p_sm_mgr->master_guid) ); goto Exit; } diff --git a/osm/opensm/osm_sminfo_rcv.c b/osm/opensm/osm_sminfo_rcv.c index 825b18b..7657e97 100644 --- a/osm/opensm/osm_sminfo_rcv.c +++ b/osm/opensm/osm_sminfo_rcv.c @@ -402,8 +402,8 @@ __osm_sminfo_rcv_process_set_request( osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_sminfo_rcv_process_set_request: " "Received a STANDBY signal. Updating " - "sm_state_mgr master_guid: 0x%X\n", - p_rcv_smi->guid ); + "sm_state_mgr master_guid: 0x%016" PRIx64 "\n", + cl_hton64(p_rcv_smi->guid) ); p_rcv->p_sm_state_mgr->master_guid = p_rcv_smi->guid; } @@ -482,8 +482,9 @@ __osm_sminfo_rcv_process_get_sm( /* we will poll it - as long as it lives - we should be in Standby. */ osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_sminfo_rcv_process_get_sm: " - "Found higher SM. Updating sm_state_mgr master_guid: 0x%X\n", - p_sm->p_port->guid ); + "Found higher SM. 
Updating sm_state_mgr master_guid:" + " 0x%016" PRIx64 "\n", + cl_hton64(p_sm->p_port->guid) ); p_rcv->p_sm_state_mgr->master_guid = p_sm->p_port->guid; } break; diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 28e0c4c..ad22d9e 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -481,7 +481,7 @@ __osm_state_mgr_signal_warning( { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "__osm_state_mgr_signal_warning: " - "Invalid signal %s(%d) in state %s\n", + "Invalid signal %s(%lu) in state %s\n", osm_get_sm_signal_str( signal ), signal, osm_get_sm_state_str( p_mgr->state ) ); } @@ -500,7 +500,7 @@ __osm_state_mgr_signal_error( else osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_signal_error: ERR 3303: " - "Invalid signal %s(%d) in state %s\n", + "Invalid signal %s(%lu) in state %s\n", osm_get_sm_signal_str( signal ), signal, osm_get_sm_state_str( p_mgr->state ) ); } @@ -1480,8 +1480,8 @@ __osm_state_mgr_exists_other_master_sm( { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "__osm_state_mgr_exists_other_master_sm: " - "Found remote master SM with guid:0x%X\n", - p_sm->smi.guid ); + "Found remote master SM with guid:0x%016" PRIx64 "\n", + cl_hton64(p_sm->smi.guid) ); p_sm_res = p_sm; goto Exit; } diff --git a/osm/osmtest/osmt_multicast.c b/osm/osmtest/osmt_multicast.c index 33a4f47..19f9d37 100644 --- a/osm/osmtest/osmt_multicast.c +++ b/osm/osmtest/osmt_multicast.c @@ -1885,8 +1885,9 @@ osmt_run_mcast_flow( IN osmtest_t * cons { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_mcast_flow: ERR 0209: " - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", - p_mc_res->mgid + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) ); status = IB_ERROR; goto Exit; @@ -2044,8 +2045,9 @@ osmt_run_mcast_flow( IN osmtest_t * cons { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_mcast_flow: ERR 0212: " - "Validating MGID failed. 
MGID:0x%016" PRIx64 "\n", - p_mc_res->mgid + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) ); status = IB_ERROR; goto Exit; @@ -3345,7 +3347,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons /* Delete all MCG that are not of IPoIB */ osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_run_mcast_flow : " - "Cleanup all MCG that are not IPoIB...\n", cnt ); + "Cleanup all MCG that are not IPoIB...\n" ); p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; p_mgrp = (osmtest_mgrp_t*)cl_qmap_head( p_mgrp_mlid_tbl ); diff --git a/osm/osmtest/osmt_service.c b/osm/osmtest/osmt_service.c index ec9a39e..ab95fec 100644 --- a/osm/osmtest/osmt_service.c +++ b/osm/osmtest/osmt_service.c @@ -1559,7 +1559,7 @@ osmt_run_service_records_flow( IN osmtes { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_service_records_flow: ERR 4A20: " - "Found service: id: 0x%016 " PRIx64 + "Found service: id: 0x%016" PRIx64 " " "that is invalid\n", id[7] ); status = IB_ERROR; @@ -1573,7 +1573,7 @@ osmt_run_service_records_flow( IN osmtes { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_service_records_flow: ERR 4A21: " - "Fail to find service: id: 0x%016 " PRIx64 + "Fail to find service: id: 0x%016" PRIx64 " " "name: %s\n", id[0], (char*)service_name[0] ); @@ -1588,7 +1588,7 @@ osmt_run_service_records_flow( IN osmtes { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_service_records_flow: ERR 4A22: " - "Fail to find service: id: 0x%016 " PRIx64 + "Fail to find service: id: 0x%016" PRIx64 " " "name: %s\n", id[5], (char*)service_name[6] ); diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 92a4190..a35e0c5 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -2787,7 +2787,8 @@ osmtest_create_inventory_file( IN osmtes { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_create_inventory_file: ERR 0079: " - "Unable to open inventory file (%s)\n" ); + "Unable to open inventory 
file (%s)\n", + p_osmt->opt.file_name ); status = IB_ERROR; goto Exit; } @@ -3356,7 +3357,7 @@ osmtest_validate_path_data( IN osmtest_t osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_validate_path_data: ERR 0012: " "PKEY mismatch on path SLID 0x%X to DLID 0x%X\n" - "\t\t\t\tExpected 0x%X, received 0x%X\n", + "\t\t\t\tExpected 0x%" PRIx64 ", received 0x%" PRIx64 "\n", cl_ntoh16( p_path->rec.slid ), cl_ntoh16( p_path->rec.dlid ), cl_ntoh64( p_path->rec.pkey ), cl_ntoh64( p_rec->pkey ) ); @@ -7165,8 +7166,7 @@ osmtest_bind( IN osmtest_t * p_osmt, { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_bind: ERR 0135: " - "No local ports. Unable to proceed\n", - ib_get_err_str( status ) ); + "No local ports. Unable to proceed\n" ); goto Exit; } guid = attr_array[port_index].port_guid; -- 1.4.3.3.g8387 From rkuchimanchi at silverstorm.com Thu Nov 2 02:49:12 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Thu, 02 Nov 2006 16:19:12 +0530 Subject: [openib-general] OFED 1.1 Build Issue In-Reply-To: <20061102093438.GH7468@mellanox.co.il> References: <45470D59.7020705@dev.mellanox.co.il> <45472202.9080104@voltaire.com> <454742C2.2050900@silverstorm.com> <45476C55.1080300@dev.mellanox.co.il> <6a122cc00611020127l1a11f897w845ec3745dd89ea1@mail.gmail.com> <20061102093438.GH7468@mellanox.co.il> Message-ID: <4549CD28.50605@silverstorm.com> Michael S. Tsirkin wrote: >Quoting r. Moni Levy : > > >>AFAIK Module.symvers is used in compile time only so the same logic >>that is used for .h files (the devel package) seems reasonable for it. >> >> > >I agree. It would be nice however for all devel files to go under prefix/. > > > That raises a basic doubt for me. What is the general convention about include files in /lib/modules/... when installing new kernel modules ? Should the include files always correspond to the kernel modules that are installed ? 
I am thinking of a scenario where a user does not install the development package, in which case their IB include files and the kernel Module.symvers are essentially stale. Later on, if they try to compile another kernel module that depends on the IB modules, that module will refuse to load due to the difference in symbol versions in the old Module.symvers and the currently loaded IB kernel modules. Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Thu Nov 2 02:57:16 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 12:57:16 +0200 Subject: [openib-general] [PATCH TRIVIAL] opensm: osm_sm_state_mgr.h trivial indentation fixes Message-ID: <20061102105716.GB16559@sashak.voltaire.com> Trivial indentation fixes in osm_sm_state_mgr.h Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_sm_state_mgr.h | 40 +++++++++++++++++--------------- 1 files changed, 21 insertions(+), 19 deletions(-) diff --git a/osm/include/opensm/osm_sm_state_mgr.h b/osm/include/opensm/osm_sm_state_mgr.h index 87dc5a7..5f60276 100644 --- a/osm/include/opensm/osm_sm_state_mgr.h +++ b/osm/include/opensm/osm_sm_state_mgr.h @@ -107,15 +107,15 @@ BEGIN_C_DECLS */ typedef struct _osm_sm_state_mgr { - cl_spinlock_t state_lock; - cl_timer_t polling_timer; - uint32_t retry_number; - ib_net64_t master_guid; - osm_state_mgr_t* p_state_mgr; - osm_subn_t* p_subn; - osm_req_t* p_req; - osm_log_t* p_log; - osm_remote_sm_t* p_polling_sm; + cl_spinlock_t state_lock; + cl_timer_t polling_timer; + uint32_t retry_number; + ib_net64_t master_guid; + osm_state_mgr_t* p_state_mgr; + osm_subn_t* p_subn; + osm_req_t* p_req; + osm_log_t* p_log; + osm_remote_sm_t* p_polling_sm; } osm_sm_state_mgr_t; /* @@ -124,26 +124,28 @@ typedef struct _osm_sm_state_mgr * Spinlock guarding the state and processes. * * retry_number -* Used on Standby state - to count the number of retries of queries to the master SM. 
+* Used on Standby state - to count the number of retries +* of queries to the master SM. * -* polling_timer -* Timer for polling +* polling_timer +* Timer for polling * -* p_state_mgr -* Point to the state manager object +* p_state_mgr +* Point to the state manager object * * p_subn * Pointer to the Subnet object for this subnet. * -* p_req +* p_req * Pointer to the generic attribute request object. * * p_log * Pointer to the log object. * -* p_polling_sm -* Pointer to a osm_remote_sm_t object. When our SM needs to poll on a remote -* sm, this will be the pointer of the polled SM. +* p_polling_sm +* Pointer to a osm_remote_sm_t object. When our SM needs +* to poll on a remote sm, this will be the pointer of the +* polled SM. * * SEE ALSO * SM State Manager object @@ -298,7 +300,7 @@ osm_sm_state_mgr_process( * * DESCRIPTION * Signals that the remote Master SM is alive. -* Need to clear the retry_number variable. +* Need to clear the retry_number variable. * * SYNOPSIS */ -- 1.4.3.3.g8387 From ogerlitz at voltaire.com Thu Nov 2 03:04:40 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 02 Nov 2006 13:04:40 +0200 Subject: [openib-general] [PATCH 1/3] uDAPL cma: add support for new client register event In-Reply-To: <000001c6fe16$fcaa9090$bb97070a@amr.corp.intel.com> References: <000001c6fe16$fcaa9090$bb97070a@amr.corp.intel.com> Message-ID: <4549D0C8.9010705@voltaire.com> Arlin Davis wrote: > Added support for new ib verbs client register event. No extra > processing required at the uDAPL level. Shows up if opensm bounces. > Index: dapl/openib_cma/dapl_ib_util.c > =================================================================== > --- dapl/openib_cma/dapl_ib_util.c (revision 9916) > +++ dapl/openib_cma/dapl_ib_util.c (working copy) > @@ -744,9 +744,16 @@ Arlin, Can you please generate the patches with the -p flag which adds the function/structure context and resend? else it is not really possible to review your work. 
You might want to use this alias:

alias svndiff='/usr/bin/svn diff --diff-cmd=/usr/bin/diff -x -up'

Or.

From ogerlitz at voltaire.com Thu Nov 2 03:17:11 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 02 Nov 2006 13:17:11 +0200
Subject: [openib-general] [PATCH 3/3] uDAPL cma: add support for address and route retries, call disconnect when recving dreq
In-Reply-To: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com>
References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com>
Message-ID: <4549D3B7.1050208@voltaire.com>

Arlin Davis wrote:
> Fix some timeout and long disconnect delay issues discovered during
> scale-out testing. Added support to retry rdma_cm address and route
> resolution with configuration options and provide a disconnect call when
> receiving the disconnect request to force an immediate disconnect reply
> to the remote side.

It would be very nice if you shared with the community the IB stack
issues revealed under scale-out testing... basically, what was the testbed?

From what the patch does I understand you attempt to handle timeout on
address and route resolution and long disconnect delay.

Was the issue with address resolution being ARP request or reply
messages getting lost?

Was the issue with route resolution being timeout on SA Path queries?

Please note that for the first two, you want to retry if the event
status is -ETIMEDOUT; the patch ignores the status field.

Was the issue with disconnect delay that peer A called
dat_ep_disconnect() (i.e. sending DREQ) and the DREP was sent only when
peer B got the disconnect event and called dat_ep_disconnect()? So now
the DREP is sent from within the provider code when it gets the DREQ?

Or.
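The retry rule raised in the message above (retry address or route resolution when the event status is -ETIMEDOUT, rather than on any error) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the uDAPL patch and not an rdma_cm API: `struct resolve_ctx`, `should_retry()` and the `max_retries` limit are hypothetical names made up for the sketch; only `ETIMEDOUT` from `<errno.h>` is real.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-resolution bookkeeping; not a uDAPL or librdmacm type. */
struct resolve_ctx {
    int retries_done;  /* retries already issued for this resolution step */
    int max_retries;   /* assumed configurable retry limit */
};

/*
 * Decide whether an address/route resolution step should be retried.
 * 'status' is the status value carried by the error event. A retry is
 * granted only for a timeout (-ETIMEDOUT) and only while the retry
 * budget is not exhausted; every other failure is treated as final.
 */
static bool should_retry(struct resolve_ctx *ctx, int status)
{
    if (status != -ETIMEDOUT)
        return false;
    if (ctx->retries_done >= ctx->max_retries)
        return false;
    ctx->retries_done++;
    return true;
}
```

An event handler built on this idea would call should_retry() with the event's status before re-issuing the resolve request, and report the error to the consumer when it returns false.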
From monil at voltaire.com Thu Nov 2 03:21:18 2006 From: monil at voltaire.com (Moni Levy) Date: Thu, 2 Nov 2006 13:21:18 +0200 Subject: [openib-general] OFED 1.1 Build Issue In-Reply-To: <4549CD28.50605@silverstorm.com> References: <45470D59.7020705@dev.mellanox.co.il> <45472202.9080104@voltaire.com> <454742C2.2050900@silverstorm.com> <45476C55.1080300@dev.mellanox.co.il> <6a122cc00611020127l1a11f897w845ec3745dd89ea1@mail.gmail.com> <20061102093438.GH7468@mellanox.co.il> <4549CD28.50605@silverstorm.com> Message-ID: <6a122cc00611020321o2cc8e646n866dd8a9b01836ff@mail.gmail.com> On 11/2/06, Ramachandra K wrote: > Michael S. Tsirkin wrote: > Quoting r. Moni Levy : > AFAIK Module.symvers is used in compile time only so the same logic that is > used for .h files (the devel package) seems reasonable for it. > I agree. It would be nice however for all devel files to go under prefix/. > That raises a basic doubt for me. What is the general convention about > include files in /lib/modules/... when installing new kernel modules ? > Should the include files always correspond to the kernel modules that are > installed ? I am thinking of a scenario where a user does not install the > development package, in which case their IB include files and the kernel > Module.symvers are essentially stale. I think that the basic assumption should be that there are no .h files unless kernel-devel or kernel-sources type packages are installed. User that intends to develop/compile should install the devel/source package and in that case he will have the appropriate matching snapshot of the .h files. -- Moni Later on, if they try to compile > another kernel module that depends on the IB modules, that module will > refuse to load due to the difference in symbol versions in the old > Module.symvers and the currently loaded IB kernel modules. 
Regards, Ram From halr at voltaire.com Thu Nov 2 04:37:26 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 07:37:26 -0500 Subject: [openib-general] [PATCH TRIVIAL] opensm: trivial log message fix In-Reply-To: <20061101213652.GE9985@sashak.voltaire.com> References: <20061101213652.GE9985@sashak.voltaire.com> Message-ID: <1162471039.29957.106309.camel@hal.voltaire.com> On Wed, 2006-11-01 at 16:36, Sasha Khapyorsky wrote: > Trivial log message fix. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From halr at voltaire.com Thu Nov 2 04:37:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 07:37:41 -0500 Subject: [openib-general] [PATCH TRIVIAL] opensm: osm_sm_state_mgr.h trivial indentation fixes In-Reply-To: <20061102105716.GB16559@sashak.voltaire.com> References: <20061102105716.GB16559@sashak.voltaire.com> Message-ID: <1162471041.29957.106311.camel@hal.voltaire.com> On Thu, 2006-11-02 at 05:57, Sasha Khapyorsky wrote: > Trivial indentation fixes in osm_sm_state_mgr.h > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From jsquyres at cisco.com Thu Nov 2 04:40:50 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 2 Nov 2006 07:40:50 -0500 Subject: [openib-general] Static linking with libibverbs In-Reply-To: <20061102081941.GA7468@mellanox.co.il> References: <20061102081941.GA7468@mellanox.co.il> Message-ID: <2B4E1E5C-8B34-46E6-B9C4-DC937100BC0C@cisco.com> On Nov 2, 2006, at 3:19 AM, Michael S. Tsirkin wrote: >>> static linking actually can be made to work even with older >>> library versions. >>> See this HowTo (written on 02 of November, 2005). >>> https://openib.org/tiki/tiki-index.php?page=HowToFAQ >> >> That's not really static linking. 
> > OK, its a difference of terms then :) Static linking means making an executable that does not link to dynamic libraries at all (e.g., run "ldd a.out" and it says "not a dynamic executable"). Linking to static libraries is simply that -- linking to static libraries. >> If you try to build a true static >> executable, which contains static libc and in particular static >> libdl, >> there's no way the old code can work, for multiple reasons. For one >> thing, dlopen(NULL, RTLD_NOW) doesn't work on static executables so >> libibverbs couldn't find a low-level driver that is statically linked >> in. > > Does linking in low level driver work now even with -static? Yes. See the FAQ items on the OMPI web site from my first mail. -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From sashak at voltaire.com Thu Nov 2 05:01:44 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 15:01:44 +0200 Subject: [openib-general] [PATCH TRIVIAL] management/libib*: strip trailing whitespaces Message-ID: <20061102130144.GC16867@sashak.voltaire.com> Strip trailing whitespaces for libibcommon, libibumad, libibmad. 
Signed-off-by: Sasha Khapyorsky --- libibcommon/include/infiniband/common.h | 2 +- libibcommon/src/hash.c | 6 +++--- libibmad/include/infiniband/mad.h | 4 ++-- libibmad/src/dump.c | 12 ++++++------ libibmad/src/fields.c | 4 ++-- libibmad/src/gs.c | 6 +++--- libibmad/src/register.c | 2 +- libibmad/src/resolve.c | 4 ++-- libibmad/src/rpc.c | 2 +- libibmad/src/sa.c | 2 +- libibmad/src/serv.c | 2 +- libibmad/src/smp.c | 4 ++-- libibumad/include/infiniband/umad.h | 4 ++-- libibumad/src/umad.c | 26 +++++++++++++------------- 14 files changed, 40 insertions(+), 40 deletions(-) diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index 3537bdf..83c0679 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -152,7 +152,7 @@ __attribute__((unused)) static char _bui #endif __attribute__((unused)) static inline char* -get_build_version(void) +get_build_version(void) { return _build_version; } diff --git a/libibcommon/src/hash.c b/libibcommon/src/hash.c index 8f216a1..d05d221 100644 --- a/libibcommon/src/hash.c +++ b/libibcommon/src/hash.c @@ -57,16 +57,16 @@ For every delta with one or two bits set have at least 1/4 probability of changing. * If mix() is run forward, every bit of c will change between 1/3 and 2/3 of the time. (Well, 22/100 and 78/100 for some 2-bit deltas.) -mix() was built out of 36 single-cycle latency instructions in a +mix() was built out of 36 single-cycle latency instructions in a structure that could supported 2x parallelism, like so: - a -= b; + a -= b; a -= c; x = (c>>13); b -= c; a ^= x; b -= a; x = (a<<8); c -= a; b ^= x; c -= b; x = (b>>13); ... - Unfortunately, superscalar Pentiums and Sparcs can't take advantage + Unfortunately, superscalar Pentiums and Sparcs can't take advantage of that parallelism. They've also turned some of those single-cycle latency instructions into multi-cycle latency instructions. Still, this is the fastest good hash I could find. 
There were about 2^^68 diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 523f630..b6bbcbc 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -257,7 +257,7 @@ enum MAD_FIELDS { IB_SM_DATA_F, /* bytes 64 - 256 */ - IB_GS_DATA_F, + IB_GS_DATA_F, /* bytes 128 - 191 */ IB_DRSMP_PATH_F, @@ -602,7 +602,7 @@ enum { IB_NODE_ROUTER, NODE_RNIC, - IB_NODE_MAX = NODE_RNIC + IB_NODE_MAX = NODE_RNIC }; /******************************************************************************/ diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c index 1042ab1..eab3a8e 100644 --- a/libibmad/src/dump.c +++ b/libibmad/src/dump.c @@ -247,7 +247,7 @@ mad_dump_linkwidthsup(char *buf, int buf break; case 15: snprintf(buf, bufsz, "1X or 4X or 8X or 12X"); - break; + break; default: IBWARN("bad width %d", width); buf[0] = 0; @@ -637,7 +637,7 @@ ib_slvl_get_i(ib_slvl_table_t *tbl, int } typedef struct _ib_vl_arb_element { - uint8_t res_vl; + uint8_t res_vl; uint8_t weight; } __attribute__((packed)) ib_vl_arb_element_t; @@ -806,7 +806,7 @@ _mad_dump_field(ib_field_t *f, char *nam dots[32 - l] = 0; } - n = snprintf(buf, bufsz, "%s:%s", name, dots); + n = snprintf(buf, bufsz, "%s:%s", name, dots); _mad_dump_val(f, buf + n, bufsz - n, val); buf[bufsz - 1] = 0; @@ -816,13 +816,13 @@ _mad_dump_field(ib_field_t *f, char *nam int _mad_dump(ib_mad_dump_fn *fn, char *name, void *val, int valsz) { - ib_field_t f = { .def_dump_fn = fn, .bitlen = valsz * 8}; + ib_field_t f = { .def_dump_fn = fn, .bitlen = valsz * 8}; char buf[512]; - return printf("%s\n", _mad_dump_field(&f, name, buf, sizeof buf, val)); + return printf("%s\n", _mad_dump_field(&f, name, buf, sizeof buf, val)); } -int +int _mad_print_field(ib_field_t *f, char *name, void *val, int valsz) { return _mad_dump(f->def_dump_fn, name ? name : f->name, val, valsz ? 
valsz : ALIGN(f->bitlen, 8) / 8); diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 3f0ed44..d100713 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -323,7 +323,7 @@ ib_field_t ib_mad_f [] = { [IB_ATS_SM_MAGIC_KEY_F] {BITSOFFS(16*8, 16), "ATSMagicKey", mad_dump_hex}, [IB_ATS_SM_NODE_TYPE_F] {BITSOFFS(18*8, 16), "ATSNodeType", mad_dump_hex}, [IB_ATS_SM_NODE_NAME_F] {32*8, 32*8, "ATSNodeName", mad_dump_string}, - + /* * SLTOVL MAPPING TABLE */ @@ -348,7 +348,7 @@ ib_field_t ib_mad_f [] = { [IB_PC_EXT_XMT_BYTES_F] {64, 64, "PortXmitData", mad_dump_uint}, [IB_PC_EXT_RCV_BYTES_F] {128, 64, "PortRcvData", mad_dump_uint}, [IB_PC_EXT_XMT_PKTS_F] {192, 64, "PortXmitPkts", mad_dump_uint}, - [IB_PC_EXT_RCV_PKTS_F] {256, 64, "PortRcvPkts", mad_dump_uint}, + [IB_PC_EXT_RCV_PKTS_F] {256, 64, "PortRcvPkts", mad_dump_uint}, [IB_PC_EXT_XMT_UPKTS_F] {320, 64, "PortUnicastXmitPkts", mad_dump_uint}, [IB_PC_EXT_RCV_UPKTS_F] {384, 64, "PortUnicastRcvPkts", mad_dump_uint}, [IB_PC_EXT_XMT_MPKTS_F] {448, 64, "PortMulticastXmitPkts", mad_dump_uint}, diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index 8a4bd6d..d45dc9f 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -78,7 +78,7 @@ pma_query(void *rcvbuf, ib_portid_t *des if (!dest->qkey) dest->qkey = IB_DEFAULT_QP1_QKEY; - return madrpc(&rpc, dest, rcvbuf, rcvbuf); + return madrpc(&rpc, dest, rcvbuf, rcvbuf); } uint8_t * @@ -109,7 +109,7 @@ performance_reset(void *rcvbuf, ib_porti if (!mask) mask = ~0; - + rpc.mgtclass = IB_PERFORMANCE_CLASS; rpc.method = IB_MAD_METHOD_SET; rpc.attr.id = id; @@ -127,7 +127,7 @@ performance_reset(void *rcvbuf, ib_porti if (!dest->qkey) dest->qkey = IB_DEFAULT_QP1_QKEY; - return madrpc(&rpc, dest, rcvbuf, rcvbuf); + return madrpc(&rpc, dest, rcvbuf, rcvbuf); } uint8_t * diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 52d6989..602b997 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -67,7 +67,7 @@ register_agent(int agent, 
int mclass) memset(class_agent, 0xff, sizeof class_agent); memset(agent_class, 0xff, sizeof agent_class); } - + if (mclass < 0 || mclass >= MAX_CLASS || agent < 0 || agent >= MAX_AGENTS) { DEBUG("bad mgmt class %d or agent %d", mclass, agent); diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 93ddb43..da505d2 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -87,7 +87,7 @@ ib_resolve_guid(ib_portid_t *portid, uin return 0; } - + int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, int dest_type, ib_portid_t *sm_id) { @@ -144,7 +144,7 @@ ib_resolve_self(ib_portid_t *portid, int uint8_t portinfo[64]; uint8_t nodeinfo[64]; uint64_t guid, prefix; - + if (!smp_query(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0)) return -1; diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index e90920b..142f8d8 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -111,7 +111,7 @@ madrpc_portid(void) return mad_portid; } -static int +static int _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, int timeout) { diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c index 8de49e5..7b4b8a0 100644 --- a/libibmad/src/sa.c +++ b/libibmad/src/sa.c @@ -111,7 +111,7 @@ sa_rpc_call(void *ibmad_port, void *rcvb #define IB_PR_DEF_MASK (IB_PR_COMPMASK_DGID |\ IB_PR_COMPMASK_SGID |\ IB_PR_COMPMASK_NUMBPATH) - + int ib_path_query(ib_gid_t srcgid, ib_gid_t destgid, ib_portid_t *sm_id, void *buf) { diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index 3b100c8..63be97c 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -142,7 +142,7 @@ mad_respond(void *umad, ib_portid_t *por if (mad_build_pkt(umad, &rpc, portid, 0, 0) < 0) return -1; - if (ibdebug > 1) + if (ibdebug > 1) xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE); if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad, diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c index ff55997..f6477a1 100644 --- a/libibmad/src/smp.c +++ 
b/libibmad/src/smp.c @@ -72,7 +72,7 @@ smp_set(void *data, ib_portid_t *portid, portid->sl = 0; portid->qp = 0; - return madrpc(&rpc, portid, data, data); + return madrpc(&rpc, portid, data, data); } uint8_t * @@ -99,5 +99,5 @@ smp_query(void *rcvbuf, ib_portid_t *por portid->sl = 0; portid->qp = 0; - return madrpc(&rpc, portid, 0, rcvbuf); + return madrpc(&rpc, portid, 0, rcvbuf); } diff --git a/libibumad/include/infiniband/umad.h b/libibumad/include/infiniband/umad.h index 6026a86..8058263 100644 --- a/libibumad/include/infiniband/umad.h +++ b/libibumad/include/infiniband/umad.h @@ -178,7 +178,7 @@ int umad_get_fd(int portid); int umad_register(int portid, int mgmt_class, int mgmt_version, uint8_t rmpp_version, uint32_t method_mask[4]); -int umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, +int umad_register_oui(int portid, int mgmt_class, uint8_t rmpp_version, uint8_t oui[3], uint32_t method_mask[4]); int umad_unregister(int portid, int agentid); @@ -191,7 +191,7 @@ void umad_dump(void *umad); static inline void * umad_alloc(int num, size_t size) /* alloc array of umad buffers */ { - return calloc(num, size); + return calloc(num, size); } static inline void diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 860dbef..71b6833 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -292,7 +292,7 @@ resolve_ca_name(char *ca_name, int *best return 0; return ca_name; } - + /* Get the list of CA names */ if ((n = umad_get_cas_names((void *)names, UMAD_CA_NAME_LEN)) < 0) return 0; @@ -300,7 +300,7 @@ resolve_ca_name(char *ca_name, int *best /* Find the first existing CA with an active port */ for (caidx = 0; caidx < n; caidx++) { TRACE("checking ca '%s'", names[caidx]); - + port = *best_port; if ((port_type = resolve_ca_port(names[caidx], &port)) < 0) continue; @@ -345,7 +345,7 @@ get_ca(char *ca_name, umad_ca_t *ca) int portnum; strncpy(ca->ca_name, ca_name, sizeof ca->ca_name); - + snprintf(dir_name, sizeof dir_name - 1, "%s/%s", 
SYS_INFINIBAND, ca->ca_name); dir_name[sizeof dir_name - 1] = 0; @@ -506,9 +506,9 @@ umad_get_cas_names(char cas[][UMAD_CA_NA n = scandir(SYS_INFINIBAND, &namelist, 0, alphasort); if (n > 0) { for (i = 0; i < n; i++) { - if (!strcmp(namelist[i]->d_name, ".") || + if (!strcmp(namelist[i]->d_name, ".") || !strcmp(namelist[i]->d_name, "..")) { - } else + } else strncpy(cas[j++], namelist[i]->d_name, UMAD_CA_NAME_LEN); free(namelist[i]); @@ -615,7 +615,7 @@ umad_release_ca(umad_ca_t *ca) return r; DEBUG("releasing %s", ca->ca_name); - return 0; + return 0; } int @@ -647,7 +647,7 @@ umad_release_port(umad_port_t *port) return r; DEBUG("releasing %s:%d", port->ca_name, port->portnum); - return 0; + return 0; } int @@ -660,7 +660,7 @@ umad_close_port(int portid) return -EINVAL; close(port->dev_fd); - + port_free(port); DEBUG("closed %s fd %d", port->dev_file, port->dev_fd); @@ -892,10 +892,10 @@ umad_register_oui(int portid, int mgmt_c portid, req.id, req.qpn, oui); return req.id; /* return agentid */ } - + DEBUG("portid %d registering qp %d class 0x%x version %d oui 0x%x failed: %m", portid, req.qpn, req.mgmt_class, req.mgmt_class_version, oui); - return -EPERM; + return -EPERM; } int @@ -923,17 +923,17 @@ umad_register(int portid, int mgmt_class else memset(req.method_mask, 0, sizeof req.method_mask); - memcpy(&req.oui, (char *)&oui + 1, sizeof req.oui); + memcpy(&req.oui, (char *)&oui + 1, sizeof req.oui); if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { DEBUG("portid %d registered to use agent %d qp %d", portid, req.id, qp); return req.id; /* return agentid */ } - + DEBUG("portid %d registering qp %d class 0x%x version %d failed: %m", portid, qp, mgmt_class, mgmt_version); - return -EPERM; + return -EPERM; } int -- 1.4.3.3.g8387 From sashak at voltaire.com Thu Nov 2 05:04:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 15:04:32 +0200 Subject: [openib-general] [PATCH TRIVIAL] diags: strip trailing whitespaces 
Message-ID: <20061102130432.GD16867@sashak.voltaire.com> Strip trailing whitespaces in diags. Signed-off-by: Sasha Khapyorsky --- diags/src/grouping.c | 60 ++++++++++++++++++++++---------------------- diags/src/ibaddr.c | 2 +- diags/src/ibnetdiscover.c | 12 ++++---- diags/src/ibping.c | 2 +- diags/src/ibportstate.c | 2 +- diags/src/ibroute.c | 4 +- diags/src/ibsysstat.c | 4 +- diags/src/ibtracert.c | 24 +++++++++--------- diags/src/saquery.c | 2 +- diags/src/smpdump.c | 6 ++-- diags/src/smpquery.c | 4 +- 11 files changed, 61 insertions(+), 61 deletions(-) diff --git a/diags/src/grouping.c b/diags/src/grouping.c index fbca4e0..09ac10a 100644 --- a/diags/src/grouping.c +++ b/diags/src/grouping.c @@ -77,7 +77,7 @@ char *get_chassis_slot(unsigned char cha return ChassisSlotStr[chassisslot]; } -static struct ChassisList *find_chassisnum(unsigned char chassisnum) +static struct ChassisList *find_chassisnum(unsigned char chassisnum) { ChassisList *current; @@ -192,7 +192,7 @@ int anafa_spine4_slot_2_slb[25] = { 0, 1 static void get_sfb_slot(Node *node, Port *lineport) { ChassisRecord *ch = node->chrecord; - + ch->chassisslot = SPINE_CS; if (is_spine_9096(node)) { ch->chassistype = ISR9096_CT; @@ -210,7 +210,7 @@ static void get_router_slot(Node *node, ChassisRecord *ch = node->chrecord; int guessnum = 0; - if (!ch) { + if (!ch) { if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) IBPANIC("out of mem"); ch = node->chrecord; @@ -229,7 +229,7 @@ static void get_router_slot(Node *node, /* module 1 <--> remote anafa 3 */ /* module 2 <--> remote anafa 2 */ /* module 3 <--> remote anafa 1 */ - ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); + ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 
3 : 2)); } } @@ -260,7 +260,7 @@ static void fill_chassis_record(Node *no if (node->chrecord) /* somehow this node has already been passed */ return; - + if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) IBPANIC("out of mem"); @@ -285,7 +285,7 @@ static void fill_chassis_record(Node *no /* we assume here that remoteport belongs to line */ get_sfb_slot(node, port->remoteport); - /* we could break here, but need to find if more routers connected */ + /* we could break here, but need to find if more routers connected */ } } else if (is_line(node)) { @@ -307,7 +307,7 @@ static int get_line_index(Node *node) { int retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - if (retval > LINES_MAX_NUM || retval < 1) + if (retval > LINES_MAX_NUM || retval < 1) IBPANIC("Grouping: Internal error"); return retval; } @@ -319,9 +319,9 @@ static int get_spine_index(Node *node) if (is_spine_9288(node)) retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; else - retval = node->chrecord->slotnum; + retval = node->chrecord->slotnum; - if (retval > SPINES_MAX_NUM || retval < 1) + if (retval > SPINES_MAX_NUM || retval < 1) IBPANIC("Grouping: Internal error"); return retval; } @@ -330,7 +330,7 @@ static void insert_line_router(Node *nod { int i = get_line_index(node); - if (chassislist->linenode[i]) + if (chassislist->linenode[i]) return; /* already filled slot */ chassislist->linenode[i] = node; @@ -357,7 +357,7 @@ static void pass_on_lines_catch_spines(C for (i = 1; i <= LINES_MAX_NUM; i++) { node = chassislist->linenode[i]; - if (!(node && is_line(node))) + if (!(node && is_line(node))) continue; /* empty slot or router */ for (port = node->ports; port; port = port->next) { @@ -383,7 +383,7 @@ static void pass_on_spines_catch_lines(C for (i = 1; i <= SPINES_MAX_NUM; i++) { node = chassislist->spinenode[i]; - if (!node) + if (!node) continue; /* empty slot */ for (port = node->ports; port; port = port->next) { if (!port->remoteport) @@ -399,34 
+399,34 @@ static void pass_on_spines_catch_lines(C /* Stupid interpolation algorithm... - But nothing to do - have to be compliant with VoltaireSM/NMS + But nothing to do - have to be compliant with VoltaireSM/NMS */ static void pass_on_spines_interpolate_chguid(ChassisList *chassislist) { Node *node; int i; - + for (i = 1; i <= SPINES_MAX_NUM; i++) { node = chassislist->spinenode[i]; - if (!node) + if (!node) continue; /* skip the empty slots */ /* take first guid less one: consistent with SM... */ chassislist->chassisguid = node->nodeguid - 1; break; } -} +} /* This function fills chassislist structure with all nodes - in that chassis + in that chassis chassislist structure = structure of one standalone chassis */ static void build_chassis(Node *node, ChassisList *chassislist) { Node *remnode = 0; Port *port = 0; - + /* we get here with node = chassis_spine */ chassislist->chassistype = node->chrecord->chassistype; insert_spine(node, chassislist); @@ -442,12 +442,12 @@ static void build_chassis(Node *node, Ch insert_line_router(remnode, chassislist); } - + pass_on_lines_catch_spines(chassislist); /* this pass needed for to catch routers, since routers connected only */ /* to spines in slot 1 or 4 and we could miss them first time */ pass_on_spines_catch_lines(chassislist); - + /* additional 2 passes needed for to overcome a problem of pure "in-chassis" */ /* connectivity - extra pass to ensure that all related chips/modules */ /* inserted into the chassislist */ @@ -465,8 +465,8 @@ Description : On ISR9288/9096 external p is not matching the internal ( anafa ) port indexes. Use this MAP to translate the data you get from the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.) 
- - + + Module : sLB-24 anafa 1 anafa 2 ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 @@ -474,14 +474,14 @@ int port | 22 23 24 18 17 16 | 22 23 24 ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 int port | 19 20 21 15 14 13 | 19 20 21 15 14 13 ------------------------------------------------ - + Module : sLB-8 anafa 1 anafa 2 ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 - + -----------> anafa 1 anafa 2 ext port | - - 5 - - 6 | - - 7 - - 8 @@ -492,7 +492,7 @@ int port | 21 20 19 15 14 13 | 21 20 19 */ -int int2ext_map_slb24[2][25] = { +int int2ext_map_slb24[2][25] = { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 }, { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 } }; @@ -504,7 +504,7 @@ int int2ext_map_slb8[2][25] = { /* This function relevant only for line modules/chips - Returns string with external port index + Returns string with external port index */ char *portmapstring(Port *port) { @@ -522,11 +522,11 @@ char *portmapstring(Port *port) return NULL; memset(mapping, 0, sizeof(mapping)); - + chipnum = ch->anafanum - 1; if (is_line_24(node)) - pindex = int2ext_map_slb24[chipnum][portnum]; + pindex = int2ext_map_slb24[chipnum][portnum]; else pindex = int2ext_map_slb8[chipnum][portnum]; @@ -550,7 +550,7 @@ static void add_chassislist() } } -/* +/* Main grouping function Algorithm: 1. pass on every Voltaire node @@ -571,7 +571,7 @@ void group_nodes() mylist.current = NULL; mylist.last = NULL; - /* first pass on switches and build for every Voltaire node */ + /* first pass on switches and build for every Voltaire node */ /* an appropriate chassis record (slotnum and position) */ /* according to internal connectivity */ /* not very efficient but clear code so... 
*/ diff --git a/diags/src/ibaddr.c b/diags/src/ibaddr.c index 25f9229..66ef71a 100644 --- a/diags/src/ibaddr.c +++ b/diags/src/ibaddr.c @@ -80,7 +80,7 @@ ib_resolve_addr(ib_portid_t *portid, int char buf1[64], buf2[64]; ib_gid_t gid; int lmc; - + if (!smp_query(nodeinfo, portid, IB_ATTR_NODE_INFO, 0, 0)) return -1; diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c index dac17d5..c6e35e4 100644 --- a/diags/src/ibnetdiscover.c +++ b/diags/src/ibnetdiscover.c @@ -200,7 +200,7 @@ clean_nodedesc(char *nodedesc) nodedesc[63] = '\0'; for (i = 0; i < 64; i++) { if (iscntrl(nodedesc[i]) || nodedesc[i] == '\0') { - nodedesc[i] = '\0'; + nodedesc[i] = '\0'; break; } } @@ -215,7 +215,7 @@ dump_endnode(ib_portid_t *path, char *pr #if __WORDSIZE == 64 fprintf(f, "%s -> %s %s {%016lx} portnum %d lid %d-%d\"%s\"\n", - portid2str(path), prompt, + portid2str(path), prompt, (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum, port->lid, port->lid + (1 << port->lmc) - 1, @@ -418,10 +418,10 @@ discover(ib_portid_t *from) for (i = 1; i <= node->numports; i++) { if (i == node->localport) continue; - + if (!(port = calloc(1, sizeof(Port)))) IBERROR("out of memory"); - + if (get_port(port, i, path) < 0) { IBWARN("can't reach node %s port %d", portid2str(path), i); return 0; @@ -696,7 +696,7 @@ dump_topology(int listtype, int group) DEBUG("SWITCH: dist %d node %p", dist, node); /* Now, skip chassis based switches */ if (node->chrecord) - if (node->chrecord->chassisnum) + if (node->chrecord->chassisnum) continue; out_switch(node, group); @@ -731,7 +731,7 @@ dump_topology(int listtype, int group) void usage(void) -{ +{ fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -V(ersion) -C ca_name -P ca_port " "-t(imeout) timeout_ms] []\n", argv0); diff --git a/diags/src/ibping.c b/diags/src/ibping.c index 6b1c8a3..98f9bef 100644 --- 
a/diags/src/ibping.c +++ b/diags/src/ibping.c @@ -323,7 +323,7 @@ main(int argc, char **argv) if (!argc && !server) usage(); - + madrpc_init(ca, ca_port, mgmt_classes, 3); if (server) { diff --git a/diags/src/ibportstate.c b/diags/src/ibportstate.c index 1af87c7..b8f6f4d 100644 --- a/diags/src/ibportstate.c +++ b/diags/src/ibportstate.c @@ -119,7 +119,7 @@ get_port_info(ib_portid_t *dest, uint8_t return 0; } -static int +static int set_port_info(ib_portid_t *dest, uint8_t *data, int portnum, int port_op) { char buf[2048]; diff --git a/diags/src/ibroute.c b/diags/src/ibroute.c index 4fb334d..21be3f3 100644 --- a/diags/src/ibroute.c +++ b/diags/src/ibroute.c @@ -187,7 +187,7 @@ dump_multicast_tables(ib_portid_t *porti IBWARN("illegal start mlid %x, set to %x", startlid, IB_MIN_MCAST_LID); startlid = IB_MIN_MCAST_LID; } - + if (endlid > IB_MAX_MCAST_LID) { IBWARN("illegal end mlid %x, truncate to %x", endlid, IB_MAX_MCAST_LID); endlid = IB_MAX_MCAST_LID; @@ -333,7 +333,7 @@ dump_unicast_tables(ib_portid_t *portid, if (!endlid || endlid > top) endlid = top; - + if (endlid > IB_MAX_UCAST_LID) { IBWARN("ilegal lft top %d, truncate to %d", endlid, IB_MAX_UCAST_LID); endlid = IB_MAX_UCAST_LID; diff --git a/diags/src/ibsysstat.c b/diags/src/ibsysstat.c index f7ff994..489e0f9 100644 --- a/diags/src/ibsysstat.c +++ b/diags/src/ibsysstat.c @@ -174,7 +174,7 @@ match_attr(char *str) return IB_CPUINFO_ATTR; return -1; } - + static char * ibsystat(ib_portid_t *portid, int attr) { @@ -361,7 +361,7 @@ main(int argc, char **argv) if (mad_register_client(sysstat_class, 0) < 0) IBERROR("can't register to sysstat class %d", sysstat_class); - + if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0) IBERROR("can't resolve destination port %s", argv[0]); diff --git a/diags/src/ibtracert.c b/diags/src/ibtracert.c index 0400f34..64dbe00 100644 --- a/diags/src/ibtracert.c +++ b/diags/src/ibtracert.c @@ -368,15 +368,15 @@ find_route(ib_portid_t *from, ib_portid_ dump_endnode(dump, 
"To", node, port); return 0; -badport: +badport: IBWARN("Bad port state found: node \"%s\" port %d state %d", node->nodedesc, portnum, port->state); return -1; -badoutport: +badoutport: IBWARN("Bad out port state found: node \"%s\" outport %d state %d", node->nodedesc, outport, port->state); return -1; -badtbl: +badtbl: IBWARN("Bad forwarding table entry found at: node \"%s\" lid entry %d is %d (top %d)", node->nodedesc, to, outport, sw.linearFDBtop); return -1; @@ -476,7 +476,7 @@ switch_mclookup(Node *node, ib_portid_t int maxsets, block, i, set; memset(map, 0, 256); - + if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) return -1; @@ -499,7 +499,7 @@ switch_mclookup(Node *node, ib_portid_t uint16_t mask = ntohs(msets[mlid % 32]); if (mask & (1 << i)) *map = 1; - else + else continue; VERBOSE("Switch guid 0x%Lx: mlid 0x%x is forwarded to port %d", node->nodeguid, mlid + 0xc000, i + set * 16); @@ -583,19 +583,19 @@ find_mcpath(ib_portid_t *from, int mlid) for (i = 1; i <= node->numports; i++) { if (!map[i] || i == node->upport) continue; - + if (dist == 0 && leafport) { if (from->drpath.cnt > 0) path->drpath.cnt--; } else { if (!(port = calloc(1, sizeof(Port)))) IBERROR("out of memory"); - + if (get_port(port, i, path) < 0) { IBWARN("can't reach node %s port %d", portid2str(path), i); return 0; } - + if (port->physstate != 5) { /* LinkUP */ free(port); continue; @@ -608,13 +608,13 @@ find_mcpath(ib_portid_t *from, int mlid) if (extend_dpath(&path->drpath, i) < 0) return 0; } - + if (!(remotenode = calloc(1, sizeof(Node)))) IBERROR("out of memory"); if (!(remoteport = calloc(1, sizeof(Port)))) IBERROR("out of memory"); - + if (get_node(remotenode, remoteport, path) < 0) { IBWARN("NodeInfo on %s port %d failed, skipping port", portid2str(path), i); @@ -623,7 +623,7 @@ find_mcpath(ib_portid_t *from, int mlid) free(remoteport); continue; } - + remotenode->upnode = node; remotenode->upport = remoteport->portnum; remoteport->remoteport = port; @@ -724,7 +724,7 
@@ dump_mcpath(Node *node, int dumplevel) (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), node->nodeguid, node->ports->portnum, node->ports->lid, node->ports->lid + (1 << node->ports->lmc) - 1, - node->nodedesc); + node->nodedesc); #endif } diff --git a/diags/src/saquery.c b/diags/src/saquery.c index 27daee2..e9b7469 100644 --- a/diags/src/saquery.c +++ b/diags/src/saquery.c @@ -110,7 +110,7 @@ print_node_record(ib_node_record_t *node ib_node_info_t *p_ni = NULL; p_ni = &(node_record->node_info); - + switch (node_print_desc) { case LID_ONLY: case UNIQUE_LID_ONLY: diff --git a/diags/src/smpdump.c b/diags/src/smpdump.c index 7eb636b..418cebe 100644 --- a/diags/src/smpdump.c +++ b/diags/src/smpdump.c @@ -234,7 +234,7 @@ main(int argc, char *argv[]) int dump_char = 0, timeout_ms = 1000; int dev_port = 0, mgmt_class = CLASS_SUBN_LID_ROUTE, dlid = 0; char *dev_name = 0; - void *umad; + void *umad; struct drsmp *smp; int i, portid, mod = 0, attr; DRPath path; @@ -299,14 +299,14 @@ main(int argc, char *argv[]) if (mgmt_class == CLASS_SUBN_DIRECTED_ROUTE && str2DRPath(strdupa(argv[0]), &path) < 0) IBPANIC("bad path str '%s'", argv[0]); - + if (mgmt_class == CLASS_SUBN_LID_ROUTE) dlid = strtoul(argv[0], 0, 0); attr = strtoul(argv[1], 0, 0); if (argc > 2) mod = strtoul(argv[2], 0, 0); - + if (umad_init() < 0) IBPANIC("can't init UMAD library"); diff --git a/diags/src/smpquery.c b/diags/src/smpquery.c index 68f9258..2bd315c 100644 --- a/diags/src/smpquery.c +++ b/diags/src/smpquery.c @@ -204,7 +204,7 @@ pkey_table(ib_portid_t *dest, char **arg if (i + 1 == (n + 31) / 32) k = ((n + 7 - i * 32) / 8) * 8; else - k = 32; + k = 32; p = (uint16_t *) data; for (j = 0; j < k; j += 8, p += 8) { printf("%4u: 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x 0x%04x\n", @@ -485,7 +485,7 @@ main(int argc, char **argv) memset(concat, 0, 64); snprintf(concat, sizeof(concat), "%s %s", argv[1], argv[2]); if (ib_resolve_portid_str(&portid, concat, dest_type, sm_id) < 0) - 
IBERROR("can't resolve destination port %s", concat);
+ IBERROR("can't resolve destination port %s", concat);
if ((err = fn(&portid, argv+3, argc-3)))
IBERROR("operation %s: %s", argv[0], err);
}
--
1.4.3.3.g8387

From mst at mellanox.co.il Thu Nov 2 05:13:51 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Nov 2006 15:13:51 +0200
Subject: [openib-general] Static linking with libibverbs
In-Reply-To: <2B4E1E5C-8B34-46E6-B9C4-DC937100BC0C@cisco.com>
References: <20061102081941.GA7468@mellanox.co.il> <2B4E1E5C-8B34-46E6-B9C4-DC937100BC0C@cisco.com>
Message-ID: <20061102131351.GA8885@mellanox.co.il>

Quoting r. Jeff Squyres :
> Yes. See the FAQ items on the OMPI web site from my first mail.

OK, I see.
So what it boils down to, is linking with
-Wl,--whole-archive -libverbs /mthca.a -Wl,--no-whole-archive
Is that right?

But -u openib_driver_init will work as well, won't it?

--
MST

From mst at mellanox.co.il Thu Nov 2 05:19:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 2 Nov 2006 15:19:13 +0200
Subject: [openib-general] [PATCH] use mmiowb after doorbell ring
In-Reply-To:
References:
Message-ID: <20061102131913.GB8885@mellanox.co.il>

Quoting r. Roland Dreier :
> Subject: Re: [PATCH] use mmiowb after doorbell ring
>
> > I just look a quick look at the directory setup and if you are
> > changing things I'd say you should also arrange to have the libibverbs
> > soname stamped into the plugin path and soname. Something like
> > libmthca-libibverbs.2.so.0. Once you do that it is pretty safe
> > to put it in /usr/lib*
>
> That makes sense (although I guess it would be
> libmthca-libibverbs.2.so without the .0, since libmthca is just a
> plugin that doesn't have an independent soname of its own). Then we
> could have each plugin drop a file in /etc/libibverbs.conf.d/ with the
> name -- something like
>
> driver mthca
>
> (and possibly also read $HOME/.libibverbs.conf if desired)
>
> The only two things I need to figure out, I hope with help from
> smarter people:
> - What is the autoconf/automake chicanery needed to make the
> libmthca figure out the right libibverbs soname to stick in the
> name of the .so it installs?
> - And what is the autoconf/automake chicanery needed to fall back to
> having libmthca install plain mthca.so under /usr/lib/infiniband
> when it detects that it is being built against libibverbs 1.0?

By the way, what's up with this project?
It's still planned for libibverbs 1.1, isn't it?

--
MST

From jsquyres at cisco.com Thu Nov 2 05:23:28 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 2 Nov 2006 08:23:28 -0500
Subject: [openib-general] Static linking with libibverbs
In-Reply-To: <20061102131351.GA8885@mellanox.co.il>
References: <20061102081941.GA7468@mellanox.co.il> <2B4E1E5C-8B34-46E6-B9C4-DC937100BC0C@cisco.com> <20061102131351.GA8885@mellanox.co.il>
Message-ID: <7A407559-073A-46B9-BA84-09DCEE64B299@cisco.com>

On Nov 2, 2006, at 8:13 AM, Michael S. Tsirkin wrote:
> Quoting r. Jeff Squyres :
>> Yes. See the FAQ items on the OMPI web site from my first mail.
>
> OK, I see.
> So what it boils down to, is linking with
> -Wl,--whole-archive -libverbs /mthca.a -Wl,--no-whole-archive
> Is that right?

There's a few other details, but this is the Main Point, yes.

> But -u openib_driver_init will work as well, won't it?

I'm not entirely sure -- it might (I didn't try it). It *should*
force creation of a valid code path into mthca.a and therefore use it
for all the resolution that is required (i.e., link in all the parts
of mthca.a that are actually required). What I'm not sure about is
whether the symbols that mthca needs from libibverbs will be linked
in properly (since the linker order is left to right "-libverbs /mthca.a").
I *think* they'll be available from when mthca.a was originally
created (i.e., libibverbs.a was statically linked into mthca.a), but
I don't know if the linker will be smart enough to realize that there
are two copies of some symbols in libibverbs and further to realize
that they are actually duplicates of the same underlying symbol, and
one can be safely eliminated.

It's worth trying (but I don't really care too much :-) ). Every
time I think I understand linkers, we get weirdo cases like this that
make me remember I have no clue how they work. :-)

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

From kliteyn at dev.mellanox.co.il Thu Nov 2 05:24:00 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 02 Nov 2006 15:24:00 +0200
Subject: [openib-general] [PATCH v2] opensm: remove obsolete p_report_buf
In-Reply-To: <20061101184559.GC22655@sashak.voltaire.com>
References: <20061101184559.GC22655@sashak.voltaire.com>
Message-ID: <4549F170.5080707@dev.mellanox.co.il>

Hi Sasha.

Looks good, thanks.

--
Yevgeny

Sasha Khapyorsky wrote:
> This removes obsolete now shared sm->p_report_buf buffer and cleans
> up related code. And also introduces new log function osm_log_printf()
> which currently trivially sends formatted output to stdout.
> > Signed-off-by: Sasha Khapyorsky > --- > osm/include/opensm/osm_base.h | 5 -- > osm/include/opensm/osm_log.h | 3 + > osm/include/opensm/osm_sm.h | 2 - > osm/include/opensm/osm_state_mgr.h | 8 -- > osm/include/opensm/osm_ucast_mgr.h | 5 -- > osm/opensm/libopensm.map | 3 +- > osm/opensm/osm_log.c | 19 +++++ > osm/opensm/osm_mcast_mgr.c | 11 ++-- > osm/opensm/osm_sm.c | 15 +---- > osm/opensm/osm_state_mgr.c | 138 ++++++++++++----------------------- > osm/opensm/osm_ucast_mgr.c | 80 +++++++-------------- > 11 files changed, 104 insertions(+), 185 deletions(-) > > diff --git a/osm/include/opensm/osm_base.h b/osm/include/opensm/osm_base.h > index 57dd4fd..20e2cc3 100644 > --- a/osm/include/opensm/osm_base.h > +++ b/osm/include/opensm/osm_base.h > @@ -714,11 +714,6 @@ typedef enum _osm_state_mgr_mode > * > **********/ > > -#define OSM_REPORT_BUF_SIZE 0x10000 > -#define OSM_REPORT_LINE_SIZE 0x256 > -#define OSM_REPORT_BUF_THRESHOLD (OSM_REPORT_BUF_SIZE / OSM_REPORT_LINE_SIZE) > - > - > /****d* OpenSM: Base/osm_sm_signal_t > * NAME > * osm_sm_signal_t > diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h > index 62f3a0c..6a1a93f 100644 > --- a/osm/include/opensm/osm_log.h > +++ b/osm/include/opensm/osm_log.h > @@ -370,6 +370,9 @@ osm_log_is_active( > * osm_log_destroy > *********/ > > +extern int osm_log_printf(osm_log_t *p_log, osm_log_level_t level, > + const char *fmt, ...); > + > void > osm_log( > IN osm_log_t* const p_log, > diff --git a/osm/include/opensm/osm_sm.h b/osm/include/opensm/osm_sm.h > index bc812f3..05b87ac 100644 > --- a/osm/include/opensm/osm_sm.h > +++ b/osm/include/opensm/osm_sm.h > @@ -178,8 +178,6 @@ typedef struct _osm_sm > osm_vla_rcv_ctrl_t vla_rcv_ctrl; > osm_pkey_rcv_t pkey_rcv; > osm_pkey_rcv_ctrl_t pkey_rcv_ctrl; > - char* p_report_buf; > - > } osm_sm_t; > /* > * FIELDS > diff --git a/osm/include/opensm/osm_state_mgr.h b/osm/include/opensm/osm_state_mgr.h > index ad4afa0..7aaab58 100644 > --- 
a/osm/include/opensm/osm_state_mgr.h > +++ b/osm/include/opensm/osm_state_mgr.h > @@ -121,7 +121,6 @@ typedef struct _osm_state_mgr > cl_qlist_t idle_time_list; > cl_plock_t *p_lock; > cl_event_t *p_subnet_up_event; > - char *p_report_buf; > osm_sm_state_t state; > osm_state_mgr_mode_t state_step_mode; > osm_signal_t next_stage_signal; > @@ -170,9 +169,6 @@ typedef struct _osm_state_mgr > * p_subnet_up_event > * Pointer to the event to set if/when the subnet comes up. > * > -* p_report_buf > -* Pointer to the large log buffer used for user reports. > -* > * state > * State of the SM. > * > @@ -380,7 +376,6 @@ osm_state_mgr_init( > IN const osm_sm_mad_ctrl_t* const p_mad_ctrl, > IN cl_plock_t* const p_lock, > IN cl_event_t* const p_subnet_up_event, > - IN char* const p_report_buf, > IN osm_log_t* const p_log ); > /* > * PARAMETERS > @@ -420,9 +415,6 @@ osm_state_mgr_init( > * p_subnet_up_event > * [in] Pointer to the event to set if/when the subnet comes up. > * > -* p_report_buf > -* [in] Pointer to the large log buffer used for user reports. > -* > * p_log > * [in] Pointer to the log object. > * > diff --git a/osm/include/opensm/osm_ucast_mgr.h b/osm/include/opensm/osm_ucast_mgr.h > index 0fbfc66..1c10abb 100644 > --- a/osm/include/opensm/osm_ucast_mgr.h > +++ b/osm/include/opensm/osm_ucast_mgr.h > @@ -105,7 +105,6 @@ typedef struct _osm_ucast_mgr > osm_req_t *p_req; > osm_log_t *p_log; > cl_plock_t *p_lock; > - char *p_report_buf; > } osm_ucast_mgr_t; > /* > * FIELDS > @@ -204,7 +203,6 @@ osm_ucast_mgr_init( > IN osm_ucast_mgr_t* const p_mgr, > IN osm_req_t* const p_req, > IN osm_subn_t* const p_subn, > - IN char* const p_report_buf, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ); > /* > @@ -218,9 +216,6 @@ osm_ucast_mgr_init( > * p_subn > * [in] Pointer to the Subnet object for this subnet. > * > -* p_report_buf > -* [in] Pointer to the large log buffer used for user reporting. > -* > * p_log > * [in] Pointer to the log object. 
> * > diff --git a/osm/opensm/libopensm.map b/osm/opensm/libopensm.map > index 60d532f..25370b1 100644 > --- a/osm/opensm/libopensm.map > +++ b/osm/opensm/libopensm.map > @@ -1,6 +1,7 @@ > -OPENSM_1.3 { > +OPENSM_1.4 { > global: > osm_log; > + osm_log_printf; > osm_is_debug; > osm_log_init; > osm_log_init_v2; > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c > index 8ac7f8f..c6cc072 100644 > --- a/osm/opensm/osm_log.c > +++ b/osm/opensm/osm_log.c > @@ -109,6 +109,25 @@ static void truncate_log_file(osm_log_t* > } > #endif /* ndef WIN32 */ > > +int osm_log_printf(osm_log_t *p_log, osm_log_level_t level, > + const char *fmt, ...) > +{ > + va_list args; > + int ret; > + > + if (!(p_log->level&level)) > + return 0; > + > + va_start(args, fmt); > + ret = vfprintf(stdout, fmt, args); > + va_end(args); > + > + if (p_log->flush || level&OSM_LOG_ERROR) > + fflush( stdout ); > + > + return ret; > +} > + > void > osm_log( > IN osm_log_t* const p_log, > diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c > index 5a01578..82ef7c3 100644 > --- a/osm/opensm/osm_mcast_mgr.c > +++ b/osm/opensm/osm_mcast_mgr.c > @@ -1382,14 +1382,13 @@ static void > mcast_mgr_dump_sw_routes( > IN const osm_mcast_mgr_t* const p_mgr, > IN const osm_switch_t* const p_sw, > - IN FILE *p_mcfdbFile ) > + IN FILE *file ) > { > osm_mcast_tbl_t* p_tbl; > int16_t mlid_ho = 0; > int16_t mlid_start_ho; > uint8_t position = 0; > int16_t block_num = 0; > - char line[OSM_REPORT_LINE_SIZE]; > boolean_t print_lid; > const osm_node_t* p_node; > uint16_t i, j; > @@ -1404,7 +1403,7 @@ mcast_mgr_dump_sw_routes( > > p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw ); > > - fprintf( p_mcfdbFile, "\nSwitch 0x%016" PRIx64 "\n" > + fprintf( file, "\nSwitch 0x%016" PRIx64 "\n" > "LID : Out Port(s)\n", > cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); > while ( block_num <= p_tbl->max_block_in_use ) > @@ -1415,7 +1414,7 @@ mcast_mgr_dump_sw_routes( > mlid_ho = mlid_start_ho + i; > position = 0; > print_lid 
= FALSE; > - sprintf( line, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); > + fprintf( file, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO ); > while ( position <= p_tbl->max_position ) > { > mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]); > @@ -1428,13 +1427,13 @@ mcast_mgr_dump_sw_routes( > for (j = 0 ; j < 16 ; j++) > { > if ( (1 << j) & mask_entry ) > - sprintf( line, "%s 0x%03X ", line, j+(position*16) ); > + fprintf( file, " 0x%03X ", j+(position*16) ); > } > position++; > } > if (print_lid) > { > - fprintf( p_mcfdbFile, "%s\n", line ); > + fprintf( file, "\n" ); > } > } > block_num++; > diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c > index fef3cac..fb4f759 100644 > --- a/osm/opensm/osm_sm.c > +++ b/osm/opensm/osm_sm.c > @@ -256,9 +256,6 @@ osm_sm_destroy( > cl_event_destroy( &p_sm->signal ); > cl_event_destroy( &p_sm->subnet_up_event ); > > - if( p_sm->p_report_buf != NULL ) > - free( p_sm->p_report_buf ); > - > osm_log( p_sm->p_log, OSM_LOG_SYS, "Exiting SM\n" ); /* Format Waived */ > OSM_LOG_EXIT( p_sm->p_log ); > } > @@ -291,15 +288,6 @@ osm_sm_init( > p_sm->p_disp = p_disp; > p_sm->p_lock = p_lock; > > - p_sm->p_report_buf = malloc( OSM_REPORT_BUF_SIZE ); > - if( p_sm->p_report_buf == NULL ) > - { > - osm_log( p_sm->p_log, OSM_LOG_ERROR, > - "osm_sm_init: ERR 2E09: " > - "Can't allocate report buffer\n" ); > - status = IB_INSUFFICIENT_MEMORY; > - goto Exit; > - } > status = cl_event_init( &p_sm->signal, FALSE ); > if( status != CL_SUCCESS ) > goto Exit; > @@ -385,7 +373,6 @@ osm_sm_init( > status = osm_ucast_mgr_init( &p_sm->ucast_mgr, > &p_sm->req, > p_sm->p_subn, > - p_sm->p_report_buf, > p_sm->p_log, p_sm->p_lock ); > if( status != IB_SUCCESS ) > goto Exit; > @@ -409,7 +396,7 @@ osm_sm_init( > &p_sm->mad_ctrl, > p_sm->p_lock, > &p_sm->subnet_up_event, > - p_sm->p_report_buf, p_sm->p_log ); > + p_sm->p_log ); > if( status != IB_SUCCESS ) > goto Exit; > > diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c > 
index a2efee4..66da6fa 100644 > --- a/osm/opensm/osm_state_mgr.c > +++ b/osm/opensm/osm_state_mgr.c > @@ -118,7 +118,6 @@ osm_state_mgr_init( > IN const osm_sm_mad_ctrl_t * const p_mad_ctrl, > IN cl_plock_t * const p_lock, > IN cl_event_t * const p_subnet_up_event, > - IN char *const p_report_buf, > IN osm_log_t * const p_log ) > { > cl_status_t status; > @@ -136,7 +135,6 @@ osm_state_mgr_init( > CL_ASSERT( p_sm_state_mgr ); > CL_ASSERT( p_mad_ctrl ); > CL_ASSERT( p_lock ); > - CL_ASSERT( p_report_buf ); > > osm_state_mgr_construct( p_mgr ); > > @@ -154,7 +152,6 @@ osm_state_mgr_init( > p_mgr->state = OSM_SM_STATE_IDLE; > p_mgr->p_lock = p_lock; > p_mgr->p_subnet_up_event = p_subnet_up_event; > - p_mgr->p_report_buf = p_report_buf; > p_mgr->state_step_mode = OSM_STATE_STEP_CONTINUOUS; > p_mgr->next_stage_signal = OSM_SIGNAL_NONE; > > @@ -1255,16 +1252,19 @@ __osm_state_mgr_report( > uint8_t port_num; > uint8_t start_port; > uint32_t num_ports; > - char line[OSM_REPORT_LINE_SIZE]; > uint8_t node_type; > - uint32_t line_num = 0; > + > + if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > + return; > > OSM_LOG_ENTER( p_mgr->p_log, __osm_state_mgr_report ); > > - if( !osm_log_is_active( p_mgr->p_log, OSM_LOG_VERBOSE ) ) > - { > - goto Exit; > - } > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, > + "\n===================================================" > + "====================================================" > + "\nVendor : Ty " > + ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " > + " : Neighbor Port (Port #)\n" ); > > p_tbl = &p_mgr->p_subn->port_guid_tbl; > > @@ -1294,29 +1294,16 @@ __osm_state_mgr_report( > num_ports = osm_port_get_num_physp( p_port ); > for( port_num = start_port; port_num < num_ports; port_num++ ) > { > - if( line_num == 0 ) > - { > - strcpy( p_mgr->p_report_buf, > - "\n===================================================" > - "====================================================" ); > - strcat( p_mgr->p_report_buf, > - 
"\nVendor : Ty " > - ": # : Sta : LID : LMC : MTU : LWA : LSA : Port GUID " > - " : Neighbor Port (Port #)\n" ); > - line_num++; > - } > - > p_physp = osm_port_get_phys_ptr( p_port, port_num ); > if( ( p_physp == NULL ) || ( !osm_physp_is_valid( p_physp ) ) ) > continue; > > - sprintf( line, "%s : %s : %02X :", > - osm_get_manufacturer_str( cl_ntoh64 > - ( osm_node_get_node_guid > - ( p_node ) ) ), > - osm_get_node_type_str_fixed_width( node_type ), port_num ); > - > - strcat( p_mgr->p_report_buf, line ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "%s : %s : %02X :", > + osm_get_manufacturer_str( cl_ntoh64 > + ( osm_node_get_node_guid > + ( p_node ) ) ), > + osm_get_node_type_str_fixed_width( node_type ), > + port_num ); > > p_pi = osm_physp_get_port_info_ptr( p_physp ); > > @@ -1324,61 +1311,40 @@ __osm_state_mgr_report( > * Port state is not defined for switch port 0 > */ > if( port_num == 0 ) > - strcat( p_mgr->p_report_buf, " :" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " :" ); > else > - { > - sprintf( line, " %s :", > - osm_get_port_state_str_fixed_width > - ( ib_port_info_get_port_state( p_pi ) ) ); > - strcat( p_mgr->p_report_buf, line ); > - } > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " %s :", > + osm_get_port_state_str_fixed_width > + ( ib_port_info_get_port_state( p_pi ) ) ); > > /* > * LID values are only meaningful in select cases. 
> */ > - if( ib_port_info_get_port_state( p_pi ) != IB_LINK_DOWN ) > - { > - if( ( ( node_type == IB_NODE_TYPE_SWITCH ) && ( port_num == 0 ) ) > - || ( node_type != IB_NODE_TYPE_SWITCH ) ) > - { > - sprintf( line, " %04X : %01X :", > - cl_ntoh16( p_pi->base_lid ), > - ib_port_info_get_lmc( p_pi ) ); > - > - strcat( p_mgr->p_report_buf, line ); > - } > - else > - strcat( p_mgr->p_report_buf, " : :" ); > - } > + if( ib_port_info_get_port_state( p_pi ) != IB_LINK_DOWN > + && ( ( node_type == IB_NODE_TYPE_SWITCH && port_num == 0 ) > + || node_type != IB_NODE_TYPE_SWITCH ) ) > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " %04X : %01X :", > + cl_ntoh16( p_pi->base_lid ), > + ib_port_info_get_lmc( p_pi ) ); > else > - strcat( p_mgr->p_report_buf, " : :" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " : :" ); > > if( port_num != 0 ) > - { > - sprintf( line, " %s : %s : %s ", > - osm_get_mtu_str( ib_port_info_get_neighbor_mtu( p_pi ) ), > - osm_get_lwa_str( p_pi->link_width_active ), > - osm_get_lsa_str( ib_port_info_get_link_speed_active > - ( p_pi ) ) ); > - } > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " %s : %s : %s ", > + osm_get_mtu_str( ib_port_info_get_neighbor_mtu( p_pi ) ), > + osm_get_lwa_str( p_pi->link_width_active ), > + osm_get_lsa_str( ib_port_info_get_link_speed_active > + ( p_pi ) ) ); > else > - { > - sprintf( line, " %s : %s : %s ", " ", " ", " " ); > - } > - strcat( p_mgr->p_report_buf, line ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " : : " ); > > if( osm_physp_get_port_guid( p_physp ) == > p_mgr->p_subn->sm_port_guid ) > - { > - sprintf( line, "* %016" PRIx64 " *", > - cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); > - } > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "* %016" PRIx64 " *", > + cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); > else > - { > - sprintf( line, ": %016" PRIx64 " :", > - cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); > - } > - strcat( p_mgr->p_report_buf, line ); > + 
osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, ": %016" PRIx64 " :", > + cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); > > if( port_num && > ( ib_port_info_get_port_state( p_pi ) != IB_LINK_DOWN ) ) > @@ -1386,36 +1352,26 @@ __osm_state_mgr_report( > p_remote_physp = osm_physp_get_remote( p_physp ); > if( p_remote_physp && osm_physp_is_valid( p_remote_physp ) ) > { > - sprintf( line, " %016" PRIx64 " (%02X)", > - cl_ntoh64( osm_physp_get_port_guid > - ( p_remote_physp ) ), > - osm_physp_get_port_num( p_remote_physp ) ); > - strcat( p_mgr->p_report_buf, line ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, > + " %016" PRIx64 " (%02X)", > + cl_ntoh64( osm_physp_get_port_guid > + ( p_remote_physp ) ), > + osm_physp_get_port_num( p_remote_physp ) ); > } > else > - strcat( p_mgr->p_report_buf, " UNKNOWN" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, " UNKNOWN" ); > } > > - strcat( p_mgr->p_report_buf, "\n" ); > - > - if( ++line_num >= OSM_REPORT_BUF_THRESHOLD ) > - { > - osm_log_raw( p_mgr->p_log, OSM_LOG_VERBOSE, p_mgr->p_report_buf ); > - line_num = 0; > - } > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, "\n" ); > } > - strcat( p_mgr->p_report_buf, > - "------------------------------------------------------" > - "------------------------------------------------\n" ); > + > + osm_log_printf( p_mgr->p_log, OSM_LOG_VERBOSE, > + "------------------------------------------------------" > + "------------------------------------------------\n" ); > p_port = ( osm_port_t * ) cl_qmap_next( &p_port->map_item ); > } > > CL_PLOCK_RELEASE( p_mgr->p_lock ); > - > - if( line_num != 0 ) > - osm_log_raw( p_mgr->p_log, OSM_LOG_VERBOSE, p_mgr->p_report_buf ); > - > - Exit: > OSM_LOG_EXIT( p_mgr->p_log ); > } > > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c > index f1d085c..fc97094 100644 > --- a/osm/opensm/osm_ucast_mgr.c > +++ b/osm/opensm/osm_ucast_mgr.c > @@ -103,7 +103,6 @@ osm_ucast_mgr_init( > IN osm_ucast_mgr_t* const p_mgr, > IN 
osm_req_t* const p_req, > IN osm_subn_t* const p_subn, > - IN char* const p_report_buf, > IN osm_log_t* const p_log, > IN cl_plock_t* const p_lock ) > { > @@ -121,7 +120,6 @@ osm_ucast_mgr_init( > p_mgr->p_subn = p_subn; > p_mgr->p_lock = p_lock; > p_mgr->p_req = p_req; > - p_mgr->p_report_buf = p_report_buf; > > OSM_LOG_EXIT( p_mgr->p_log ); > return( status ); > @@ -184,26 +182,25 @@ __osm_ucast_mgr_dump_path_distribution( > ib_net64_t remote_guid_ho; > osm_switch_t* p_sw = (osm_switch_t *)p_map_item; > osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; > - char line[OSM_REPORT_LINE_SIZE]; > > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_path_distribution ); > > p_node = osm_switch_get_node_ptr( p_sw ); > num_ports = osm_switch_get_num_ports( p_sw ); > > - sprintf( p_mgr->p_report_buf, "__osm_ucast_mgr_dump_path_distribution: " > - "Switch 0x%" PRIx64 "\n" > - "Port : Path Count Through Port", > - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, > + "__osm_ucast_mgr_dump_path_distribution: " > + "Switch 0x%" PRIx64 "\n" > + "Port : Path Count Through Port", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); > > for( i = 0; i < num_ports; i++ ) > { > num_paths = osm_switch_path_count_get( p_sw , i ); > - sprintf( line, "\n %03u : %u", i, num_paths ); > - strcat( p_mgr->p_report_buf, line ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG,"\n %03u : %u", i, num_paths ); > if( i == 0 ) > { > - strcat( p_mgr->p_report_buf, " (switch management port)" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (switch management port)" ); > continue; > } > > @@ -216,26 +213,24 @@ __osm_ucast_mgr_dump_path_distribution( > switch( osm_node_get_remote_type( p_node, i ) ) > { > case IB_NODE_TYPE_SWITCH: > - strcat( p_mgr->p_report_buf, " (link to switch" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to switch" ); > break; > case IB_NODE_TYPE_ROUTER: > - strcat( p_mgr->p_report_buf, " 
(link to router" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to router" ); > break; > case IB_NODE_TYPE_CA: > - strcat( p_mgr->p_report_buf, " (link to CA" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to CA" ); > break; > default: > - strcat( p_mgr->p_report_buf, " (link to unknown node type" ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " (link to unknown node type" ); > break; > } > > - sprintf( line, " 0x%" PRIx64 ")", remote_guid_ho ); > - strcat( p_mgr->p_report_buf, line ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, " 0x%" PRIx64 ")", > + remote_guid_ho ); > } > > - strcat( p_mgr->p_report_buf, "\n" ); > - > - osm_log_raw( p_mgr->p_log, OSM_LOG_ROUTING, p_mgr->p_report_buf ); > + osm_log_printf( p_mgr->p_log, OSM_LOG_DEBUG, "\n" ); > > OSM_LOG_EXIT( p_mgr->p_log ); > } > @@ -254,29 +249,24 @@ __osm_ucast_mgr_dump_ucast_routes( > uint8_t best_port; > uint16_t max_lid_ho; > uint16_t lid_ho; > - uint32_t line_num = 0; > boolean_t ui_ucast_fdb_assign_func_defined; > osm_switch_t* p_sw = (osm_switch_t *)p_map_item; > osm_ucast_mgr_t* p_mgr = ((struct ucast_mgr_dump_context *)cxt)->p_mgr; > - FILE *p_fdbFile = ((struct ucast_mgr_dump_context *)cxt)->file; > - char line[OSM_REPORT_LINE_SIZE]; > - > + FILE *file = ((struct ucast_mgr_dump_context *)cxt)->file; > + > OSM_LOG_ENTER( p_mgr->p_log, __osm_ucast_mgr_dump_ucast_routes ); > > p_node = osm_switch_get_node_ptr( p_sw ); > > max_lid_ho = osm_switch_get_max_lid_ho( p_sw ); > > + fprintf( file, "__osm_ucast_mgr_dump_ucast_routes: " > + "Switch 0x%016" PRIx64 "\n" > + "LID : Port : Hops : Optimal\n", > + cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); > for( lid_ho = 1; lid_ho <= max_lid_ho; lid_ho++ ) > { > - if( line_num == 0 ) > - { > - sprintf( p_mgr->p_report_buf, "__osm_ucast_mgr_dump_ucast_routes: " > - "Switch 0x%016" PRIx64 "\n" > - "LID : Port : Hops : Optimal\n", > - cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); > - line_num++; > - } > + fprintf(file, "0x%04X 
: ", lid_ho); > > port_num = osm_switch_get_port_by_lid( p_sw, lid_ho ); > if( port_num == OSM_NO_PATH ) > @@ -287,9 +277,7 @@ __osm_ucast_mgr_dump_ucast_routes( > will reassign and compress the LID range. The > subnet should work fine either way. > */ > - sprintf( line, "0x%04X : UNREACHABLE\n", lid_ho ); > - strcat( p_mgr->p_report_buf, line ); > - line_num++; > + fprintf( file, "UNREACHABLE\n" ); > continue; > } > /* > @@ -301,19 +289,15 @@ __osm_ucast_mgr_dump_ucast_routes( > num_hops = osm_switch_get_hop_count( p_sw, lid_ho, port_num ); > if( num_hops == OSM_NO_PATH ) > { > - sprintf( line, "0x%04X : UNREACHABLE\n", lid_ho ); > - strcat( p_mgr->p_report_buf, line ); > - line_num++; > + fprintf( file, "UNREACHABLE\n" ); > continue; > } > > best_hops = osm_switch_get_least_hops( p_sw, lid_ho ); > - sprintf( line, "0x%04X : %03u : %02u : ", > - lid_ho, port_num, num_hops ); > - strcat( p_mgr->p_report_buf, line ); > + fprintf( file, "%03u : %02u : ", port_num, num_hops ); > > if( best_hops == num_hops ) > - strcat( p_mgr->p_report_buf, "yes" ); > + fprintf( file, "yes" ); > else > { > if (p_mgr->p_subn->p_osm->routing_engine.ucast_fdb_assign) > @@ -328,23 +312,13 @@ __osm_ucast_mgr_dump_ucast_routes( > p_sw, lid_ho, TRUE, > NULL, NULL, NULL, NULL, /* No LMC Optimization */ > ui_ucast_fdb_assign_func_defined ); > - sprintf( line, "No %u hop path possible via port %u!", > + fprintf( file, "No %u hop path possible via port %u!", > best_hops, best_port ); > - strcat( p_mgr->p_report_buf, line ); > } > > - strcat( p_mgr->p_report_buf, "\n" ); > - > - if( ++line_num >= OSM_REPORT_BUF_THRESHOLD ) > - { > - fprintf(p_fdbFile,"%s",p_mgr->p_report_buf ); > - line_num = 0; > - } > + fprintf( file, "\n" ); > } > > - if( line_num != 0 ) > - fprintf(p_fdbFile,"%s\n",p_mgr->p_report_buf ); > - > OSM_LOG_EXIT( p_mgr->p_log ); > } > From kliteyn at dev.mellanox.co.il Thu Nov 2 05:38:47 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 02 Nov 2006 15:38:47 
+0200 Subject: [openib-general] [PATCH] opensm: strict osm_log arguments/format check In-Reply-To: <20061102105348.GA16559@sashak.voltaire.com> References: <20061102105348.GA16559@sashak.voltaire.com> Message-ID: <4549F4E7.8010606@dev.mellanox.co.il> Hi Sasha. Good catch with those missing arguments. One question: in several places you used cl_hton64() to print guid. Shouldn't there be cl_ntoh64() instead? ...and yes, I know that these two functions are actually the same macro :) Thanks -- Yevgeny Sasha Khapyorsky wrote: > This adds gcc attribute to osm_log() which causes the compiler to check > argument types against a format string. And also there are related fixes > in osm_log() usage in opensm and osmtest. > > Signed-off-by: Sasha Khapyorsky > --- > osm/include/opensm/osm_log.h | 8 +++++++- > osm/libvendor/osm_vendor_ibumad_sa.c | 2 +- > osm/opensm/main.c | 3 ++- > osm/opensm/osm_pkey_mgr.c | 1 + > osm/opensm/osm_port_info_rcv.c | 5 +++-- > osm/opensm/osm_sa_informinfo.c | 4 ++-- > osm/opensm/osm_sa_link_record.c | 8 ++++---- > osm/opensm/osm_sa_mad_ctrl.c | 3 ++- > osm/opensm/osm_sa_response.c | 2 +- > osm/opensm/osm_sm_state_mgr.c | 3 ++- > osm/opensm/osm_sminfo_rcv.c | 9 +++++---- > osm/opensm/osm_state_mgr.c | 8 ++++---- > osm/osmtest/osmt_multicast.c | 12 +++++++----- > osm/osmtest/osmt_service.c | 6 +++--- > osm/osmtest/osmtest.c | 8 ++++---- > 15 files changed, 48 insertions(+), 34 deletions(-) > > diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h > index 62f3a0c..2b24886 100644 > --- a/osm/include/opensm/osm_log.h > +++ b/osm/include/opensm/osm_log.h > @@ -60,6 +60,12 @@ > #include > #include > > +#ifdef __GNUC__ > +#define STRICT_OSM_LOG_FORMAT __attribute__((format(printf, 3, 4))) > +#else > +#define STRICT_OSM_LOG_FORMAT > +#endif > + > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > # define END_C_DECLS } > @@ -374,7 +380,7 @@ void > osm_log( > IN osm_log_t* const p_log, > IN const osm_log_level_t verbosity, > - IN 
const char *p_str, ... ); > + IN const char *p_str, ... ) STRICT_OSM_LOG_FORMAT; > > void > osm_log_raw( > diff --git a/osm/libvendor/osm_vendor_ibumad_sa.c b/osm/libvendor/osm_vendor_ibumad_sa.c > index 7fd0655..7c4a2f7 100644 > --- a/osm/libvendor/osm_vendor_ibumad_sa.c > +++ b/osm/libvendor/osm_vendor_ibumad_sa.c > @@ -853,7 +853,7 @@ osmv_query_sa( > if ( p_mpr_req->sgid_count + p_mpr_req->dgid_count > IB_MULTIPATH_MAX_GIDS ) > { > osm_log( p_log, OSM_LOG_ERROR, > - "osmv_query_sa DBG:001 MULTIPATH_REC ", > + "osmv_query_sa DBG:001 MULTIPATH_REC " > "SGID count %d DGID count %d max count %d\n", > p_mpr_req->sgid_count, p_mpr_req->dgid_count, > IB_MULTIPATH_MAX_GIDS ); > diff --git a/osm/opensm/main.c b/osm/opensm/main.c > index 729702a..752b546 100644 > --- a/osm/opensm/main.c > +++ b/osm/opensm/main.c > @@ -460,7 +460,8 @@ parse_ignore_guids_file(IN char *guids_f > { > osm_log( &p_osm->log, OSM_LOG_ERROR, > "parse_ignore_guids_file: ERR 0601: " > - "Unable to open ignore guids file (%s)\n" ); > + "Unable to open ignore guids file (%s)\n", > + guids_file_name ); > status = IB_ERROR; > goto Exit; > } > diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c > index f2cb221..735dc14 100644 > --- a/osm/opensm/osm_pkey_mgr.c > +++ b/osm/opensm/osm_pkey_mgr.c > @@ -139,6 +139,7 @@ pkey_mgr_process_physical_port( > "pkey_mgr_process_physical_port: ERR 0503: " > "Failed to obtain P_Key 0x%04x block and index for node " > "0x%016" PRIx64 " port %u\n", > + ib_pkey_get_base( pkey ), > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > osm_physp_get_port_num( p_physp ) ); > return; > diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c > index 95112dc..f6d3595 100644 > --- a/osm/opensm/osm_port_info_rcv.c > +++ b/osm/opensm/osm_port_info_rcv.c > @@ -724,8 +724,9 @@ osm_pi_rcv_process( > { > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > "osm_pi_rcv_process: " > - "Got light sweep response from remote port of parent node GUID = 0x%" PRIx64 > - " 
port = %u, Commencing heavy sweep\n", > + "Got light sweep response from remote port of parent node " > + "GUID = 0x%" PRIx64 " port = 0x%016" PRIx64 > + ", Commencing heavy sweep\n", > cl_ntoh64( node_guid ), > cl_ntoh64( port_guid ) ); > osm_state_mgr_process( p_rcv->p_state_mgr, > diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c > index 69dca1d..da96d35 100644 > --- a/osm/opensm/osm_sa_informinfo.c > +++ b/osm/opensm/osm_sa_informinfo.c > @@ -163,8 +163,8 @@ __validate_ports_access_rights( > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__validate_ports_access_rights: ERR 4301: " > - "Invalid port guid: 0x%016\n", > - portguid ); > + "Invalid port guid: 0x%016" PRIx64 "\n", > + cl_hton64(portguid) ); > valid = FALSE; > goto Exit; > } > diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c > index 751023f..0ca9092 100644 > --- a/osm/opensm/osm_sa_link_record.c > +++ b/osm/opensm/osm_sa_link_record.c > @@ -145,10 +145,10 @@ __osm_lr_rcv_build_physp_link( > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_lr_rcv_build_physp_link: ERR 1801: " > "Unable to acquire link record\n" > - "\t\t\t\tFrom port 0x%\n" > - "\t\t\t\tTo port 0x%\n" > - "\t\t\t\tFrom lid 0x%\n" > - "\t\t\t\tTo lid 0x%\n", > + "\t\t\t\tFrom port 0x%u\n" > + "\t\t\t\tTo port 0x%u\n" > + "\t\t\t\tFrom lid 0x%u\n" > + "\t\t\t\tTo lid 0x%u\n", > from_port, to_port, > cl_ntoh16(from_lid), > cl_ntoh16(to_lid) ); > diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c > index cd896b6..208f0d2 100644 > --- a/osm/opensm/osm_sa_mad_ctrl.c > +++ b/osm/opensm/osm_sa_mad_ctrl.c > @@ -132,7 +132,8 @@ __osm_sa_mad_ctrl_process( > "__osm_sa_mad_ctrl_process: " > /* "Responding BUSY status since the dispatcher is already"*/ > "Dropping MAD since the dispatcher is already" > - " overloaded with %u messages and queue time of:%u[msec]\n", > + " overloaded with %u messages and queue time of:" > + "%" PRIu64 "[msec]\n", > num_messages, 
last_dispatched_msg_queue_time_msec ); > > /* send a busy response */ > diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c > index db36ea2..27f4e9d 100644 > --- a/osm/opensm/osm_sa_response.c > +++ b/osm/opensm/osm_sa_response.c > @@ -117,7 +117,7 @@ osm_sa_send_error( > if (osm_exit_flag) > { > osm_log( p_resp->p_log, OSM_LOG_DEBUG, > - "osm_sa_send_error: ", > + "osm_sa_send_error: " > "Ignoring requested send after exit\n" ); > goto Exit; > } > diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c > index aadc43a..1ba5eda 100644 > --- a/osm/opensm/osm_sm_state_mgr.c > +++ b/osm/opensm/osm_sm_state_mgr.c > @@ -247,7 +247,8 @@ __osm_sm_state_mgr_send_master_sm_info_r > { > osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, > "__osm_sm_state_mgr_send_master_sm_info_req: ERR 3203: " > - "No port object for GUID 0x%X\n", p_sm_mgr->master_guid ); > + "No port object for GUID 0x%016" PRIx64 "\n", > + cl_hton64(p_sm_mgr->master_guid) ); > goto Exit; > } > > diff --git a/osm/opensm/osm_sminfo_rcv.c b/osm/opensm/osm_sminfo_rcv.c > index 825b18b..7657e97 100644 > --- a/osm/opensm/osm_sminfo_rcv.c > +++ b/osm/opensm/osm_sminfo_rcv.c > @@ -402,8 +402,8 @@ __osm_sminfo_rcv_process_set_request( > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > "__osm_sminfo_rcv_process_set_request: " > "Received a STANDBY signal. Updating " > - "sm_state_mgr master_guid: 0x%X\n", > - p_rcv_smi->guid ); > + "sm_state_mgr master_guid: 0x%016" PRIx64 "\n", > + cl_hton64(p_rcv_smi->guid) ); > p_rcv->p_sm_state_mgr->master_guid = p_rcv_smi->guid; > } > > @@ -482,8 +482,9 @@ __osm_sminfo_rcv_process_get_sm( > /* we will poll it - as long as it lives - we should be in Standby. */ > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > "__osm_sminfo_rcv_process_get_sm: " > - "Found higher SM. Updating sm_state_mgr master_guid: 0x%X\n", > - p_sm->p_port->guid ); > + "Found higher SM. 
Updating sm_state_mgr master_guid:" > + " 0x%016" PRIx64 "\n", > + cl_hton64(p_sm->p_port->guid) ); > p_rcv->p_sm_state_mgr->master_guid = p_sm->p_port->guid; > } > break; > diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c > index 28e0c4c..ad22d9e 100644 > --- a/osm/opensm/osm_state_mgr.c > +++ b/osm/opensm/osm_state_mgr.c > @@ -481,7 +481,7 @@ __osm_state_mgr_signal_warning( > { > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > "__osm_state_mgr_signal_warning: " > - "Invalid signal %s(%d) in state %s\n", > + "Invalid signal %s(%lu) in state %s\n", > osm_get_sm_signal_str( signal ), > signal, osm_get_sm_state_str( p_mgr->state ) ); > } > @@ -500,7 +500,7 @@ __osm_state_mgr_signal_error( > else > osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_state_mgr_signal_error: ERR 3303: " > - "Invalid signal %s(%d) in state %s\n", > + "Invalid signal %s(%lu) in state %s\n", > osm_get_sm_signal_str( signal ), > signal, osm_get_sm_state_str( p_mgr->state ) ); > } > @@ -1480,8 +1480,8 @@ __osm_state_mgr_exists_other_master_sm( > { > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > "__osm_state_mgr_exists_other_master_sm: " > - "Found remote master SM with guid:0x%X\n", > - p_sm->smi.guid ); > + "Found remote master SM with guid:0x%016" PRIx64 "\n", > + cl_hton64(p_sm->smi.guid) ); > p_sm_res = p_sm; > goto Exit; > } > diff --git a/osm/osmtest/osmt_multicast.c b/osm/osmtest/osmt_multicast.c > index 33a4f47..19f9d37 100644 > --- a/osm/osmtest/osmt_multicast.c > +++ b/osm/osmtest/osmt_multicast.c > @@ -1885,8 +1885,9 @@ osmt_run_mcast_flow( IN osmtest_t * cons > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_mcast_flow: ERR 0209: " > - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", > - p_mc_res->mgid > + "Validating MGID failed. 
MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", > + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), > + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) > ); > status = IB_ERROR; > goto Exit; > @@ -2044,8 +2045,9 @@ osmt_run_mcast_flow( IN osmtest_t * cons > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_mcast_flow: ERR 0212: " > - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", > - p_mc_res->mgid > + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", > + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), > + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) > ); > status = IB_ERROR; > goto Exit; > @@ -3345,7 +3347,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons > /* Delete all MCG that are not of IPoIB */ > osm_log( &p_osmt->log, OSM_LOG_INFO, > "osmt_run_mcast_flow : " > - "Cleanup all MCG that are not IPoIB...\n", cnt ); > + "Cleanup all MCG that are not IPoIB...\n" ); > > p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; > p_mgrp = (osmtest_mgrp_t*)cl_qmap_head( p_mgrp_mlid_tbl ); > diff --git a/osm/osmtest/osmt_service.c b/osm/osmtest/osmt_service.c > index ec9a39e..ab95fec 100644 > --- a/osm/osmtest/osmt_service.c > +++ b/osm/osmtest/osmt_service.c > @@ -1559,7 +1559,7 @@ osmt_run_service_records_flow( IN osmtes > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_service_records_flow: ERR 4A20: " > - "Found service: id: 0x%016 " PRIx64 > + "Found service: id: 0x%016" PRIx64 " " > "that is invalid\n", > id[7] ); > status = IB_ERROR; > @@ -1573,7 +1573,7 @@ osmt_run_service_records_flow( IN osmtes > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_service_records_flow: ERR 4A21: " > - "Fail to find service: id: 0x%016 " PRIx64 > + "Fail to find service: id: 0x%016" PRIx64 " " > "name: %s\n", > id[0], > (char*)service_name[0] ); > @@ -1588,7 +1588,7 @@ osmt_run_service_records_flow( IN osmtes > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_service_records_flow: ERR 4A22: " > - "Fail to find service: id: 0x%016 " PRIx64 > + "Fail to find 
service: id: 0x%016" PRIx64 " " > "name: %s\n", > id[5], > (char*)service_name[6] ); > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c > index 92a4190..a35e0c5 100644 > --- a/osm/osmtest/osmtest.c > +++ b/osm/osmtest/osmtest.c > @@ -2787,7 +2787,8 @@ osmtest_create_inventory_file( IN osmtes > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmtest_create_inventory_file: ERR 0079: " > - "Unable to open inventory file (%s)\n" ); > + "Unable to open inventory file (%s)\n", > + p_osmt->opt.file_name ); > status = IB_ERROR; > goto Exit; > } > @@ -3356,7 +3357,7 @@ osmtest_validate_path_data( IN osmtest_t > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmtest_validate_path_data: ERR 0012: " > "PKEY mismatch on path SLID 0x%X to DLID 0x%X\n" > - "\t\t\t\tExpected 0x%X, received 0x%X\n", > + "\t\t\t\tExpected 0x%" PRIx64 ", received 0x%" PRIx64 "\n", > cl_ntoh16( p_path->rec.slid ), > cl_ntoh16( p_path->rec.dlid ), > cl_ntoh64( p_path->rec.pkey ), cl_ntoh64( p_rec->pkey ) ); > @@ -7165,8 +7166,7 @@ osmtest_bind( IN osmtest_t * p_osmt, > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmtest_bind: ERR 0135: " > - "No local ports. Unable to proceed\n", > - ib_get_err_str( status ) ); > + "No local ports. Unable to proceed\n" ); > goto Exit; > } > guid = attr_array[port_index].port_guid; From kliteyn at dev.mellanox.co.il Thu Nov 2 05:44:24 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 02 Nov 2006 15:44:24 +0200 Subject: [openib-general] {PATCH] OpenSM: Add option for force SDR link speed In-Reply-To: <1162387784.29957.51482.camel@hal.voltaire.com> References: <1162387784.29957.51482.camel@hal.voltaire.com> Message-ID: <4549F638.4070904@dev.mellanox.co.il> Looks good, thanks. -- Yevgeny Hal Rosenstock wrote: > OpenSM: Add option for force SDR link speed > > Add option to opensm.opts to force link speed. Currently, only forcing > to SDR link speed is supported. 
> > Signed-off-by: Hal Rosenstock > > Index: include/opensm/osm_subnet.h > =================================================================== > --- include/opensm/osm_subnet.h (revision 10010) > +++ include/opensm/osm_subnet.h (working copy) > @@ -34,7 +34,6 @@ > * $Id$ > */ > > - > /* > * Abstract: > * Declaration of osm_subn_t. > @@ -238,9 +237,10 @@ typedef struct _osm_subn_opt > uint8_t sm_priority; > uint8_t lmc; > boolean_t lmc_esp0; > - uint8_t max_op_vls; > + uint8_t max_op_vls; > + uint8_t force_link_speed; > boolean_t reassign_lids; > - boolean_t reassign_lfts; > + boolean_t reassign_lfts; > boolean_t ignore_other_sm; > boolean_t single_thread; > boolean_t no_multicast_option; > Index: opensm/osm_subnet.c > =================================================================== > --- opensm/osm_subnet.c (revision 10018) > +++ opensm/osm_subnet.c (working copy) > @@ -452,6 +452,7 @@ osm_subn_set_default_opt( > p_opt->lmc = OSM_DEFAULT_LMC; > p_opt->lmc_esp0 = FALSE; > p_opt->max_op_vls = OSM_DEFAULT_MAX_OP_VLS; > + p_opt->force_link_speed = 0; > p_opt->reassign_lids = FALSE; > p_opt->reassign_lfts = TRUE; > p_opt->ignore_other_sm = FALSE; > @@ -840,6 +841,10 @@ osm_subn_parse_conf_file( > "max_op_vls", > p_key, p_val, &p_opts->max_op_vls); > > + __osm_subn_opts_unpack_uint8( > + "force_link_speed", > + p_key, p_val, &p_opts->force_link_speed); > + > __osm_subn_opts_unpack_boolean( > "reassign_lids", > p_key, p_val, &p_opts->reassign_lids); > @@ -1061,6 +1066,9 @@ osm_subn_write_conf_file( > "leaf_head_of_queue_lifetime 0x%02x\n\n" > "# Limit the maximal operational VLs\n" > "max_op_vls %u\n\n" > + "# Force switch links which are more than SDR capable to \n" > + "# operate at SDR speed\n\n" > + "force_link_speed %u\n\n" > "# The subnet_timeout code that will be set for all the ports\n" > "# The actual timeout is 4.096usec * 2^\n" > "subnet_timeout %u\n\n" > @@ -1081,6 +1089,7 @@ osm_subn_write_conf_file( > p_opts->head_of_queue_lifetime, > 
p_opts->leaf_head_of_queue_lifetime, > p_opts->max_op_vls, > + p_opts->force_link_speed, > p_opts->subnet_timeout, > p_opts->local_phy_errors_threshold, > p_opts->overrun_errors_threshold > Index: opensm/osm_lid_mgr.c > =================================================================== > --- opensm/osm_lid_mgr.c (revision 10010) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -1152,6 +1152,14 @@ __osm_lid_mgr_set_physp_pi( > sizeof(p_pi->link_width_enabled) )) > send_set = TRUE; > > + if ( p_mgr->p_subn->opt.force_link_speed ) > + ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); > + else > + ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); > + if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, > + sizeof(p_pi->link_speed) )) > + send_set = TRUE; > + > /* M_KeyProtectBits are always zero */ > p_pi->mkey_lmc = p_mgr->p_subn->opt.lmc; > /* Check to see if the value we are setting is different than > Index: opensm/osm_link_mgr.c > =================================================================== > --- opensm/osm_link_mgr.c (revision 10010) > +++ opensm/osm_link_mgr.c (working copy) > @@ -310,6 +310,14 @@ __osm_link_mgr_set_physp_pi( > sizeof(p_pi->link_width_enabled) )) > send_set = TRUE; > > + if ( p_mgr->p_subn->opt.force_link_speed ) > + ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); > + else > + ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); > + if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, > + sizeof(p_pi->link_speed) )) > + send_set = TRUE; > + > /* calc new op_vls and mtu */ > op_vls = > osm_physp_calc_link_op_vls( p_mgr->p_log, p_mgr->p_subn, p_physp ); > > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From mst 
at mellanox.co.il Thu Nov 2 05:52:18 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 2 Nov 2006 15:52:18 +0200 Subject: [openib-general] Static linking with libibverbs In-Reply-To: <7A407559-073A-46B9-BA84-09DCEE64B299@cisco.com> References: <7A407559-073A-46B9-BA84-09DCEE64B299@cisco.com> Message-ID: <20061102135218.GA9548@mellanox.co.il> Quoting r. Jeff Squyres : > Subject: Re: Static linking with libibverbs > > On Nov 2, 2006, at 8:13 AM, Michael S. Tsirkin wrote: > > > Quoting r. Jeff Squyres : > >> Yes. See the FAQ items on the OMPI web site from my first mail. > > > > OK, I see. > > So what it boils down to, is linking with > > -Wl,--whole-archive -libverbs /mthca.a -Wl,--no-whole-archive > > Is that right? > > There's a few other details, but this is the Main Point, yes. > > > But -u openib_driver_init will work as well, won't it? > > I'm not entirely sure -- it might (I didn't try it). It *should* > force creation of a valid code path into mthca.a and therefore use it > for all the resolution that is required (i.e., link in all the parts > of mthca.a that are actually required). Since it worked for linking with static mthca.a and libibverbs.a, I think it will link with -static as well. > What I'm not sure about is whether the symbols that mthca needs from > libibverbs will be linked in properly (since the linker order is left > to right "-libverbs /mthca.a"). I *think* they'll be available > from when mthca.a was originally created (i.e., libibverbs.a was > statically linked into mthca.a), Surely not. mthca.a does not include objects from libibverbs.a. > but I don't know if the linker will > be smart enough to realize that there are two copies of some symbols > in libibverbs and further to realize that they are actually > duplicates of the same underlying symbol, and one can be safely > eliminated. It's worth trying (but I don't really care too much :-) ).
> > Every time I think I understand linkers, we get weirdo cases like > this that make me remember I have no clue how they work. :-) Note that .a files are not actually created by the linker. -- MST From swise at opengridcomputing.com Thu Nov 2 05:55:00 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 02 Nov 2006 07:55:00 -0600 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <45492DE3.1070102@ichips.intel.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <1162419236.6366.50.camel@stevo-desktop> <45492DE3.1070102@ichips.intel.com> Message-ID: <1162475700.448.0.camel@stevo-desktop> On Wed, 2006-11-01 at 15:29 -0800, Sean Hefty wrote: > > This patch removes rdma_get/set_option(). Is that what you intended? > > Yes. I wanted to reconsider the approach here. > > I believe that there's a cleaner implementation for getting path records that > involves a userspace SA library/daemon than going through the rdma cm. And no > one was using the option to set a specific path. > > For the CM timeout options, those were added to support uDAPL, but I believe > that a better approach which would accomplish the higher level goal is to have > the kernel rdma cm issue MRA (message received acknowledged) messages for > clients which are slow to respond to requests. > Ok thanks. For my testing, I'll remove this code from uDAPL... Steve. From halr at voltaire.com Thu Nov 2 05:53:31 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 08:53:31 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> Message-ID: <1162475595.29957.109003.camel@hal.voltaire.com> Hi Oliver, On Wed, 2006-11-01 at 16:52, Oliver wrote: > Hi, folks - > > I am trying to verify and evaluate IB QoS support, running openSM as > subnet manager. 
The perftest program is extended to set SL as command > line options instead of default 0, and by modifying VL arbitration > tables, I am expecting to see the traffic shaping can actually take > place, How is this being observed/measured ? > but it did not. More details on configuration: > > in opensm.opts: > # QoS default options > qos_high_limit 255 # disable low priority table This doesn't disable it but it won't be scheduled unless there are no high priority packets to send. > qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 > (corresponding to SL 2) a higher weight 8 > qos_sl2vl 0,1,2,3,4, ... # no changes here > > I think (though not verified) the Voltaire HCA we are using can > support 8 data VLs. Yes, 8 VLs should be supported in your subnet. You can verify this with smpquery portinfo on the HCA port and examine OperVLs assuming the port is ACTIVE. > I don't have much more information to go on why > qos shaping is not taking place, any suggestions? Sasha's email is a good start. We can go from there. > A related question is, if I modify qos setting in SM, do I need to > restart SA on each hosts for it to see the changes? (I am hoping not, > as I tried in the test, it doesn't seem to make a difference) Not sure what you mean. SA is tightly coupled with the OpenSM. Do you mean SA client ? The client hosts don't need restarting but did you restart OpenSM with your QoS configuration ? BTW, which OpenSM are you running ? -- Hal > Thanks for help. 
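As a rough illustration of what the qos_vlarb_high weights quoted above would mean on a congested link, here is a minimal sketch. It assumes an idealized arbiter that serves every busy VL strictly in proportion to its table weight; the real arbiter also has a low-priority table, a high-priority limit and packet granularity, none of which is modeled. The names vlarb_entry, example_tbl and vl_share_permille are hypothetical and are not OpenSM or libibverbs APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a VL arbitration table: "VL:weight", as in
 * qos_vlarb_high 0:4,1:4,2:8,3:0,4:0 from the thread. */
struct vlarb_entry {
    uint8_t vl;
    uint8_t weight;
};

/* The table quoted in the thread (only the entries that were spelled out). */
static const struct vlarb_entry example_tbl[] = {
    { 0, 4 }, { 1, 4 }, { 2, 8 }, { 3, 0 }, { 4, 0 },
};

/*
 * Expected share of link bandwidth, in per mille, for 'vl' when exactly
 * the VLs set in 'active_mask' always have data queued.
 * ASSUMPTION: pure weight-proportional service; this is a model, not a
 * description of any particular HCA or switch.
 */
static unsigned vl_share_permille(const struct vlarb_entry *tbl, size_t n,
                                  unsigned active_mask, uint8_t vl)
{
    unsigned total = 0, mine = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!(active_mask & (1u << tbl[i].vl)))
            continue;               /* idle VLs do not consume weight */
        total += tbl[i].weight;
        if (tbl[i].vl == vl)
            mine += tbl[i].weight;
    }
    return total ? (mine * 1000u) / total : 0;
}
```

Under these assumptions, two saturating flows on VL 0 and VL 2 would split the link roughly 1:2 rather than evenly, so an even split suggests that the traffic is not actually contending at a port where this table is in effect, or that both flows end up on the same VL.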
From halr at voltaire.com Thu Nov 2 06:00:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 09:00:20 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <454922E1.1050002@ornl.gov> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <20061101223010.GF9985@sashak.voltaire.com> <454922E1.1050002@ornl.gov> Message-ID: <1162476001.29957.109286.camel@hal.voltaire.com> Makia, On Wed, 2006-11-01 at 17:42, Makia Minich wrote: > It just so happens that we've started looking at this here at ORNL as > well. I had a question about the options. The manpage makes it seem > that you can set these qos options (e.g. qos_high_limit) from the > command line, but I haven't been overly successful. What are you referring to in the man page ? Which OpenSM are you using (trunk or 1.1 based) ? > Is there an example of this being done? Yes in both the man page under QOS CONFIGURATION or under osm/doc/qos-config.txt in the repository. > Or is changing the /var/cache/osm/opensm.opts file > the preferred method of changing the options? I think it's the only way but it is imperative QoS is enabled for this to have any effect. -- Hal > Sasha Khapyorsky wrote: > > On 16:52 Wed 01 Nov , Oliver wrote: > >> Hi, folks - > >> > >> I am trying to verify and evaluate IB QoS support, running openSM as > >> subnet manager. The perftest program is extended to set SL as command > >> line options instead of default 0, and by modifying VL arbitration > >> tables, I am expecting to see the traffic shaping can actually take > >> place, but it did not. More details on configuration: > >> > >> in opensm.opts: > >> # QoS default options > >> qos_high_limit 255 # disable low priority table > >> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 > >> (corresponding to SL 2) a higher weight 8 > >> qos_sl2vl 0,1,2,3,4, ... # no changes here > >> > >> I think (though not verified) the Voltaire HCA we are using can > >> support 8 data VLs. 
I don't have much more information to go on why > >> qos shaping is not taking place, any suggestions? > > > > You can verify actual port's parameters with smpquery (from diags), you > > will need to run to get QoS related parameters: > > > > smpquery portinfo ... > > smpquery vlarb ... > > smpquery sl2vl ... > > > > Sasha > > > >> A related question is, if I modify qos setting in SM, do I need to > >> restart SA on each hosts for it to see the changes? (I am hoping not, > >> as I tried in the test, it doesn't seem to make a difference) > >> > >> Thanks for help. > >> -- > >> Oliver > >> > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >> > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > From minich at ornl.gov Thu Nov 2 06:15:17 2006 From: minich at ornl.gov (Makia Minich) Date: Thu, 02 Nov 2006 09:15:17 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <1162476001.29957.109286.camel@hal.voltaire.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <20061101223010.GF9985@sashak.voltaire.com> <454922E1.1050002@ornl.gov> <1162476001.29957.109286.camel@hal.voltaire.com> Message-ID: <4549FD75.2080705@ornl.gov> Hal Rosenstock wrote: > Makia, > > On Wed, 2006-11-01 at 17:42, Makia Minich wrote: >> It just so happens that we've started looking at this here at ORNL as >> well. I had a question about the options. The manpage makes it seem >> that you can set these qos options (e.g. qos_high_limit) from the >> command line, but I haven't been overly successful. 
> > What are you referring to in the man page ? OK, re-reading the man page section on qos, I now realize that I didn't understand the statement "cached options file" on my initial read through. So, now I've got it. > Which OpenSM are you using (trunk or 1.1 based) ? 1.1 based >> Is there an example of this being done? > > Yes in both the man page under QOS CONFIGURATION or under > osm/doc/qos-config.txt in the repository. I see that that file doesn't install in the doc directory with OFED, perhaps that should be added (so that I can find it in the ${OFED}/doc directory). >> Or is changing the /var/cache/osm/opensm.opts file >> the preferred method of changing the options? > > I think it's the only way but it is imperative QoS is enabled for this > to have any effect. > > -- Hal That part I've got set in the opensm.opts file: no_qos FALSE >> Sasha Khapyorsky wrote: >>> On 16:52 Wed 01 Nov , Oliver wrote: >>>> Hi, folks - >>>> >>>> I am trying to verify and evaluate IB QoS support, running openSM as >>>> subnet manager. The perftest program is extended to set SL as command >>>> line options instead of default 0, and by modifying VL arbitration >>>> tables, I am expecting to see the traffic shaping can actually take >>>> place, but it did not. More details on configuration: >>>> >>>> in opensm.opts: >>>> # QoS default options >>>> qos_high_limit 255 # disable low priority table >>>> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 >>>> (corresponding to SL 2) a higher weight 8 >>>> qos_sl2vl 0,1,2,3,4, ... # no changes here >>>> >>>> I think (though not verified) the Voltaire HCA we are using can >>>> support 8 data VLs. I don't have much more information to go on why >>>> qos shaping is not taking place, any suggestions? >>> You can verify actual port's parameters with smpquery (from diags), you >>> will need to run to get QoS related parameters: >>> >>> smpquery portinfo ... >>> smpquery vlarb ... >>> smpquery sl2vl ... 
>>> >>> Sasha >>> >>>> A related question is, if I modify qos setting in SM, do I need to >>>> restart SA on each hosts for it to see the changes? (I am hoping not, >>>> as I tried in the test, it doesn't seem to make a difference) >>>> >>>> Thanks for help. >>>> -- >>>> Oliver >>>> >>>> _______________________________________________ >>>> openib-general mailing list >>>> openib-general at openib.org >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>> _______________________________________________ >>> openib-general mailing list >>> openib-general at openib.org >>> http://openib.org/mailman/listinfo/openib-general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > -- Makia Minich National Center for Computation Science Oak Ridge National Laboratory Phone: 865.574.7460 From halr at voltaire.com Thu Nov 2 06:42:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 09:42:57 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <4549FD75.2080705@ornl.gov> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <20061101223010.GF9985@sashak.voltaire.com> <454922E1.1050002@ornl.gov> <1162476001.29957.109286.camel@hal.voltaire.com> <4549FD75.2080705@ornl.gov> Message-ID: <1162478571.29957.110723.camel@hal.voltaire.com> On Thu, 2006-11-02 at 09:15, Makia Minich wrote: > Hal Rosenstock wrote: > > Makia, > > > > On Wed, 2006-11-01 at 17:42, Makia Minich wrote: > >> It just so happens that we've started looking at this here at ORNL as > >> well. I had a question about the options. The manpage makes it seem > >> that you can set these qos options (e.g. qos_high_limit) from the > >> command line, but I haven't been overly successful. > > > > What are you referring to in the man page ? 
> > OK, re-reading the man page section on qos, I now realize that I didn't > understand the statement "cached options file" on my initial read > through. So, now I've got it. > > > Which OpenSM are you using (trunk or 1.1 based) ? > > 1.1 based > > >> Is there an example of this being done? > > > > Yes in both the man page under QOS CONFIGURATION or under > > osm/doc/qos-config.txt in the repository. > > I see that that file doesn't install in the doc directory with OFED, > perhaps that should be added (so that I can find it in the ${OFED}/doc > directory). I used that doc and put it pretty much verbatim into the man page so IMO this is somewhat redundant but it could be added to the next release if you think this adds value (having the separate docs). -- Hal > >> Or is changing the /var/cache/osm/opensm.opts file > >> the preferred method of changing the options? > > > > I think it's the only way but it is imperative QoS is enabled for this > > to have any effect. > > > > -- Hal > > That part I've got set in the opensm.opts file: > > no_qos FALSE > > >> Sasha Khapyorsky wrote: > >>> On 16:52 Wed 01 Nov , Oliver wrote: > >>>> Hi, folks - > >>>> > >>>> I am trying to verify and evaluate IB QoS support, running openSM as > >>>> subnet manager. The perftest program is extended to set SL as command > >>>> line options instead of default 0, and by modifying VL arbitration > >>>> tables, I am expecting to see the traffic shaping can actually take > >>>> place, but it did not. More details on configuration: > >>>> > >>>> in opensm.opts: > >>>> # QoS default options > >>>> qos_high_limit 255 # disable low priority table > >>>> qos_vlarb_high: 0:4,1:4,2:8,3:0, 4:0 .... # this is to give VL 2 > >>>> (corresponding to SL 2) a higher weight 8 > >>>> qos_sl2vl 0,1,2,3,4, ... # no changes here > >>>> > >>>> I think (though not verified) the Voltaire HCA we are using can > >>>> support 8 data VLs. 
I don't have much more information to go on why > >>>> qos shaping is not taking place, any suggestions? > >>> You can verify actual port's parameters with smpquery (from diags), you > >>> will need to run to get QoS related parameters: > >>> > >>> smpquery portinfo ... > >>> smpquery vlarb ... > >>> smpquery sl2vl ... > >>> > >>> Sasha > >>> > >>>> A related question is, if I modify qos setting in SM, do I need to > >>>> restart SA on each hosts for it to see the changes? (I am hoping not, > >>>> as I tried in the test, it doesn't seem to make a difference) > >>>> > >>>> Thanks for help. > >>>> -- > >>>> Oliver > >>>> > >>>> _______________________________________________ > >>>> openib-general mailing list > >>>> openib-general at openib.org > >>>> http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >>>> > >>> _______________________________________________ > >>> openib-general mailing list > >>> openib-general at openib.org > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > >>> > >>> > > > > From ogerlitz at voltaire.com Thu Nov 2 06:58:59 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 02 Nov 2006 16:58:59 +0200 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> <45487B6C.2070408@voltaire.com> Message-ID: <454A07B3.8070104@voltaire.com> Roland Dreier wrote: > Unfortunately I don't think this solves the module unloading race at > all: there is still a window where code in the client module callback > is running, but the callback has dropped all references etc. so the > client module will happily proceed to unload. 
> > > At the bottom line, users must call xxx_destory_id() explicitly the > > xxx module would be able to handle in_callback situations. > > I think this is actually a good point for the CM case at least. > Clients already have something registered with the CM (namely the CM > ID itself), so if we required all consumers to destroy their IDs > explicitly, then there's no reason to add additional client > registration. I agree. This applies also to the rdma cm. I think that, as others pointed out, the case of new IDs generated by the cm / rdma cm for incoming connection requests might be an exception, but let's first decide that this is the only case we need to solve, and when the time comes, discuss how to do that. As for client registration with the ib_mad, ib_sa and ib_addr modules, I understand the first two were already implemented... and now Sean wants to add it also for the ib_addr module. Now, this module does not have IDs, so we can either add them or implement the registration... let it be whatever Sean prefers, I just think we should not take it to the cm and rdma cm level. Or. From python152 at gmail.com Thu Nov 2 07:20:06 2006 From: python152 at gmail.com (Oliver) Date: Thu, 2 Nov 2006 10:20:06 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <1162475595.29957.109003.camel@hal.voltaire.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <1162475595.29957.109003.camel@hal.voltaire.com> Message-ID: <23e627a30611020720i4a268098h3bf1549621e16f0@mail.gmail.com> Hi, Hal - > How is this being observed/measured ? Hosts A and B, with 4x DDR, are both connected to a Flextronics switch. A single process of ibv_read_bw gives about 1415 MB/s average bandwidth. Two concurrent processes report 714.45 MB/s each, dead even. Now if I bump up one process with a different SL, then I expect to see shaping take place. Please let me know if the scenario makes sense. > Yes, 8 VLs should be supported in your subnet.
You can verify this with > smpquery portinfo on the HCA port and examine OperVLs assuming the port > is ACTIVE. yes, I verified the data VL support, it is 8. I will poke for more info with suggested commands by Sasha. > > A related question is, if I modify qos setting in SM, do I need to > > restart SA on each hosts for it to see the changes? (I am hoping not, > > as I tried in the test, it doesn't seem to make a difference) > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you > mean SA client ? The client hosts don't need restarting but did you > restart OpenSM with your QoS configuration ? I mean client SA. yes, I understand OpenSM needs to be restarted. > BTW, which OpenSM are you running ? OFED 1.1 based. thanks - Oliver From ogerlitz at voltaire.com Thu Nov 2 07:20:54 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 02 Nov 2006 17:20:54 +0200 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <4540DE9B.7070900@ichips.intel.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <4540CA0E.9020807@voltaire.com> <4540DE9B.7070900@ichips.intel.com> Message-ID: <454A0CD6.3090703@voltaire.com> Sean Hefty wrote: >> 1) librdmacm does not get built against libibverbs-1.0 (see below) so >> i am using libibverbs (ie the non released yet libibverbs1.1) > I need to think about what we can do here. The librdmacm uses > functionality not found in libibverbs-1.0. Have you looked on that? from the compilation failure against libibverbs-1.0 the gap seem pretty small. If indeed this is the case, since libibverbs-1.1 is in development lets check with Roland if it makes sense for him to support these small-gap-features in libibverbs-1.0.X, i guess what matters here is ABI versions... 
If it is not possible, maybe we can somehow instrument the code of librdmacm to do well with libibverbs-1.0.Y. If this is not possible either, I guess the way to use librdmacm for the time being is against a devel drop of libibverbs-1.1 as I am doing now. >> 2) the cma rdma multicast does not let a consumer to join as send-only > This would require some sort of change to the API and ABI, so if this is > needed, I'd like to incorporate this now. (Adding it could be done by > specifying join parameters.) Do we need/want this level of control in > the librdmacm, or should users go to a direct IB interface for this? I think we do want it. The rdma cm provides the means to offload ip multicast to ib multicast through registration (join/leave etc) with the ib_sa module. IP Multicast does use the send-only feature and hence IP Multicast offloading apps need it as well. The rdma cm framework fits very well for such apps and the ib_usa (which does not exist now, and I am not sure it needs to exist... it was a project of a summer student with open-mpi that required that...) not. Currently, librdmacm does not have the means to distinguish between sender and receiver, so it joins the sender as a full member and attaches its QP to this group MGID. First, this hurts performance; second, it might cause the sender's CQ to receive the posts as well (I am not sure here), which can make it go crazy... Or. From sashak at voltaire.com Thu Nov 2 07:26:52 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 17:26:52 +0200 Subject: [openib-general] [PATCH] opensm: strict osm_log arguments/format check In-Reply-To: <4549F4E7.8010606@dev.mellanox.co.il> References: <20061102105348.GA16559@sashak.voltaire.com> <4549F4E7.8010606@dev.mellanox.co.il> Message-ID: <20061102152652.GA17244@sashak.voltaire.com> On 15:38 Thu 02 Nov , Yevgeny Kliteynik wrote: > Hi Sasha. > > Good catch with those missing arguments. It is compiler...
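The compiler check referred to here is gcc's `format` function attribute, which the patch under discussion attaches to osm_log(). The following is a minimal sketch of the same technique, not OpenSM code: `demo_log_msg()`, `struct demo_log` and `STRICT_DEMO_LOG_FORMAT` are hypothetical names, and the argument positions (format string third, variadic arguments from the fourth) are chosen only to mirror the `format(printf, 3, 4)` in the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdio.h>

/* Same guard pattern as the patch: the attribute is a gcc extension. */
#ifdef __GNUC__
#define STRICT_DEMO_LOG_FORMAT __attribute__((format(printf, 3, 4)))
#else
#define STRICT_DEMO_LOG_FORMAT
#endif

/* Hypothetical log object; it only remembers the last formatted message. */
struct demo_log {
	char last[128];
};

/* The attribute goes on the declaration, not on the definition below. */
int demo_log_msg(struct demo_log *log, int verbosity,
		 const char *fmt, ...) STRICT_DEMO_LOG_FORMAT;

int demo_log_msg(struct demo_log *log, int verbosity, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(log != NULL);
	(void)verbosity;	/* unused in this sketch */

	va_start(ap, fmt);
	n = vsnprintf(log->last, sizeof(log->last), fmt, ap);
	va_end(ap);
	return n;
}

/*
 * With -Wformat (part of -Wall), gcc should now diagnose calls such as
 *
 *	demo_log_msg(&log, 0, "Unable to open file (%s)\n");
 *	demo_log_msg(&log, 0, "GUID 0x%X\n", (uint64_t)guid);
 *
 * i.e. a missing argument and a 64-bit value printed with a 32-bit
 * conversion, which are the kinds of mistakes the patch fixes.
 */
```

A well-formed call looks like `demo_log_msg(&log, 0, "GUID 0x%016" PRIx64 " port %u", guid, port)`, where `guid` is a `uint64_t` in host byte order and `port` is an `unsigned int`.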
> One question: in several places you used cl_hton64() to print guid. > Shouldn't there be cl_ntoh64() instead? Right, it is mistake. Thanks for catching. Will resend. Sasha > > ...and yes, I know that these two functions are actually the same macro :) > > Thanks > > -- Yevgeny > > > Sasha Khapyorsky wrote: > > This adds gcc attribute to osm_log() which causes the compiler to check > > argument types against a format string. And also there are related fixes > > in osm_log() usage in opensm and osmtest. > > > > Signed-off-by: Sasha Khapyorsky > > --- > > osm/include/opensm/osm_log.h | 8 +++++++- > > osm/libvendor/osm_vendor_ibumad_sa.c | 2 +- > > osm/opensm/main.c | 3 ++- > > osm/opensm/osm_pkey_mgr.c | 1 + > > osm/opensm/osm_port_info_rcv.c | 5 +++-- > > osm/opensm/osm_sa_informinfo.c | 4 ++-- > > osm/opensm/osm_sa_link_record.c | 8 ++++---- > > osm/opensm/osm_sa_mad_ctrl.c | 3 ++- > > osm/opensm/osm_sa_response.c | 2 +- > > osm/opensm/osm_sm_state_mgr.c | 3 ++- > > osm/opensm/osm_sminfo_rcv.c | 9 +++++---- > > osm/opensm/osm_state_mgr.c | 8 ++++---- > > osm/osmtest/osmt_multicast.c | 12 +++++++----- > > osm/osmtest/osmt_service.c | 6 +++--- > > osm/osmtest/osmtest.c | 8 ++++---- > > 15 files changed, 48 insertions(+), 34 deletions(-) > > > > diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h > > index 62f3a0c..2b24886 100644 > > --- a/osm/include/opensm/osm_log.h > > +++ b/osm/include/opensm/osm_log.h > > @@ -60,6 +60,12 @@ > > #include > > #include > > > > +#ifdef __GNUC__ > > +#define STRICT_OSM_LOG_FORMAT __attribute__((format(printf, 3, 4))) > > +#else > > +#define STRICT_OSM_LOG_FORMAT > > +#endif > > + > > #ifdef __cplusplus > > # define BEGIN_C_DECLS extern "C" { > > # define END_C_DECLS } > > @@ -374,7 +380,7 @@ void > > osm_log( > > IN osm_log_t* const p_log, > > IN const osm_log_level_t verbosity, > > - IN const char *p_str, ... ); > > + IN const char *p_str, ... 
) STRICT_OSM_LOG_FORMAT; > > > > void > > osm_log_raw( > > diff --git a/osm/libvendor/osm_vendor_ibumad_sa.c b/osm/libvendor/osm_vendor_ibumad_sa.c > > index 7fd0655..7c4a2f7 100644 > > --- a/osm/libvendor/osm_vendor_ibumad_sa.c > > +++ b/osm/libvendor/osm_vendor_ibumad_sa.c > > @@ -853,7 +853,7 @@ osmv_query_sa( > > if ( p_mpr_req->sgid_count + p_mpr_req->dgid_count > IB_MULTIPATH_MAX_GIDS ) > > { > > osm_log( p_log, OSM_LOG_ERROR, > > - "osmv_query_sa DBG:001 MULTIPATH_REC ", > > + "osmv_query_sa DBG:001 MULTIPATH_REC " > > "SGID count %d DGID count %d max count %d\n", > > p_mpr_req->sgid_count, p_mpr_req->dgid_count, > > IB_MULTIPATH_MAX_GIDS ); > > diff --git a/osm/opensm/main.c b/osm/opensm/main.c > > index 729702a..752b546 100644 > > --- a/osm/opensm/main.c > > +++ b/osm/opensm/main.c > > @@ -460,7 +460,8 @@ parse_ignore_guids_file(IN char *guids_f > > { > > osm_log( &p_osm->log, OSM_LOG_ERROR, > > "parse_ignore_guids_file: ERR 0601: " > > - "Unable to open ignore guids file (%s)\n" ); > > + "Unable to open ignore guids file (%s)\n", > > + guids_file_name ); > > status = IB_ERROR; > > goto Exit; > > } > > diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c > > index f2cb221..735dc14 100644 > > --- a/osm/opensm/osm_pkey_mgr.c > > +++ b/osm/opensm/osm_pkey_mgr.c > > @@ -139,6 +139,7 @@ pkey_mgr_process_physical_port( > > "pkey_mgr_process_physical_port: ERR 0503: " > > "Failed to obtain P_Key 0x%04x block and index for node " > > "0x%016" PRIx64 " port %u\n", > > + ib_pkey_get_base( pkey ), > > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > > osm_physp_get_port_num( p_physp ) ); > > return; > > diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c > > index 95112dc..f6d3595 100644 > > --- a/osm/opensm/osm_port_info_rcv.c > > +++ b/osm/opensm/osm_port_info_rcv.c > > @@ -724,8 +724,9 @@ osm_pi_rcv_process( > > { > > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > > "osm_pi_rcv_process: " > > - "Got light sweep response from 
remote port of parent node GUID = 0x%" PRIx64 > > - " port = %u, Commencing heavy sweep\n", > > + "Got light sweep response from remote port of parent node " > > + "GUID = 0x%" PRIx64 " port = 0x%016" PRIx64 > > + ", Commencing heavy sweep\n", > > cl_ntoh64( node_guid ), > > cl_ntoh64( port_guid ) ); > > osm_state_mgr_process( p_rcv->p_state_mgr, > > diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c > > index 69dca1d..da96d35 100644 > > --- a/osm/opensm/osm_sa_informinfo.c > > +++ b/osm/opensm/osm_sa_informinfo.c > > @@ -163,8 +163,8 @@ __validate_ports_access_rights( > > { > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__validate_ports_access_rights: ERR 4301: " > > - "Invalid port guid: 0x%016\n", > > - portguid ); > > + "Invalid port guid: 0x%016" PRIx64 "\n", > > + cl_hton64(portguid) ); > > valid = FALSE; > > goto Exit; > > } > > diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c > > index 751023f..0ca9092 100644 > > --- a/osm/opensm/osm_sa_link_record.c > > +++ b/osm/opensm/osm_sa_link_record.c > > @@ -145,10 +145,10 @@ __osm_lr_rcv_build_physp_link( > > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > > "__osm_lr_rcv_build_physp_link: ERR 1801: " > > "Unable to acquire link record\n" > > - "\t\t\t\tFrom port 0x%\n" > > - "\t\t\t\tTo port 0x%\n" > > - "\t\t\t\tFrom lid 0x%\n" > > - "\t\t\t\tTo lid 0x%\n", > > + "\t\t\t\tFrom port 0x%u\n" > > + "\t\t\t\tTo port 0x%u\n" > > + "\t\t\t\tFrom lid 0x%u\n" > > + "\t\t\t\tTo lid 0x%u\n", > > from_port, to_port, > > cl_ntoh16(from_lid), > > cl_ntoh16(to_lid) ); > > diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c > > index cd896b6..208f0d2 100644 > > --- a/osm/opensm/osm_sa_mad_ctrl.c > > +++ b/osm/opensm/osm_sa_mad_ctrl.c > > @@ -132,7 +132,8 @@ __osm_sa_mad_ctrl_process( > > "__osm_sa_mad_ctrl_process: " > > /* "Responding BUSY status since the dispatcher is already"*/ > > "Dropping MAD since the dispatcher is already" > > - " overloaded with %u 
messages and queue time of:%u[msec]\n", > > + " overloaded with %u messages and queue time of:" > > + "%" PRIu64 "[msec]\n", > > num_messages, last_dispatched_msg_queue_time_msec ); > > > > /* send a busy response */ > > diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c > > index db36ea2..27f4e9d 100644 > > --- a/osm/opensm/osm_sa_response.c > > +++ b/osm/opensm/osm_sa_response.c > > @@ -117,7 +117,7 @@ osm_sa_send_error( > > if (osm_exit_flag) > > { > > osm_log( p_resp->p_log, OSM_LOG_DEBUG, > > - "osm_sa_send_error: ", > > + "osm_sa_send_error: " > > "Ignoring requested send after exit\n" ); > > goto Exit; > > } > > diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c > > index aadc43a..1ba5eda 100644 > > --- a/osm/opensm/osm_sm_state_mgr.c > > +++ b/osm/opensm/osm_sm_state_mgr.c > > @@ -247,7 +247,8 @@ __osm_sm_state_mgr_send_master_sm_info_r > > { > > osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, > > "__osm_sm_state_mgr_send_master_sm_info_req: ERR 3203: " > > - "No port object for GUID 0x%X\n", p_sm_mgr->master_guid ); > > + "No port object for GUID 0x%016" PRIx64 "\n", > > + cl_hton64(p_sm_mgr->master_guid) ); > > goto Exit; > > } > > > > diff --git a/osm/opensm/osm_sminfo_rcv.c b/osm/opensm/osm_sminfo_rcv.c > > index 825b18b..7657e97 100644 > > --- a/osm/opensm/osm_sminfo_rcv.c > > +++ b/osm/opensm/osm_sminfo_rcv.c > > @@ -402,8 +402,8 @@ __osm_sminfo_rcv_process_set_request( > > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > > "__osm_sminfo_rcv_process_set_request: " > > "Received a STANDBY signal. Updating " > > - "sm_state_mgr master_guid: 0x%X\n", > > - p_rcv_smi->guid ); > > + "sm_state_mgr master_guid: 0x%016" PRIx64 "\n", > > + cl_hton64(p_rcv_smi->guid) ); > > p_rcv->p_sm_state_mgr->master_guid = p_rcv_smi->guid; > > } > > > > @@ -482,8 +482,9 @@ __osm_sminfo_rcv_process_get_sm( > > /* we will poll it - as long as it lives - we should be in Standby. 
*/ > > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > > "__osm_sminfo_rcv_process_get_sm: " > > - "Found higher SM. Updating sm_state_mgr master_guid: 0x%X\n", > > - p_sm->p_port->guid ); > > + "Found higher SM. Updating sm_state_mgr master_guid:" > > + " 0x%016" PRIx64 "\n", > > + cl_hton64(p_sm->p_port->guid) ); > > p_rcv->p_sm_state_mgr->master_guid = p_sm->p_port->guid; > > } > > break; > > diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c > > index 28e0c4c..ad22d9e 100644 > > --- a/osm/opensm/osm_state_mgr.c > > +++ b/osm/opensm/osm_state_mgr.c > > @@ -481,7 +481,7 @@ __osm_state_mgr_signal_warning( > > { > > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > > "__osm_state_mgr_signal_warning: " > > - "Invalid signal %s(%d) in state %s\n", > > + "Invalid signal %s(%lu) in state %s\n", > > osm_get_sm_signal_str( signal ), > > signal, osm_get_sm_state_str( p_mgr->state ) ); > > } > > @@ -500,7 +500,7 @@ __osm_state_mgr_signal_error( > > else > > osm_log( p_mgr->p_log, OSM_LOG_ERROR, > > "__osm_state_mgr_signal_error: ERR 3303: " > > - "Invalid signal %s(%d) in state %s\n", > > + "Invalid signal %s(%lu) in state %s\n", > > osm_get_sm_signal_str( signal ), > > signal, osm_get_sm_state_str( p_mgr->state ) ); > > } > > @@ -1480,8 +1480,8 @@ __osm_state_mgr_exists_other_master_sm( > > { > > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > > "__osm_state_mgr_exists_other_master_sm: " > > - "Found remote master SM with guid:0x%X\n", > > - p_sm->smi.guid ); > > + "Found remote master SM with guid:0x%016" PRIx64 "\n", > > + cl_hton64(p_sm->smi.guid) ); > > p_sm_res = p_sm; > > goto Exit; > > } > > diff --git a/osm/osmtest/osmt_multicast.c b/osm/osmtest/osmt_multicast.c > > index 33a4f47..19f9d37 100644 > > --- a/osm/osmtest/osmt_multicast.c > > +++ b/osm/osmtest/osmt_multicast.c > > @@ -1885,8 +1885,9 @@ osmt_run_mcast_flow( IN osmtest_t * cons > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmt_run_mcast_flow: ERR 0209: " > > - "Validating MGID failed. 
MGID:0x%016" PRIx64 "\n", > > - p_mc_res->mgid > > + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", > > + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), > > + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) > > ); > > status = IB_ERROR; > > goto Exit; > > @@ -2044,8 +2045,9 @@ osmt_run_mcast_flow( IN osmtest_t * cons > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmt_run_mcast_flow: ERR 0212: " > > - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", > > - p_mc_res->mgid > > + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", > > + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), > > + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) > > ); > > status = IB_ERROR; > > goto Exit; > > @@ -3345,7 +3347,7 @@ osmt_run_mcast_flow( IN osmtest_t * cons > > /* Delete all MCG that are not of IPoIB */ > > osm_log( &p_osmt->log, OSM_LOG_INFO, > > "osmt_run_mcast_flow : " > > - "Cleanup all MCG that are not IPoIB...\n", cnt ); > > + "Cleanup all MCG that are not IPoIB...\n" ); > > > > p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; > > p_mgrp = (osmtest_mgrp_t*)cl_qmap_head( p_mgrp_mlid_tbl ); > > diff --git a/osm/osmtest/osmt_service.c b/osm/osmtest/osmt_service.c > > index ec9a39e..ab95fec 100644 > > --- a/osm/osmtest/osmt_service.c > > +++ b/osm/osmtest/osmt_service.c > > @@ -1559,7 +1559,7 @@ osmt_run_service_records_flow( IN osmtes > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmt_run_service_records_flow: ERR 4A20: " > > - "Found service: id: 0x%016 " PRIx64 > > + "Found service: id: 0x%016" PRIx64 " " > > "that is invalid\n", > > id[7] ); > > status = IB_ERROR; > > @@ -1573,7 +1573,7 @@ osmt_run_service_records_flow( IN osmtes > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmt_run_service_records_flow: ERR 4A21: " > > - "Fail to find service: id: 0x%016 " PRIx64 > > + "Fail to find service: id: 0x%016" PRIx64 " " > > "name: %s\n", > > id[0], > > (char*)service_name[0] ); > > @@ -1588,7 +1588,7 @@ 
osmt_run_service_records_flow( IN osmtes > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmt_run_service_records_flow: ERR 4A22: " > > - "Fail to find service: id: 0x%016 " PRIx64 > > + "Fail to find service: id: 0x%016" PRIx64 " " > > "name: %s\n", > > id[5], > > (char*)service_name[6] ); > > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c > > index 92a4190..a35e0c5 100644 > > --- a/osm/osmtest/osmtest.c > > +++ b/osm/osmtest/osmtest.c > > @@ -2787,7 +2787,8 @@ osmtest_create_inventory_file( IN osmtes > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmtest_create_inventory_file: ERR 0079: " > > - "Unable to open inventory file (%s)\n" ); > > + "Unable to open inventory file (%s)\n", > > + p_osmt->opt.file_name ); > > status = IB_ERROR; > > goto Exit; > > } > > @@ -3356,7 +3357,7 @@ osmtest_validate_path_data( IN osmtest_t > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmtest_validate_path_data: ERR 0012: " > > "PKEY mismatch on path SLID 0x%X to DLID 0x%X\n" > > - "\t\t\t\tExpected 0x%X, received 0x%X\n", > > + "\t\t\t\tExpected 0x%" PRIx64 ", received 0x%" PRIx64 "\n", > > cl_ntoh16( p_path->rec.slid ), > > cl_ntoh16( p_path->rec.dlid ), > > cl_ntoh64( p_path->rec.pkey ), cl_ntoh64( p_rec->pkey ) ); > > @@ -7165,8 +7166,7 @@ osmtest_bind( IN osmtest_t * p_osmt, > > { > > osm_log( &p_osmt->log, OSM_LOG_ERROR, > > "osmtest_bind: ERR 0135: " > > - "No local ports. Unable to proceed\n", > > - ib_get_err_str( status ) ); > > + "No local ports. Unable to proceed\n" ); > > goto Exit; > > } > > guid = attr_array[port_index].port_guid; From rkuchimanchi at silverstorm.com Thu Nov 2 07:40:46 2006 From: rkuchimanchi at silverstorm.com (Ramachandra K) Date: Thu, 02 Nov 2006 21:10:46 +0530 Subject: [openib-general] fixing sparse warnings ? 
Message-ID: <454A117E.2090105@silverstorm.com> I've been searching for a while but can't seem to find any pointers on how to fix sparse warnings (like cast to restricted type etc) or in general making code sparse check safe. I would appreciate it if someone could point me in the right direction. Regards, Ram From halr at voltaire.com Thu Nov 2 07:49:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 10:49:04 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <23e627a30611020720i4a268098h3bf1549621e16f0@mail.gmail.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <1162475595.29957.109003.camel@hal.voltaire.com> <23e627a30611020720i4a268098h3bf1549621e16f0@mail.gmail.com> Message-ID: <1162482544.15232.585.camel@hal.voltaire.com> Hi Oliver, On Thu, 2006-11-02 at 10:20, Oliver wrote: > Hi, Hal - > > > How is this being observed/measured ? > > Host A, B, with 4x DDR both connected to Flextronic switch. > A single process of ibv_read_bw gives about 1415 MB/s average > bandwidth. Two concurrent processes report 714.45 MB/s each, dead even. > Now if I bump up one process with a different SL, then I expect to see > shaping to take place. Please let me know if the scenario makes sense. It makes sense. However, if the higher priority traffic does not fill the scheduling, the low priority can take up the slack so I'm not sure if this is what you are seeing or something else. It might be interesting to try the same thing at SDR speeds. -- Hal > > Yes, 8 VLs should be supported in your subnet. You can verify this with > > smpquery portinfo on the HCA port and examine OperVLs assuming the port > > is ACTIVE. > > yes, I verified the data VL support, it is 8. I will poke for more > info with suggested commands by Sasha. > > > > A related question is, if I modify qos setting in SM, do I need to > > > restart SA on each hosts for it to see the changes?
(I am hoping not, > > > as I tried in the test, it doesn't seem to make a difference) > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you > > mean SA client ? The client hosts don't need restarting but did you > > restart OpenSM with your QoS configuration ? > > I mean client SA. yes, I understand OpenSM needs to be restarted. > > > BTW, which OpenSM are you running ? > > OFED 1.1 based. > > thanks > > - Oliver From Brian.Cain at ge.com Thu Nov 2 08:02:29 2006 From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare)) Date: Thu, 2 Nov 2006 11:02:29 -0500 Subject: [openib-general] Mellanox SRP target implementation In-Reply-To: <4549AC32.7020003@mellanox.com> Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB033015A7395@CINMLVEM11.e2k.ad.ge.com> > -----Original Message----- > From: openib-general-bounces at openib.org > [mailto:openib-general-bounces at openib.org] On Behalf Of Vu Pham > Sent: Thursday, November 02, 2006 2:29 AM > To: Tomoaki Sato > Cc: openib-general at openib.org > Subject: Re: [openib-general] Mellanox SRP target implementation > > Tomoaki, > > > > > Can anybody tell me about the mellanox "SRP target" > implementation code which is included in MTD2000 with > NFS-RDMA server ? > > Is this gen2 base ? > > > > *srp target* is still on gen1 code base - IBGD > > *nfs-rdma server* is on gen2 code base Any chance the MTD2000 runs openfiler? -Brian From bos at pathscale.com Thu Nov 2 08:33:39 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 02 Nov 2006 08:33:39 -0800 Subject: [openib-general] fixing sparse warnings ? In-Reply-To: <454A117E.2090105@silverstorm.com> References: <454A117E.2090105@silverstorm.com> Message-ID: <454A1DE3.3090006@pathscale.com> Ramachandra K wrote: > I've been searching for a while but cant seem to find any pointers on > how to fix sparse warnings (like cast to restricted type etc) or in general > making code sparse check safe. 
Add annotations to the data types that you're using, and make them consistent. For example, if you have a function that takes a u16, and you pass in a __le16, you need to decide whether it's the function or the caller that needs fixing. And you then need to propagate out those annotations until all of your sources of problems have gone away. It's a very simple process. For more information, do a Google search for "sparse site:lwn.net". References: <4547308F.2030708@voltaire.com> <20061031115017.GF2387@mellanox.co.il> <454746A8.1040604@voltaire.com> <45477E81.3040205@ichips.intel.com> <45487B6C.2070408@voltaire.com> <45492B43.5010408@ichips.intel.com> <4549307E.9060200@ichips.intel.com> <20061102081453.GA7247@mellanox.co.il> Message-ID: <454A24E4.8060001@ichips.intel.com> > All active side users are fine I think. But any client on the passive side > currently might destroy the new ID by returning error from the callback, and I > like this interface since it frees the resources immediately. As long as only *newly* created (i.e. associated with a connection request) cm_id's are destroyed this way, we're fine. Newly created cm_id's are associated with a listening cm_id. Destruction of the listening cm_id is blocked while a callback for a connection request is in progress. > Since all such passive side users currently are out of tree, I don't think > it's urgent for us to do anything about the passive side race - but please do > not at least break code that uses passive side in major ways just yet. I use the callback method of destruction for new cm_id's in the ucm and ucma modules, so I want to keep this feature myself. However, this method is unused, and likely unneeded, for events other than connection requests. If this is the case, we can update the documentation, and remove this support except for new connections. 
I looked at the existing users and didn't find any module unload races with either the ib_cm or rdma_cm, so I don't think that any immediate fixes are necessary. - Sean From dledford at redhat.com Thu Nov 2 09:23:45 2006 From: dledford at redhat.com (Doug Ledford) Date: Thu, 02 Nov 2006 12:23:45 -0500 Subject: [openib-general] [openfabrics-ewg] RHEL5 and OFED ... In-Reply-To: <453793A1.8000000@voltaire.com> References: <1161155330.2917.511.camel@fc6.xsintricity.com> <20061018072904.GA26507@mellanox.co.il> <1161177058.2917.513.camel@fc6.xsintricity.com> <20061019050907.GA1547@mellanox.co.il> <1161268837.2917.544.camel@fc6.xsintricity.com> <453793A1.8000000@voltaire.com> Message-ID: <1162488225.2898.346.camel@fc6.xsintricity.com> On Thu, 2006-10-19 at 17:02 +0200, Or Gerlitz wrote: > Doug Ledford wrote: > > ... and reviewing arpingib > > (which I'm going to remove from the ipoibtools and fix the native arping > > in RHEL5 to work properly over IB without needing a new flag, the -A or > > -U flags should be sufficient assuming those modes worked at all over IB > > which they don't in either the native arping or the patched arpingib in > > ipoibtools). I should get to it today though. > > Would you mind send the patch to arping for review? OK, this patch to arping actually makes it work for me in all modes (duplicate address detection, arp response, and unsolicited arp response). You shouldn't need any new flags to arping with this patch, you should be able to just use the existing modes of operation as they were intended to make the ipoibha.pl script work. There's still some debugging printf's in the patch, so don't consider this a final version. How's it work? The getsockname() function will return the full hw address if you give it a buffer large enough to do so. So, instead of allocating a single struct sockaddr_ll for me and he, which caps the address size at 8 bytes, allocate two and let the extra 12 bytes run over into the second struct element. 
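The sizes behind this trick can be checked directly. The following is a minimal sketch, not the scrubbed patch itself: it assumes Linux with glibc's `<netpacket/packet.h>` and a 20-byte IPoIB hardware address, and the helper names `ll_addr_room()` and `ll_copy_hw_addr()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <netpacket/packet.h>	/* struct sockaddr_ll (Linux) */

/* Assumed IPoIB link-layer address length: 8 bytes that fit plus 12 that do not. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Bytes available for a hardware address when n sockaddr_ll structs are
 * laid out back to back, counting from sll_addr of the first one.
 */
static size_t ll_addr_room(size_t n)
{
	return n * sizeof(struct sockaddr_ll) -
	       offsetof(struct sockaddr_ll, sll_addr);
}

/*
 * Copy hw_len address bytes out of a buffer that was passed to
 * getsockname(). Returns -1 if the buffer is too small to hold them.
 */
static int ll_copy_hw_addr(const void *buf, size_t buf_len,
			   unsigned char *out, size_t hw_len)
{
	size_t off = offsetof(struct sockaddr_ll, sll_addr);

	assert(buf != NULL && out != NULL);
	if (buf_len < off + hw_len)
		return -1;
	memcpy(out, (const unsigned char *)buf + off, hw_len);
	return 0;
}
```

With `struct sockaddr_ll me[2];`, the buffer handed to getsockname() would be `sizeof(me)` bytes long, so `ll_addr_room(2)` bytes are available for the address instead of the 8 that a single struct offers.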
Adjust the sendto and recvfrom calls to accommodate this intentional size overrun. Finally, don't assume the broadcast address is all 1's; use sysfs to get the actual device broadcast address and convert it from text to binary (which will accommodate any possible future interface types that similarly don't have all 1's for broadcast address without requiring any recoding). That's all I had to do in order to get it to work for me. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: arping-infiniband.patch Type: text/x-patch Size: 4873 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From vuhuong at mellanox.com Thu Nov 2 09:35:19 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Thu, 02 Nov 2006 09:35:19 -0800 Subject: [openib-general] Mellanox SRP target implementation In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB033015A7395@CINMLVEM11.e2k.ad.ge.com> References: <2376B63A5AF8564F8A2A2D76BC6DB033015A7395@CINMLVEM11.e2k.ad.ge.com> Message-ID: <454A2C57.3080800@mellanox.com> >>*srp target* is still on gen1 code base - IBGD >> >>*nfs-rdma server* is on gen2 code base > > > Any chance the MTD2000 runs openfiler? > We have never installed openfiler.
You can try -vu From mshefty at ichips.intel.com Thu Nov 2 09:34:48 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 09:34:48 -0800 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <454A0CD6.3090703@voltaire.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <4540CA0E.9020807@voltaire.com> <4540DE9B.7070900@ichips.intel.com> <454A0CD6.3090703@voltaire.com> Message-ID: <454A2C38.1040805@ichips.intel.com> > Have you looked on that? from the compilation failure against > libibverbs-1.0 the gap seem pretty small. If indeed this is the case, > since libibverbs-1.1 is in development lets check with Roland if it > makes sense for him to support these small-gap-features in > libibverbs-1.0.X, i guess what matters here is ABI versions... I have not had time to look into this yet. > I think we do want it. The rdma cm provide the means to offload ip > multicast to ib multicast though registration (join/leave etc) with the > ib_sa module. IP Multicast does use the send-only feature and hence IP > Multicast offloading apps need it as well. The rdma cm framework fits > very well for such apps and the ib_usa (which does not exist now, and i > am not sure needs to exist... it was a project of a summer student with > open-mpi that required that...) not. Are you wanting the rdma cm to join the same multicast groups that ipoib does? (This is simple to change, but it does not join the same groups today.) I will likely need to spin these patches again to incorporate the changes for path failover, so adding in join options wouldn't be difficult. Are you just wanting to see them added to rdma_join_multicast directly? - Sean From mst at mellanox.co.il Thu Nov 2 10:03:52 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 2 Nov 2006 20:03:52 +0200 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <454A24E4.8060001@ichips.intel.com> References: <454A24E4.8060001@ichips.intel.com> Message-ID: <20061102180352.GA13591@mellanox.co.il> Quoting r. Sean Hefty : > Subject: Re: [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race > > > All active side users are fine I think. But any client on the passive side > > currently might destroy the new ID by returning error from the callback, and I > > like this interface since it frees the resources immediately. > > As long as only *newly* created (i.e. associated with a connection request) > cm_id's are destroyed this way, we're fine. Newly created cm_id's are > associated with a listening cm_id. Destruction of the listening cm_id is > blocked while a callback for a connection request is in progress. > > > Since all such passive side users currently are out of tree, I don't think > > it's urgent for us to do anything about the passive side race - but please do > > not at least break code that uses passive side in major ways just yet. > > I use the callback method of destruction for new cm_id's in the ucm and ucma > modules, so I want to keep this feature myself. However, this method is > unused, and likely unneeded, for events other than connection requests. If > this is the case, we can update the documentation, and remove this support > except for new connections. Another case is a request and then a reject. > I looked at the existing users and didn't find any module unload races with > either the ib_cm or rdma_cm, so I don't think that any immediate fixes are > necessary. 
> > - Sean > -- MST From mshefty at ichips.intel.com Thu Nov 2 10:09:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 10:09:58 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <4549D3B7.1050208@voltaire.com> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> Message-ID: <454A3476.6090402@ichips.intel.com> Or Gerlitz wrote: > Can be very nice if you share with the community the IB stack issues > revealed under scale-out testing... basically what was the testbed? We have a 256 node (512 processors) cluster that we can test with on the second Tuesday following the first Monday of any month with two full moons. We're only now getting some time on the cluster, and our test capabilities are limited. The main issue that we saw was that the SA simply doesn't scale. > From what the patch does I understand you attempt to handle timeout on > address and route resolution and long disconnect delay. correct > Was the issue with address resolution being ARP request or reply > messages getting lost? This appears to be the case. During test startup, we try to form all to all connections. As we scaled, the number of address resolutions that timed out also increased. We suspect that this is a result of the ipoib broadcast channel getting hit with a 100,000+ requests. > Was the issue with route resolution being timeout on SA Path queries? Yes - but the issues are more complex than that. The SA was able to respond to 4000-6000 queries per second. With an all to all connection model, it gets about 130,000 requests. Assuming that none of these are lost and a 4 second timeout, it will be able to respond only a fraction of the original requests in time. The next 100,000+ requests that it responds to have already timed out before it can send the response. 
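The figures in this paragraph can be checked with a little arithmetic. The following is a minimal sketch under stated assumptions: a service rate taken from the middle of the quoted 4000-6000 range, and retry timeouts that double after each attempt (4 s, then 8 s, then 16 s). The function names are illustrative only and do not come from the ib_sa module.

```c
#include <assert.h>

/* Seconds the SA needs to work through `requests` queries at `rate` per second. */
static double sa_drain_seconds(double requests, double rate)
{
	assert(rate > 0.0);
	return requests / rate;
}

/* Number of requests the SA can answer before the first timeout expires. */
static double sa_answered_in_time(double requests, double rate, double timeout_s)
{
	double served = rate * timeout_s;

	return served < requests ? served : requests;
}

/*
 * Time in seconds at which the n-th retry (n = 1, 2, ...) is sent when the
 * first timeout is first_timeout_s and every later timeout is twice the
 * previous one. The doubling is an assumption of this sketch.
 */
static double retry_time(int n, double first_timeout_s)
{
	double t = 0.0;
	double timeout = first_timeout_s;
	int i;

	for (i = 0; i < n; i++) {
		t += timeout;
		timeout *= 2.0;
	}
	return t;
}
```

With 130,000 requests at 5000 per second this gives 26 seconds to drain the first wave, and 20,000 requests (about 15%) answered inside the first 4-second timeout; at 4000 per second the drain time rises to 32.5 seconds. Under the doubling assumption the retries go out at 4, 12 and 28 seconds.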
At 5000 queries per second, it will take the SA nearly 30 seconds to respond to the first set of requests, most of which will have timed out. By the time it reached the end of the first 130,000 requests, it had hundreds of thousands of queued retries, most of which had also already timed out. (E.g. even with an exponential backoff, you'd have retries at 4 seconds, 12 seconds, and 28 seconds before the SA can finish processing the first set of requests.) To further complicate the issue, retried requests are given new transaction IDs by the ib_sa module, which makes it impossible for the SA to distinguish retries from original requests. It sees all requests as new. On our largest run, we were never able to complete route resolution. We're still exploring possibilities in this area. > Was the issue with disconnect delay that peer A called > dat_ep_disconnect() (ie sending DREQ) and the DREP was sent only when > peer B got the disconnect event and called dat_ep_disconnect()? so now > the DREP is sent from within the provider code when it gets the DREQ? The disconnect delay occurred because of remote nodes being slow to respond to disconnect requests. We're still investigating this issue. - Sean From sean.hefty at intel.com Thu Nov 2 10:16:10 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 2 Nov 2006 10:16:10 -0800 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <20061102180352.GA13591@mellanox.co.il> Message-ID: <000101c6feaa$f95a7bf0$45248686@amr.corp.intel.com> >Another case is a request and then a reject. Yes - I considered reject, disconnect, and device removal as good candidates to make use of this. It's just that in these cases, the user has had the option of allocating resources with the cm_id that it can use to queue for destruction. With a new cm_id, the user may not be able to allocate the necessary resources in order to destroy it from another thread.
Does SDP use this feature for events other than for connection requests? - Sean From mst at mellanox.co.il Thu Nov 2 10:22:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 2 Nov 2006 20:22:07 +0200 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <454A24E4.8060001@ichips.intel.com> References: <454A24E4.8060001@ichips.intel.com> Message-ID: <20061102182207.GB13591@mellanox.co.il> Quoting r. Sean Hefty : > I use the callback method of destruction for new cm_id's in the ucm and ucma > modules, so I want to keep this feature myself. However, this method is > unused, and likely unneeded, for events other than connection requests. If > this is the case, we can update the documentation, and remove this support > except for new connections. > I rethought the issue, and I don't think it's a good assumption to make. Let's stick to the old API. For example, SDP uses the callback destruction capability for all IDs. For example, if on the active side I get a reject, it is much nicer to get the id cleaned up immediately since I have no reason to keep it around, and because I want to put the socket back in the same state it was in before connect (that is, without a connection id), so a new connect request will restart everything. Otherwise it is quite awkward, I'm just wasting memory, and applications actually *do* keep a huge number of inactive sockets around. I expect we'll want something like this for IPoIB connected mode too - keeping idle IDs and queueing work requests would be quite awkward I think. Adding registration at module start/stop seems simple enough and overhead is minimal. We already have this for other modules (e.g. ib_sa). I don't really understand why there is such resistance to this simple fix for the unload race? -- MST From mst at mellanox.co.il Thu Nov 2 10:24:38 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Thu, 2 Nov 2006 20:24:38 +0200 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <000101c6feaa$f95a7bf0$45248686@amr.corp.intel.com> References: <000101c6feaa$f95a7bf0$45248686@amr.corp.intel.com> Message-ID: <20061102182438.GC13591@mellanox.co.il> Quoting r. Sean Hefty : > Does SDP use this feature for events other than for connection requests? Yes, it does. But as I said, since SDP is out of tree, for now we can just ignore the module unloading race. -- MST From mst at mellanox.co.il Thu Nov 2 10:34:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 2 Nov 2006 20:34:04 +0200 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <454A3476.6090402@ichips.intel.com> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> <454A3476.6090402@ichips.intel.com> Message-ID: <20061102183404.GD13591@mellanox.co.il> Quoting r. Sean Hefty : > Subject: scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq > > Or Gerlitz wrote: > > Can be very nice if you share with the community the IB stack issues > > revealed under scale-out testing... basically what was the testbed? > > We have a 256 node (512 processors) cluster that we can test with on the second > Tuesday following the first Monday of any month with two full moons. We're only > now getting some time on the cluster, and our test capabilities are limited. > > The main issue that we saw was that the SA simply doesn't scale. We had an option to increase the RQ size for QP1 and QP0. This might help you too: try increasing IB_MAD_QP_RECV_SIZE. 
-- MST From viswa.krish at gmail.com Thu Nov 2 10:33:50 2006 From: viswa.krish at gmail.com (Viswanath Krishnamurthy) Date: Thu, 2 Nov 2006 10:33:50 -0800 Subject: [openib-general] opensm crash with topspin HCA In-Reply-To: <4df28be40610311710i4beb19fdp3ecd13cd95db7d17@mail.gmail.com> References: <4df28be40610311710i4beb19fdp3ecd13cd95db7d17@mail.gmail.com> Message-ID: <4df28be40611021033t3d77262fwea56de0377d11c31@mail.gmail.com> When we run opensm (OFED) release and if a Topspin HCA is in the IB network, opensm crashes in umad_receiver with NULL pointer exception. The transaction ID is zero is the MAD'S from topspin HCA on windows. The crashes seems to random in umad_receiver. HCA found: hca_id=InfiniHost0 vendor_id=0x02C9 vendor_part_id=0x5A44 hw_ver=0xA0 fw_ver=0x400060000 -------------- next part -------------- An HTML attachment was scrubbed... URL: From venkatesh.babu at 3leafnetworks.com Thu Nov 2 11:37:51 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 02 Nov 2006 11:37:51 -0800 Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support In-Reply-To: <000001c6fe31$ebc6e150$e0d8180a@amr.corp.intel.com> References: <000001c6fe31$ebc6e150$e0d8180a@amr.corp.intel.com> Message-ID: <454A490F.6050103@3leafnetworks.com> Sean Hefty wrote: >>Are these changes to replace ib_cm_init_rearm_attr() interface ? >> >> > >Yes - you use ib_cm_init_qp_attr() to get the qp_attr after a loading a new >alternate path. The new path is loaded using ib_send_cm_lap(). So, after a >path fails: > > After path fails, I just call ib_qp_modify() on both active and passive side to switch to the alternate path by changing path_mig_state to IB_MIG_MIGRATED. Let me make the steps clear - 1. On Passive node register for remote port UP/DOWN event by registering with ib_sa_serv_notice_hdlr() 2. On Passive node start the listener by calling ib_cm_listen(). 3. On Active node create the RC QP and establish the connection by calling ib_send_cm_req(). 
In struct ib_cm_req_param specify both primary path (say, through Port1) and alternate path (say, through Port2). NOTE:-Assume Port1 of Active node is connected to Port1 of Passive node; and Port2 of Active node is connected to Port2 of Passive node. NOTE:- After this step QP's path_mig_state will be IB_MIG_ARMED. 4. Let us say, Port1 on Active node fails 5. IB_EVENT_PORT_ERR event is generated on Active node; and remote port error event is generated on Passive node. 6. In those event handler call ib_qp_modify() to set the path_mig_state to IB_MIG_MIGRATED. This will let the HCA's firmware know to switch to the alternate path. 7. After a while, Port1 is comes back again. 8. IB_EVENT_PORT_ACTIVE event is generated on Active node; and remote port active event is generated on Passive node. 9. On the Active node from IB_EVENT_PORT_ACTIVE event handler call the ib_send_cm_lap() to send the alternate path (through Port1) to the Passive node. 9.1 Passive node receives the LAP message 9.2 Calls ib_cm_init_rearm_attr() initialize the alternate path info 9.3 Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM 9.4 Send APR message back to the Active node. 10. Active node receives the APR message 11. Calls ib_cm_init_rearm_attr() initialize the alternate path info 12. Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM 13. Now when a first packet is passed between the Active and Passive node the ib_core changes the path_mig_state to the IB_MIG_ARMED. 14. Now it is all set for another failover. >One side calls ib_send_cm_lap() to propose a new alternate path. >Second side responds by calling ib_send_cm_apr(). >Both sides call ib_cm_init_qp_attr(), then ib_modify_qp() to load the new path. > >This is intended to work if failover has occurred, or if the user detects that >the alternate path is down and wants to replace it. 
> >There is an additional call, ib_cm_notify() which is used to let the CM know >that the primary path has failed, and the alternate path should be used when >sending future CM messages. In case of failover, this needs to be called before >calling ib_send_cm_lap() to ensure that the LAP message reaches the remote user. > > > >>The path migration from Primary to Alternate succeeded, then reloaded >>the alternate path. >> >> > >How did you reload the alternate path? > > Steps 9 through 12. > > >>failed with the IB_WC_RETRY_EXC_ERR. But I got the event IB_EVENT_PATH_MIG. >> >>With the ib_cm_init_rearm_attr() being called, failover/failback worked >>fine. >> >> > >Were you calling ib_send_cm_lap() to load a new alternate path, > Step 9 >or just assuming >that the old path would work after failover occurred? > > Before the failover occurring the QP's path_mig_state must be in IB_MIG_ARMED, otherwise failover doesn't work. If it is IB_MIG_ARMED, then alternate path is already loaded, and just calling ib_qp_modify() to update path_mig_state to IB_MIG_MIGRATED, will toss the primary path and change the alternate path to primary path. >- Sean > > From sashak at voltaire.com Thu Nov 2 11:18:14 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 21:18:14 +0200 Subject: [openib-general] opensm crash with topspin HCA In-Reply-To: <4df28be40611021033t3d77262fwea56de0377d11c31@mail.gmail.com> References: <4df28be40610311710i4beb19fdp3ecd13cd95db7d17@mail.gmail.com> <4df28be40611021033t3d77262fwea56de0377d11c31@mail.gmail.com> Message-ID: <20061102191814.GH17244@sashak.voltaire.com> On 10:33 Thu 02 Nov , Viswanath Krishnamurthy wrote: > When we run opensm (OFED) release and if a Topspin HCA is in the IB network, > opensm crashes in umad_receiver with NULL pointer exception. Do you have any logs, gdb backtrace or any other details? Sasha > The > transaction ID is zero is the MAD'S from topspin HCA on windows. The crashes > seems to random in umad_receiver. 
> > > HCA found: > > hca_id=InfiniHost0 > > vendor_id=0x02C9 > > vendor_part_id=0x5A44 > > hw_ver=0xA0 > > fw_ver=0x400060000 > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Nov 2 11:13:43 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 11:13:43 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <20061102183404.GD13591@mellanox.co.il> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> <454A3476.6090402@ichips.intel.com> <20061102183404.GD13591@mellanox.co.il> Message-ID: <454A4367.5080409@ichips.intel.com> > We had an option to increase the RQ size for QP1 and QP0. > This might help you too: try increasing IB_MAD_QP_RECV_SIZE. Actually, dropping the requests actually helps the scalability. If nothing gets dropped, the backlog of queued requests grows to hundreds of thousands, most of which will have timed out before the SA can get around to processing them. One option is having the SA (or ib_umad?) return a busy status in response to a MAD, but we'd still have to be able to send this response as quickly as requests are being received. We could then limit the number of requests that would be queued in the kernel for a user. Unfortunately, when we are able to run on the cluster, modifying the kernel modules isn't available to use... 
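The backlog argument can be checked with a little arithmetic. The sketch below uses the figures quoted earlier in this thread (about 5000 SA queries per second, about 130,000 path queries, a 4 second timeout). The function names are made up for illustration, and the doubling timeout is an assumption chosen because it reproduces the retry times of 4, 12, and 28 seconds mentioned earlier.

```c
#include <assert.h>

/* Seconds the SA needs to answer `requests` queries at `rate` queries/s. */
static double sa_drain_seconds(double requests, double rate)
{
    assert(rate > 0.0);
    return requests / rate;
}

/* Queries the SA can answer before the first client timeout expires. */
static double sa_answered_in_time(double rate, double timeout_s)
{
    return rate * timeout_s;
}

/*
 * Time, in seconds after the original request, at which the n-th retry
 * (n >= 1) is sent when the timeout doubles after every attempt.
 * ASSUMPTION: plain doubling backoff; the real ib_sa policy may differ.
 */
static unsigned retry_send_time(unsigned first_timeout_s, unsigned n)
{
    unsigned t = 0, timeout = first_timeout_s, i;

    for (i = 0; i < n; i++) {
        t += timeout;    /* this attempt times out ... */
        timeout *= 2;    /* ... and the next one waits twice as long */
    }
    return t;
}
```

With these inputs, `sa_drain_seconds(130000, 5000)` gives 26 seconds, while `sa_answered_in_time(5000, 4)` gives only 20,000 queries answered before the first timeout. The first three retries go out at 4, 12, and 28 seconds, so two full rounds of retries are already queued before the original requests have been drained.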
- Sean From sashak at voltaire.com Thu Nov 2 11:20:39 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Nov 2006 21:20:39 +0200 Subject: [openib-general] [PATCH v2] opensm: strict osm_log arguments/format check In-Reply-To: <20061102105348.GA16559@sashak.voltaire.com> References: <20061102105348.GA16559@sashak.voltaire.com> Message-ID: <20061102192039.GI17244@sashak.voltaire.com> This adds gcc attribute to osm_log() which causes the compiler to check argument types against a format string. And also there are related fixes in osm_log() usage. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_log.h | 8 +++++++- osm/libvendor/osm_vendor_ibumad_sa.c | 2 +- osm/opensm/main.c | 3 ++- osm/opensm/osm_pkey_mgr.c | 1 + osm/opensm/osm_port_info_rcv.c | 5 +++-- osm/opensm/osm_sa_informinfo.c | 4 ++-- osm/opensm/osm_sa_link_record.c | 8 ++++---- osm/opensm/osm_sa_mad_ctrl.c | 3 ++- osm/opensm/osm_sa_response.c | 2 +- osm/opensm/osm_sm_state_mgr.c | 3 ++- osm/opensm/osm_sminfo_rcv.c | 9 +++++---- osm/opensm/osm_state_mgr.c | 8 ++++---- osm/osmtest/osmt_multicast.c | 12 +++++++----- osm/osmtest/osmt_service.c | 6 +++--- osm/osmtest/osmtest.c | 8 ++++---- 15 files changed, 48 insertions(+), 34 deletions(-) diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h index 6a1a93f..f51a1c8 100644 --- a/osm/include/opensm/osm_log.h +++ b/osm/include/opensm/osm_log.h @@ -60,6 +60,12 @@ #include #include #include +#ifdef __GNUC__ +#define STRICT_OSM_LOG_FORMAT __attribute__((format(printf, 3, 4))) +#else +#define STRICT_OSM_LOG_FORMAT +#endif + #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { # define END_C_DECLS } @@ -377,7 +383,7 @@ void osm_log( IN osm_log_t* const p_log, IN const osm_log_level_t verbosity, - IN const char *p_str, ... ); + IN const char *p_str, ... 
) STRICT_OSM_LOG_FORMAT; void osm_log_raw( diff --git a/osm/libvendor/osm_vendor_ibumad_sa.c b/osm/libvendor/osm_vendor_ibumad_sa.c index 7fd0655..7c4a2f7 100644 --- a/osm/libvendor/osm_vendor_ibumad_sa.c +++ b/osm/libvendor/osm_vendor_ibumad_sa.c @@ -853,7 +853,7 @@ #ifdef DUAL_SIDED_RMPP if ( p_mpr_req->sgid_count + p_mpr_req->dgid_count > IB_MULTIPATH_MAX_GIDS ) { osm_log( p_log, OSM_LOG_ERROR, - "osmv_query_sa DBG:001 MULTIPATH_REC ", + "osmv_query_sa DBG:001 MULTIPATH_REC " "SGID count %d DGID count %d max count %d\n", p_mpr_req->sgid_count, p_mpr_req->dgid_count, IB_MULTIPATH_MAX_GIDS ); diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 729702a..752b546 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -460,7 +460,8 @@ parse_ignore_guids_file(IN char *guids_f { osm_log( &p_osm->log, OSM_LOG_ERROR, "parse_ignore_guids_file: ERR 0601: " - "Unable to open ignore guids file (%s)\n" ); + "Unable to open ignore guids file (%s)\n", + guids_file_name ); status = IB_ERROR; goto Exit; } diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c index f2cb221..735dc14 100644 --- a/osm/opensm/osm_pkey_mgr.c +++ b/osm/opensm/osm_pkey_mgr.c @@ -139,6 +139,7 @@ pkey_mgr_process_physical_port( "pkey_mgr_process_physical_port: ERR 0503: " "Failed to obtain P_Key 0x%04x block and index for node " "0x%016" PRIx64 " port %u\n", + ib_pkey_get_base( pkey ), cl_ntoh64( osm_node_get_node_guid( p_node ) ), osm_physp_get_port_num( p_physp ) ); return; diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c index 95112dc..f6d3595 100644 --- a/osm/opensm/osm_port_info_rcv.c +++ b/osm/opensm/osm_port_info_rcv.c @@ -724,8 +724,9 @@ osm_pi_rcv_process( { osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "osm_pi_rcv_process: " - "Got light sweep response from remote port of parent node GUID = 0x%" PRIx64 - " port = %u, Commencing heavy sweep\n", + "Got light sweep response from remote port of parent node " + "GUID = 0x%" PRIx64 " port = 0x%016" PRIx64 + ", 
Commencing heavy sweep\n", cl_ntoh64( node_guid ), cl_ntoh64( port_guid ) ); osm_state_mgr_process( p_rcv->p_state_mgr, diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c index 69dca1d..0cec307 100644 --- a/osm/opensm/osm_sa_informinfo.c +++ b/osm/opensm/osm_sa_informinfo.c @@ -163,8 +163,8 @@ __validate_ports_access_rights( { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__validate_ports_access_rights: ERR 4301: " - "Invalid port guid: 0x%016\n", - portguid ); + "Invalid port guid: 0x%016" PRIx64 "\n", + cl_ntoh64(portguid) ); valid = FALSE; goto Exit; } diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 751023f..0ca9092 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -145,10 +145,10 @@ __osm_lr_rcv_build_physp_link( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_lr_rcv_build_physp_link: ERR 1801: " "Unable to acquire link record\n" - "\t\t\t\tFrom port 0x%\n" - "\t\t\t\tTo port 0x%\n" - "\t\t\t\tFrom lid 0x%\n" - "\t\t\t\tTo lid 0x%\n", + "\t\t\t\tFrom port 0x%u\n" + "\t\t\t\tTo port 0x%u\n" + "\t\t\t\tFrom lid 0x%u\n" + "\t\t\t\tTo lid 0x%u\n", from_port, to_port, cl_ntoh16(from_lid), cl_ntoh16(to_lid) ); diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c index cd896b6..208f0d2 100644 --- a/osm/opensm/osm_sa_mad_ctrl.c +++ b/osm/opensm/osm_sa_mad_ctrl.c @@ -132,7 +132,8 @@ __osm_sa_mad_ctrl_process( "__osm_sa_mad_ctrl_process: " /* "Responding BUSY status since the dispatcher is already"*/ "Dropping MAD since the dispatcher is already" - " overloaded with %u messages and queue time of:%u[msec]\n", + " overloaded with %u messages and queue time of:" + "%" PRIu64 "[msec]\n", num_messages, last_dispatched_msg_queue_time_msec ); /* send a busy response */ diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c index db36ea2..27f4e9d 100644 --- a/osm/opensm/osm_sa_response.c +++ b/osm/opensm/osm_sa_response.c @@ -117,7 +117,7 @@ 
osm_sa_send_error( if (osm_exit_flag) { osm_log( p_resp->p_log, OSM_LOG_DEBUG, - "osm_sa_send_error: ", + "osm_sa_send_error: " "Ignoring requested send after exit\n" ); goto Exit; } diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c index aadc43a..7489c28 100644 --- a/osm/opensm/osm_sm_state_mgr.c +++ b/osm/opensm/osm_sm_state_mgr.c @@ -247,7 +247,8 @@ __osm_sm_state_mgr_send_master_sm_info_r { osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, "__osm_sm_state_mgr_send_master_sm_info_req: ERR 3203: " - "No port object for GUID 0x%X\n", p_sm_mgr->master_guid ); + "No port object for GUID 0x%016" PRIx64 "\n", + cl_ntoh64(p_sm_mgr->master_guid) ); goto Exit; } diff --git a/osm/opensm/osm_sminfo_rcv.c b/osm/opensm/osm_sminfo_rcv.c index 825b18b..2fcd2d4 100644 --- a/osm/opensm/osm_sminfo_rcv.c +++ b/osm/opensm/osm_sminfo_rcv.c @@ -402,8 +402,8 @@ __osm_sminfo_rcv_process_set_request( osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_sminfo_rcv_process_set_request: " "Received a STANDBY signal. Updating " - "sm_state_mgr master_guid: 0x%X\n", - p_rcv_smi->guid ); + "sm_state_mgr master_guid: 0x%016" PRIx64 "\n", + cl_ntoh64(p_rcv_smi->guid) ); p_rcv->p_sm_state_mgr->master_guid = p_rcv_smi->guid; } @@ -482,8 +482,9 @@ __osm_sminfo_rcv_process_get_sm( /* we will poll it - as long as it lives - we should be in Standby. */ osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_sminfo_rcv_process_get_sm: " - "Found higher SM. Updating sm_state_mgr master_guid: 0x%X\n", - p_sm->p_port->guid ); + "Found higher SM. 
Updating sm_state_mgr master_guid:" + " 0x%016" PRIx64 "\n", + cl_ntoh64(p_sm->p_port->guid) ); p_rcv->p_sm_state_mgr->master_guid = p_sm->p_port->guid; } break; diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 66da6fa..70af836 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -481,7 +481,7 @@ __osm_state_mgr_signal_warning( { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "__osm_state_mgr_signal_warning: " - "Invalid signal %s(%d) in state %s\n", + "Invalid signal %s(%lu) in state %s\n", osm_get_sm_signal_str( signal ), signal, osm_get_sm_state_str( p_mgr->state ) ); } @@ -500,7 +500,7 @@ __osm_state_mgr_signal_error( else osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_signal_error: ERR 3303: " - "Invalid signal %s(%d) in state %s\n", + "Invalid signal %s(%lu) in state %s\n", osm_get_sm_signal_str( signal ), signal, osm_get_sm_state_str( p_mgr->state ) ); } @@ -1480,8 +1480,8 @@ __osm_state_mgr_exists_other_master_sm( { osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, "__osm_state_mgr_exists_other_master_sm: " - "Found remote master SM with guid:0x%X\n", - p_sm->smi.guid ); + "Found remote master SM with guid:0x%016" PRIx64 "\n", + cl_ntoh64(p_sm->smi.guid) ); p_sm_res = p_sm; goto Exit; } diff --git a/osm/osmtest/osmt_multicast.c b/osm/osmtest/osmt_multicast.c index 33a4f47..19f9d37 100644 --- a/osm/osmtest/osmt_multicast.c +++ b/osm/osmtest/osmt_multicast.c @@ -1885,8 +1885,9 @@ #endif { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_mcast_flow: ERR 0209: " - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", - p_mc_res->mgid + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) ); status = IB_ERROR; goto Exit; @@ -2044,8 +2045,9 @@ #endif { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_mcast_flow: ERR 0212: " - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", - p_mc_res->mgid + "Validating MGID failed. 
MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) ); status = IB_ERROR; goto Exit; @@ -3345,7 +3347,7 @@ #endif /* Delete all MCG that are not of IPoIB */ osm_log( &p_osmt->log, OSM_LOG_INFO, "osmt_run_mcast_flow : " - "Cleanup all MCG that are not IPoIB...\n", cnt ); + "Cleanup all MCG that are not IPoIB...\n" ); p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; p_mgrp = (osmtest_mgrp_t*)cl_qmap_head( p_mgrp_mlid_tbl ); diff --git a/osm/osmtest/osmt_service.c b/osm/osmtest/osmt_service.c index ec9a39e..ab95fec 100644 --- a/osm/osmtest/osmt_service.c +++ b/osm/osmtest/osmt_service.c @@ -1559,7 +1559,7 @@ #endif { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_service_records_flow: ERR 4A20: " - "Found service: id: 0x%016 " PRIx64 + "Found service: id: 0x%016" PRIx64 " " "that is invalid\n", id[7] ); status = IB_ERROR; @@ -1573,7 +1573,7 @@ #endif { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_service_records_flow: ERR 4A21: " - "Fail to find service: id: 0x%016 " PRIx64 + "Fail to find service: id: 0x%016" PRIx64 " " "name: %s\n", id[0], (char*)service_name[0] ); @@ -1588,7 +1588,7 @@ #endif { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmt_run_service_records_flow: ERR 4A22: " - "Fail to find service: id: 0x%016 " PRIx64 + "Fail to find service: id: 0x%016" PRIx64 " " "name: %s\n", id[5], (char*)service_name[6] ); diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 92a4190..a35e0c5 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -2787,7 +2787,8 @@ osmtest_create_inventory_file( IN osmtes { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_create_inventory_file: ERR 0079: " - "Unable to open inventory file (%s)\n" ); + "Unable to open inventory file (%s)\n", + p_osmt->opt.file_name ); status = IB_ERROR; goto Exit; } @@ -3356,7 +3357,7 @@ osmtest_validate_path_data( IN osmtest_t osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_validate_path_data: ERR 
0012: " "PKEY mismatch on path SLID 0x%X to DLID 0x%X\n" - "\t\t\t\tExpected 0x%X, received 0x%X\n", + "\t\t\t\tExpected 0x%" PRIx64 ", received 0x%" PRIx64 "\n", cl_ntoh16( p_path->rec.slid ), cl_ntoh16( p_path->rec.dlid ), cl_ntoh64( p_path->rec.pkey ), cl_ntoh64( p_rec->pkey ) ); @@ -7165,8 +7166,7 @@ osmtest_bind( IN osmtest_t * p_osmt, { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_bind: ERR 0135: " - "No local ports. Unable to proceed\n", - ib_get_err_str( status ) ); + "No local ports. Unable to proceed\n" ); goto Exit; } guid = attr_array[port_index].port_guid; -- 1.4.3.2.g4bf7 From rdreier at cisco.com Thu Nov 2 10:20:43 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Nov 2006 10:20:43 -0800 Subject: [openib-general] [PATCH] use mmiowb after doorbell ring In-Reply-To: <20061102131913.GB8885@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 2 Nov 2006 15:19:13 +0200") References: <20061102131913.GB8885@mellanox.co.il> Message-ID: > By the way, what's up with this project? > It's still planned for libibverbs 1.1, isn't it? I working on it along with other things. Where are your patches for using multiple EQs for CQ events? :) - R. From mshefty at ichips.intel.com Thu Nov 2 11:18:59 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 11:18:59 -0800 Subject: [openib-general] [PATCH] for 2-6-19 rdma/addr: use client registration to fix module unload race In-Reply-To: <20061102182207.GB13591@mellanox.co.il> References: <454A24E4.8060001@ichips.intel.com> <20061102182207.GB13591@mellanox.co.il> Message-ID: <454A44A3.6060001@ichips.intel.com> Michael S. Tsirkin wrote: > Adding registration at module start/stop seems simple enough and overhead is > minimal. We already have this for other modules (e.g. ib_sa). I don't really > unerstand why is there such a resistance to this simple fix for unload race? There's no resistance if someone is using it. There's just an easier solution if no one is or was going to use it... 
The change just isn't quite as trivial with the ib_cm or rdma_cm as it was with the ib_sa or ib_addr. - Sean From sean.hefty at intel.com Thu Nov 2 11:32:00 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 2 Nov 2006 11:32:00 -0800 Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support In-Reply-To: <454A490F.6050103@3leafnetworks.com> Message-ID: <000001c6feb5$919bcb80$7ffc070a@amr.corp.intel.com> > Let me make the steps clear - This helps - thanks. > 1. On Passive node register for remote port UP/DOWN event by >registering with ib_sa_serv_notice_hdlr() FYI - patches for this are being worked separately. > 2. On Passive node start the listener by calling ib_cm_listen(). > 3. On Active node create the RC QP and establish the connection by >calling ib_send_cm_req(). In struct ib_cm_req_param specify both primary >path (say, through Port1) and alternate path (say, through Port2). >NOTE:-Assume Port1 of Active node is connected to Port1 of Passive node; >and Port2 of Active node is connected to Port2 of Passive node. >NOTE:- After this step QP's path_mig_state will be IB_MIG_ARMED. > 4. Let us say, Port1 on Active node fails > 5. IB_EVENT_PORT_ERR event is generated on Active node; and remote >port error event is generated on Passive node. > 6. In those event handler call ib_qp_modify() to set the >path_mig_state to IB_MIG_MIGRATED. This will let the HCA's firmware know >to switch to the alternate path. At least the active side in your scenario should call ib_cm_notify() after this step. Otherwise, the LAP will go out the primary path, which is down. This isn't a big deal in your test case, since you wait for the primary path to return (step 7) before calling ib_send_cm_lap(). > 7. After a while, Port1 is comes back again. > 8. IB_EVENT_PORT_ACTIVE event is generated on Active node; and remote >port active event is generated on Passive node. > 9. 
On the Active node from IB_EVENT_PORT_ACTIVE event handler call >the ib_send_cm_lap() to send the alternate path (through Port1) to the >Passive node. > 9.1 Passive node receives the LAP message The proposed patch will record the alternate path when the LAP is sent or received. (Again, these patches are untested, so there can be some bugs here. I'm still working on writing a test program to use these interfaces.) > 9.2 Calls ib_cm_init_rearm_attr() initialize the alternate path info This should now call ib_cm_init_qp_attr(). > 9.3 Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM > 9.4 Send APR message back to the Active node. > 10. Active node receives the APR message > 11. Calls ib_cm_init_rearm_attr() initialize the alternate path info This should now call ib_cm_init_qp_attr(). > 12. Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM > 13. Now when a first packet is passed between the Active and Passive >node the ib_core changes the path_mig_state to the IB_MIG_ARMED. > 14. Now it is all set for another failover. Using the proposed patches, where did you see a failure? - Sean From ardavis at ichips.intel.com Thu Nov 2 11:50:35 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 02 Nov 2006 11:50:35 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <454A4367.5080409@ichips.intel.com> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> <454A3476.6090402@ichips.intel.com> <20061102183404.GD13591@mellanox.co.il> <454A4367.5080409@ichips.intel.com> Message-ID: <454A4C0B.1080609@ichips.intel.com> Sean Hefty wrote: >One option is having the SA (or ib_umad?) return a busy status in response to a >MAD, but we'd still have to be able to send this response as quickly as requests >are being received. We could then limit the number of requests that would be >queued in the kernel for a user. 
> > Another great option would be to have path record caching. Unfortunately OFED 1.1 did not include ib_local_sa in the release. From halr at voltaire.com Thu Nov 2 12:11:06 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 15:11:06 -0500 Subject: [openib-general] opensm crash with topspin HCA In-Reply-To: <4df28be40611021033t3d77262fwea56de0377d11c31@mail.gmail.com> References: <4df28be40610311710i4beb19fdp3ecd13cd95db7d17@mail.gmail.com> <4df28be40611021033t3d77262fwea56de0377d11c31@mail.gmail.com> Message-ID: <1162498256.15232.12264.camel@hal.voltaire.com> On Thu, 2006-11-02 at 13:33, Viswanath Krishnamurthy wrote: > > When we run opensm (OFED) release and if a Topspin HCA is in the IB > network, opensm crashes in umad_receiver with NULL pointer exception. > The transaction ID is zero is the MAD'S from topspin HCA on windows. > The crashes seems to random in umad_receiver. What OpenSM version ? There was a problem like this fixed back at the end of August: r8920 | halr | 2006-08-14 09:09:28 -0400 (Mon, 14 Aug 2006) | 11 lines OpenSM/osm_vendor_ibumad.c: In get_madw, check for TID 0 (resolves NULL ptr crash with Cisco stack) This change fixes an OSM crash when working with Cisco's stack. Cisco's stack doesn't follow the same TID convention when generating transaction id which in some bad flow revealed this bug in the get_madw lookup. The bug was in get_madw which does not detect lookup of its reserved "free" entry of key==0.
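The class of bug described in this commit message can be illustrated with a small stand-alone table. This is a hypothetical sketch, not the code from osm_vendor_ibumad.c: the names (`toy_slot`, `toy_lookup_unchecked`, `toy_lookup_checked`) and the table layout are assumptions. It only shows why a lookup keyed on a transaction ID must reject the value that is also used to mark free slots.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8

/* Hypothetical pending-transaction slot; NOT the OpenSM structure. */
struct toy_slot {
    uint64_t tid;    /* transaction ID; 0 marks the slot as free */
    void *madw;      /* payload pointer, NULL while the slot is free */
};

/*
 * Lookup without a guard: a TID of 0 "matches" the first free slot,
 * whose payload is NULL, so the caller would dereference a NULL pointer.
 * Returns the slot index, or -1 if nothing matches.
 */
static int toy_lookup_unchecked(const struct toy_slot *tbl, uint64_t tid)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (tbl[i].tid == tid)
            return i;
    return -1;
}

/* Guarded lookup: the reserved "free" key never matches anything. */
static int toy_lookup_checked(const struct toy_slot *tbl, uint64_t tid)
{
    if (tid == 0)
        return -1;
    return toy_lookup_unchecked(tbl, tid);
}
```

A response arriving with TID 0 finds a free slot in the unchecked version and is treated as not found in the checked one, which is the behaviour the commit message describes adding.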
Signed-off-by: Yevgeny Kliteynik Signed-off-by: Hal Rosenstock -- Hal > > > > HCA found: > > hca_id=InfiniHost0 > > vendor_id=0x02C9 > > vendor_part_id=0x5A44 > > hw_ver=0xA0 > > fw_ver=0x400060000 > > > > > ______________________________________________________________________ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From venkatesh.babu at 3leafnetworks.com Thu Nov 2 13:15:13 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 02 Nov 2006 13:15:13 -0800 Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support In-Reply-To: <000001c6feb5$919bcb80$7ffc070a@amr.corp.intel.com> References: <000001c6feb5$919bcb80$7ffc070a@amr.corp.intel.com> Message-ID: <454A5FE1.4090809@3leafnetworks.com> I have the changes to the steps 6, 9.2 and 11. In step 9.2 ib_cm_init_qp_attr() failed with -22 and then RCQP failed with IB_WC_RETRY_EXC_ERR. VBabu Sean Hefty wrote: >>Let me make the steps clear - >> >> > >This helps - thanks. > > > >> 1. On Passive node register for remote port UP/DOWN event by >>registering with ib_sa_serv_notice_hdlr() >> >> > >FYI - patches for this are being worked separately. > > > >> 2. On Passive node start the listener by calling ib_cm_listen(). >> 3. On Active node create the RC QP and establish the connection by >>calling ib_send_cm_req(). In struct ib_cm_req_param specify both primary >>path (say, through Port1) and alternate path (say, through Port2). >>NOTE:-Assume Port1 of Active node is connected to Port1 of Passive node; >>and Port2 of Active node is connected to Port2 of Passive node. >>NOTE:- After this step QP's path_mig_state will be IB_MIG_ARMED. >> 4. Let us say, Port1 on Active node fails >> 5. IB_EVENT_PORT_ERR event is generated on Active node; and remote >>port error event is generated on Passive node. 
>> 6. In those event handler call ib_qp_modify() to set the >>path_mig_state to IB_MIG_MIGRATED. This will let the HCA's firmware know >>to switch to the alternate path. >> >> > >At least the active side in your scenario should call ib_cm_notify() after this >step. Otherwise, the LAP will go out the primary path, which is down. This >isn't a big deal in your test case, since you wait for the primary path to >return (step 7) before calling ib_send_cm_lap(). > > > >> 7. After a while, Port1 is comes back again. >> 8. IB_EVENT_PORT_ACTIVE event is generated on Active node; and remote >>port active event is generated on Passive node. >> 9. On the Active node from IB_EVENT_PORT_ACTIVE event handler call >>the ib_send_cm_lap() to send the alternate path (through Port1) to the >>Passive node. >> 9.1 Passive node receives the LAP message >> >> > >The proposed patch will record the alternate path when the LAP is sent or >received. (Again, these patches are untested, so there can be some bugs here. >I'm still working on writing a test program to use these interfaces.) > > > >> 9.2 Calls ib_cm_init_rearm_attr() initialize the alternate path info >> >> > >This should now call ib_cm_init_qp_attr(). > > > >> 9.3 Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM >> 9.4 Send APR message back to the Active node. >>10. Active node receives the APR message >>11. Calls ib_cm_init_rearm_attr() initialize the alternate path info >> >> > >This should now call ib_cm_init_qp_attr(). > > > >>12. Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM >>13. Now when a first packet is passed between the Active and Passive >>node the ib_core changes the path_mig_state to the IB_MIG_ARMED. >> 14. Now it is all set for another failover. >> >> > >Using the proposed patches, where did you see a failure? 
>
>- Sean

From mshefty at ichips.intel.com Thu Nov 2 12:57:19 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 02 Nov 2006 12:57:19 -0800
Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support
In-Reply-To: <454A5FE1.4090809@3leafnetworks.com>
References: <000001c6feb5$919bcb80$7ffc070a@amr.corp.intel.com> <454A5FE1.4090809@3leafnetworks.com>
Message-ID: <454A5BAF.9020309@ichips.intel.com>

Venkatesh Babu wrote:
> I have the changes to the steps 6, 9.2 and 11. In step 9.2
> ib_cm_init_qp_attr() failed with -22 and then RCQP failed with
> IB_WC_RETRY_EXC_ERR.

Did you set qp_attr.qp_state = IB_QPS_RTS before calling ib_cm_init_qp_attr()? If not, can you try this?

- Sean

From venkatesh.babu at 3leafnetworks.com Thu Nov 2 13:46:23 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Thu, 02 Nov 2006 13:46:23 -0800
Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support
In-Reply-To: <454A5BAF.9020309@ichips.intel.com>
References: <000001c6feb5$919bcb80$7ffc070a@amr.corp.intel.com> <454A5FE1.4090809@3leafnetworks.com> <454A5BAF.9020309@ichips.intel.com>
Message-ID: <454A672F.5010309@3leafnetworks.com>

I made the change you suggested. On Active node I got the event IB_EVENT_PATH_MIG and then send failed with IB_WC_RETRY_EXC_ERR. On Passive node I got 100 IB_EVENT_PATH_MIG_ERR events.

VBabu

Sean Hefty wrote:

> Venkatesh Babu wrote:
>
>> I have the changes to the steps 6, 9.2 and 11. In step 9.2
>> ib_cm_init_qp_attr() failed with -22 and then RCQP failed with
>> IB_WC_RETRY_EXC_ERR.
>
> Did you set qp_attr.qp_state = IB_QPS_RTS before calling
> ib_cm_init_qp_attr()? If not, can you try this?
>
> - Sean

From halr at voltaire.com Thu Nov 2 14:12:55 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Nov 2006 17:12:55 -0500
Subject: [openib-general] [PATCH TRIVIAL] diags: strip trailing whitespaces
In-Reply-To: <20061102130432.GD16867@sashak.voltaire.com>
References: <20061102130432.GD16867@sashak.voltaire.com>
Message-ID: <1162505573.15232.17519.camel@hal.voltaire.com>

On Thu, 2006-11-02 at 08:04, Sasha Khapyorsky wrote:
> Strip trailing whitespaces in diags.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From somenath at veritas.com Wed Nov 1 14:17:33 2006
From: somenath at veritas.com (somenath)
Date: Wed, 01 Nov 2006 14:17:33 -0800
Subject: [openib-general] local QP error notification
Message-ID: <45491CFD.6090207@veritas.com>

0. After disconnecting a cable (we just used a switch interface to disable a port) for a connected qpair, we get errors when we try to send a packet on that port (this is expected), but:

1. We don't get errors when we try to post a receive. Is this expected?
2. We don't get any of the IB_EVENT_CQ_ERR, IB_EVENT_QP_FATAL (ib_event_type) errors.

a. When is a good time for ULPs to issue a disconnect and destroy qpairs in such cases? Should it be done in step 0? Should they expect steps 1 and 2 to happen?
b. What is expected when we connect the cable back? I assume that for an RC QP we must destroy the connection/QP after such an error, since it will never return to an OK state. Is that right?

thanks, som.

From halr at voltaire.com Thu Nov 2 14:21:20 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Nov 2006 17:21:20 -0500
Subject: [openib-general] [PATCH TRIVIAL] management/libib*: strip trailing whitespaces
In-Reply-To: <20061102130144.GC16867@sashak.voltaire.com>
References: <20061102130144.GC16867@sashak.voltaire.com>
Message-ID: <1162506075.15232.17893.camel@hal.voltaire.com>

On Thu, 2006-11-02 at 08:01, Sasha Khapyorsky wrote:
> Strip trailing whitespaces for libibcommon, libibumad, libibmad.
>
> Signed-off-by: Sasha Khapyorsky

Thanks. Applied.

-- Hal

From rdreier at cisco.com Thu Nov 2 14:28:12 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 02 Nov 2006 14:28:12 -0800
Subject: [openib-general] [PATCH v2] for 2-6-19 rdma/addr: use client registration to fix module unload race
In-Reply-To: <000001c6fd20$950c7440$ff0da8c0@amr.corp.intel.com> (Sean Hefty's message of "Tue, 31 Oct 2006 11:12:59 -0800")
References: <000001c6fd20$950c7440$ff0da8c0@amr.corp.intel.com>
Message-ID:

Thanks, applied.

From ralph.campbell at qlogic.com Thu Nov 2 14:29:30 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 02 Nov 2006 14:29:30 -0800
Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose
Message-ID: <1162506570.29948.567.camel@brick.pathscale.com>

IB/core - Add DMA mapping functions to allow device drivers to interpose

The QLogic InfiniPath HCAs use programmed I/O instead of HW DMA. This patch allows a verbs device driver to interpose on DMA mapping function calls in order to avoid relying on bus_to_virt() and phys_to_virt() to undo the mappings created by dma_map_single(), dma_map_sg(), etc.
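The interposing idea described above can be illustrated in isolation with a minimal userspace sketch. This is not code from the patch: every `sketch_*` name, `pio_ops`, and the `bus_offset` field are hypothetical stand-ins chosen for illustration, and the "default" path only pretends to be a bus mapping. It shows a device carrying an optional table of mapping hooks, with a wrapper that dispatches to the hook when one is installed and to the generic path otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's dma_addr_t; not the real type. */
typedef uint64_t sketch_dma_addr_t;

struct sketch_device;

/* Optional per-driver override table (illustrative only). */
struct sketch_dma_ops {
	sketch_dma_addr_t (*map_single)(struct sketch_device *dev,
					void *ptr, size_t size);
};

struct sketch_device {
	const struct sketch_dma_ops *dma_ops; /* NULL: use the default path */
	sketch_dma_addr_t bus_offset;         /* pretend bus offset (assumption) */
};

/* Default path: pretend the bus address is the CPU address plus an offset. */
static sketch_dma_addr_t default_map_single(struct sketch_device *dev,
					    void *ptr, size_t size)
{
	(void)size;
	return (sketch_dma_addr_t)(uintptr_t)ptr + dev->bus_offset;
}

/* PIO-style driver hook: return the virtual address unchanged, so nothing
 * has to be translated back later. */
static sketch_dma_addr_t pio_map_single(struct sketch_device *dev,
					void *ptr, size_t size)
{
	(void)dev;
	(void)size;
	return (sketch_dma_addr_t)(uintptr_t)ptr;
}

static const struct sketch_dma_ops pio_ops = {
	.map_single = pio_map_single,
};

/* Wrapper: use the driver's hook when present, else the default mapping. */
static sketch_dma_addr_t sketch_map_single(struct sketch_device *dev,
					   void *ptr, size_t size)
{
	return dev->dma_ops ? dev->dma_ops->map_single(dev, ptr, size)
			    : default_map_single(dev, ptr, size);
}
```

With a device whose `dma_ops` is NULL the wrapper returns the offset address, while a device pointing at `pio_ops` gets the original pointer value back, which is why such a driver never needs to reverse a mapping.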
From: Ralph Campbell diff -r f37bd0e41fec include/rdma/ib_verbs.h --- a/include/rdma/ib_verbs.h Thu Oct 26 21:44:41 2006 +0700 +++ b/include/rdma/ib_verbs.h Thu Oct 26 16:10:04 2006 -0800 @@ -43,6 +43,8 @@ #include #include +#include +#include #include #include @@ -846,6 +848,42 @@ struct ib_cache { struct ib_pkey_cache **pkey_cache; struct ib_gid_cache **gid_cache; u8 *lmc_cache; +}; + +struct ib_dma_mapping_ops { + int (*mapping_error)(struct ib_device *dev, + dma_addr_t dma_addr); + dma_addr_t (*map_single)(struct ib_device *dev, + void *ptr, size_t size, + enum dma_data_direction direction); + void (*unmap_single)(struct ib_device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction direction); + dma_addr_t (*map_page)(struct ib_device *dev, + struct page *page, unsigned long offset, + size_t size, + enum dma_data_direction direction); + void (*unmap_page)(struct ib_device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction direction); + int (*map_sg)(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction); + void (*unmap_sg)(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction); + dma_addr_t (*dma_address)(struct ib_device *dev, + struct scatterlist *sg); + unsigned int (*dma_len)(struct ib_device *dev, + struct scatterlist *sg); + void (*sync_single_for_cpu)(struct ib_device *dev, + dma_addr_t dma_handle, + size_t size, + enum dma_data_direction dir); + void (*sync_single_for_device)(struct ib_device *dev, + dma_addr_t dma_handle, + size_t size, + enum dma_data_direction dir); }; struct iw_cm_verbs; @@ -992,6 +1030,8 @@ struct ib_device { struct ib_mad *in_mad, struct ib_mad *out_mad); + struct ib_dma_mapping_ops *dma_ops; + struct module *owner; struct class_device class_dev; struct kobject ports_parent; @@ -1395,8 +1435,182 @@ static inline int ib_req_ncomp_notif(str * usable for DMA. * @pd: The protection domain associated with the memory region. 
* @mr_access_flags: Specifies the memory access rights. + * + * Note that the ib_dma_*() functions defined below must be used + * to create/destroy addresses used with the Lkey or Rkey returned + * by ib_get_dma_mr(). */ struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags); + +/** + * ib_dma_mapping_error - check a dma_addr_t for error + * @device: The device for which the dma_addr was created + * @dma_addr: The DMA address to check + */ +static inline int ib_dma_mapping_error(struct ib_device *dev, + dma_addr_t dma_addr) +{ + return dev->dma_ops ? + dev->dma_ops->mapping_error(dev, dma_addr) : + dma_mapping_error(dma_addr); +} + +/** + * ib_dma_map_single - Map a kernel virtual address to DMA address + * @device: The device for which the dma_addr is to be created + * @cpu_addr: The kernel virtual address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline dma_addr_t ib_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dev->dma_ops ? + dev->dma_ops->map_single(dev, cpu_addr, size, direction) : + dma_map_single(dev->dma_device, cpu_addr, size, direction); +} + +/** + * ib_dma_unmap_single - Destroy a mapping created by ib_dma_map_single() + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline void ib_dma_unmap_single(struct ib_device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction direction) +{ + dev->dma_ops ? 
+ dev->dma_ops->unmap_single(dev, addr, size, direction) : + dma_unmap_single(dev->dma_device, addr, size, direction); +} + +/** + * ib_dma_map_page - Map a physical page to DMA address + * @device: The device for which the dma_addr is to be created + * @page: The page to be mapped + * @offset: The offset within the page + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline dma_addr_t ib_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction) +{ + return dev->dma_ops ? + dev->dma_ops->map_page(dev, page, offset, size, direction) : + dma_map_page(dev->dma_device, page, offset, size, direction); +} + +/** + * ib_dma_unmap_page - Destroy a mapping created by ib_dma_map_page() + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static inline void ib_dma_unmap_page(struct ib_device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction direction) +{ + dev->dma_ops ? + dev->dma_ops->unmap_page(dev, addr, size, direction) : + dma_unmap_page(dev->dma_device, addr, size, direction); +} + +/** + * ib_dma_map_sg - Map a scatter/gather list to DMA addresses + * @device: The device for which the DMA addresses are to be created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +static inline int ib_dma_map_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + return dev->dma_ops ? 
+ dev->dma_ops->map_sg(dev, sg, nents, direction) : + dma_map_sg(dev->dma_device, sg, nents, direction); +} + +/** + * ib_dma_unmap_sg - Unmap a scatter/gather list of DMA addresses + * @device: The device for which the DMA addresses were created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +static inline void ib_dma_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + dev->dma_ops ? + dev->dma_ops->unmap_sg(dev, sg, nents, direction) : + dma_unmap_sg(dev->dma_device, sg, nents, direction); +} + +/** + * ib_sg_dma_address - Return the DMA address from a scatter/gather entry + * @device: The device for which the DMA addresses were created + * @sg: The scatter/gather entry + */ +static inline dma_addr_t ib_sg_dma_address(struct ib_device *dev, + struct scatterlist *sg) +{ + return dev->dma_ops ? + dev->dma_ops->dma_address(dev, sg) : sg_dma_address(sg); +} + +/** + * ib_sg_dma_len - Return the DMA length from a scatter/gather entry + * @device: The device for which the DMA addresses were created + * @sg: The scatter/gather entry + */ +static inline unsigned int ib_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg) +{ + return dev->dma_ops ? + dev->dma_ops->dma_len(dev, sg) : sg_dma_len(sg); +} + +/** + * ib_dma_sync_single_for_cpu - Prepare DMA region to be accessed by CPU + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static inline void ib_dma_sync_single_for_cpu(struct ib_device *dev, + dma_addr_t addr, + size_t size, + enum dma_data_direction dir) +{ + dev->dma_ops ? 
+ dev->dma_ops->sync_single_for_cpu(dev, addr, size, dir) : + dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); +} + +/** + * ib_dma_sync_single_for_device - Prepare DMA region to be accessed by device + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static inline void ib_dma_sync_single_for_device(struct ib_device *dev, + dma_addr_t addr, + size_t size, + enum dma_data_direction dir) +{ + dev->dma_ops ? + dev->dma_ops->sync_single_for_device(dev, addr, size, dir) : + dma_sync_single_for_device(dev->dma_device, addr, size, dir); +} /** * ib_reg_phys_mr - Prepares a virtually addressed memory region for use diff -r f37bd0e41fec drivers/infiniband/core/mad.c --- a/drivers/infiniband/core/mad.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/core/mad.c Thu Oct 26 11:14:51 2006 -0800 @@ -999,16 +999,16 @@ int ib_send_mad(struct ib_mad_send_wr_pr mad_agent = mad_send_wr->send_buf.mad_agent; sge = mad_send_wr->sg_list; - sge[0].addr = dma_map_single(mad_agent->device->dma_device, - mad_send_wr->send_buf.mad, - sge[0].length, - DMA_TO_DEVICE); + sge[0].addr = ib_dma_map_single(mad_agent->device, + mad_send_wr->send_buf.mad, + sge[0].length, + DMA_TO_DEVICE); pci_unmap_addr_set(mad_send_wr, header_mapping, sge[0].addr); - sge[1].addr = dma_map_single(mad_agent->device->dma_device, - ib_get_payload(mad_send_wr), - sge[1].length, - DMA_TO_DEVICE); + sge[1].addr = ib_dma_map_single(mad_agent->device, + ib_get_payload(mad_send_wr), + sge[1].length, + DMA_TO_DEVICE); pci_unmap_addr_set(mad_send_wr, payload_mapping, sge[1].addr); spin_lock_irqsave(&qp_info->send_queue.lock, flags); @@ -1027,12 +1027,14 @@ int ib_send_mad(struct ib_mad_send_wr_pr } spin_unlock_irqrestore(&qp_info->send_queue.lock, flags); if (ret) { - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, header_mapping), - sge[0].length, 
DMA_TO_DEVICE); - dma_unmap_single(mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, payload_mapping), - sge[1].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_agent->device, + pci_unmap_addr(mad_send_wr, + header_mapping), + sge[0].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_agent->device, + pci_unmap_addr(mad_send_wr, + payload_mapping), + sge[1].length, DMA_TO_DEVICE); } return ret; } @@ -1851,11 +1853,11 @@ static void ib_mad_recv_done_handler(str mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header, mad_list); recv = container_of(mad_priv_hdr, struct ib_mad_private, header); - dma_unmap_single(port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - DMA_FROM_DEVICE); + ib_dma_unmap_single(port_priv->device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); /* Setup MAD receive work completion from "normal" work completion */ recv->header.wc = *wc; @@ -2081,12 +2083,12 @@ static void ib_mad_send_done_handler(str qp_info = send_queue->qp_info; retry: - dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, header_mapping), - mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); - dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device, - pci_unmap_addr(mad_send_wr, payload_mapping), - mad_send_wr->sg_list[1].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_send_wr->send_buf.mad_agent->device, + pci_unmap_addr(mad_send_wr, header_mapping), + mad_send_wr->sg_list[0].length, DMA_TO_DEVICE); + ib_dma_unmap_single(mad_send_wr->send_buf.mad_agent->device, + pci_unmap_addr(mad_send_wr, payload_mapping), + mad_send_wr->sg_list[1].length, DMA_TO_DEVICE); queued_send_wr = NULL; spin_lock_irqsave(&send_queue->lock, flags); list_del(&mad_list->list); @@ -2527,12 +2529,11 @@ static int ib_mad_post_receive_mads(stru 
break; } } - sg_list.addr = dma_map_single(qp_info->port_priv-> - device->dma_device, - &mad_priv->grh, - sizeof *mad_priv - - sizeof mad_priv->header, - DMA_FROM_DEVICE); + sg_list.addr = ib_dma_map_single(qp_info->port_priv->device, + &mad_priv->grh, + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr); recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list; mad_priv->header.mad_list.mad_queue = recv_queue; @@ -2548,12 +2549,12 @@ static int ib_mad_post_receive_mads(stru list_del(&mad_priv->header.mad_list.list); recv_queue->count--; spin_unlock_irqrestore(&recv_queue->lock, flags); - dma_unmap_single(qp_info->port_priv->device->dma_device, - pci_unmap_addr(&mad_priv->header, - mapping), - sizeof *mad_priv - - sizeof mad_priv->header, - DMA_FROM_DEVICE); + ib_dma_unmap_single(qp_info->port_priv->device, + pci_unmap_addr(&mad_priv->header, + mapping), + sizeof *mad_priv - + sizeof mad_priv->header, + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, mad_priv); printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret); break; @@ -2585,11 +2586,11 @@ static void cleanup_recv_queue(struct ib /* Remove from posted receive MAD list */ list_del(&mad_list->list); - dma_unmap_single(qp_info->port_priv->device->dma_device, - pci_unmap_addr(&recv->header, mapping), - sizeof(struct ib_mad_private) - - sizeof(struct ib_mad_private_header), - DMA_FROM_DEVICE); + ib_dma_unmap_single(qp_info->port_priv->device, + pci_unmap_addr(&recv->header, mapping), + sizeof(struct ib_mad_private) - + sizeof(struct ib_mad_private_header), + DMA_FROM_DEVICE); kmem_cache_free(ib_mad_cache, recv); } diff -r f37bd0e41fec drivers/infiniband/core/uverbs_mem.c --- a/drivers/infiniband/core/uverbs_mem.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/core/uverbs_mem.c Thu Oct 26 11:15:36 2006 -0800 @@ -52,8 +52,8 @@ static void __ib_umem_release(struct ib_ int i; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, 
list) { - dma_unmap_sg(dev->dma_device, chunk->page_list, - chunk->nents, DMA_BIDIRECTIONAL); + ib_dma_unmap_sg(dev, chunk->page_list, + chunk->nents, DMA_BIDIRECTIONAL); for (i = 0; i < chunk->nents; ++i) { if (umem->writable && dirty) set_page_dirty_lock(chunk->page_list[i].page); @@ -136,10 +136,10 @@ int ib_umem_get(struct ib_device *dev, s chunk->page_list[i].length = PAGE_SIZE; } - chunk->nmap = dma_map_sg(dev->dma_device, - &chunk->page_list[0], - chunk->nents, - DMA_BIDIRECTIONAL); + chunk->nmap = ib_dma_map_sg(dev, + &chunk->page_list[0], + chunk->nents, + DMA_BIDIRECTIONAL); if (chunk->nmap <= 0) { for (i = 0; i < chunk->nents; ++i) put_page(chunk->page_list[i].page); From ralph.campbell at qlogic.com Thu Nov 2 14:30:26 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 14:30:26 -0800 Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions Message-ID: <1162506626.29948.568.camel@brick.pathscale.com> IB/ipath - Implement new verbs DMA mapping functions This patch implements the interposing DMA mapping functions to allow support for IOMMUs and remove the dependence on phys_to_virt(). 
From: Ralph Campbell diff -r f37bd0e41fec drivers/infiniband/hw/ipath/Makefile --- a/drivers/infiniband/hw/ipath/Makefile Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/hw/ipath/Makefile Thu Oct 26 11:16:16 2006 -0800 @@ -6,6 +6,7 @@ ib_ipath-y := \ ib_ipath-y := \ ipath_cq.o \ ipath_diag.o \ + ipath_dma.o \ ipath_driver.o \ ipath_eeprom.o \ ipath_file_ops.o \ diff -r f37bd0e41fec drivers/infiniband/hw/ipath/ipath_keys.c --- a/drivers/infiniband/hw/ipath/ipath_keys.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_keys.c Fri Oct 27 16:22:43 2006 -0800 @@ -134,7 +134,7 @@ int ipath_lkey_ok(struct ipath_qp *qp, s */ if (sge->lkey == 0) { isge->mr = NULL; - isge->vaddr = bus_to_virt(sge->addr); + isge->vaddr = (void *) sge->addr; isge->length = sge->length; isge->sge_length = sge->length; ret = 1; @@ -202,12 +202,12 @@ int ipath_rkey_ok(struct ipath_qp *qp, s int ret; /* - * We use RKEY == zero for physical addresses - * (see ipath_get_dma_mr). + * We use RKEY == zero for kernel virtual addresses + * (see ipath_get_dma_mr and ipath_dma.c). */ if (rkey == 0) { sge->mr = NULL; - sge->vaddr = phys_to_virt(vaddr); + sge->vaddr = (void *) vaddr; sge->length = len; sge->sge_length = len; ss->sg_list = NULL; diff -r f37bd0e41fec drivers/infiniband/hw/ipath/ipath_mr.c --- a/drivers/infiniband/hw/ipath/ipath_mr.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_mr.c Thu Oct 26 13:35:12 2006 -0800 @@ -54,6 +54,8 @@ static inline struct ipath_fmr *to_ifmr( * @acc: access flags * * Returns the memory region on success, otherwise returns an errno. + * Note that all DMA addresses should be created via the + * struct ib_dma_mapping_ops functions (see ipath_dma.c). 
*/ struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc) { @@ -149,8 +151,7 @@ struct ib_mr *ipath_reg_phys_mr(struct i m = 0; n = 0; for (i = 0; i < num_phys_buf; i++) { - mr->mr.map[m]->segs[n].vaddr = - phys_to_virt(buffer_list[i].addr); + mr->mr.map[m]->segs[n].vaddr = (void *) buffer_list[i].addr; mr->mr.map[m]->segs[n].length = buffer_list[i].size; mr->mr.length += buffer_list[i].size; n++; @@ -347,7 +348,7 @@ int ipath_map_phys_fmr(struct ib_fmr *ib n = 0; ps = 1 << fmr->page_shift; for (i = 0; i < list_len; i++) { - fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]); + fmr->mr.map[m]->segs[n].vaddr = (void *) page_list[i]; fmr->mr.map[m]->segs[n].length = ps; if (++n == IPATH_SEGSZ) { m++; diff -r f37bd0e41fec drivers/infiniband/hw/ipath/ipath_verbs.c --- a/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c Thu Oct 26 11:17:23 2006 -0800 @@ -1599,6 +1599,7 @@ int ipath_register_ib_device(struct ipat dev->detach_mcast = ipath_multicast_detach; dev->process_mad = ipath_process_mad; dev->mmap = ipath_mmap; + dev->dma_ops = &ipath_dma_mapping_ops; snprintf(dev->node_desc, sizeof(dev->node_desc), IPATH_IDSTR " %s", init_utsname()->nodename); diff -r f37bd0e41fec drivers/infiniband/hw/ipath/ipath_verbs.h --- a/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h Thu Oct 26 11:17:38 2006 -0800 @@ -812,4 +812,6 @@ extern unsigned int ib_ipath_max_srq_wrs extern const u32 ib_ipath_rnr_table[]; +extern struct ib_dma_mapping_ops ipath_dma_mapping_ops; + #endif /* IPATH_VERBS_H */ diff -r f37bd0e41fec drivers/infiniband/hw/ipath/ipath_dma.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Fri Oct 27 10:40:03 2006 -0800 @@ -0,0 +1,229 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include + +#include "ipath_verbs.h" + +#define BAD_DMA_ADDRESS ((dma_addr_t) 0) + +/** + * ipath_dma_mapping_error - check a dma_addr_t for error + * @device: The device for which the dma_addr was created + * @dma_addr: The DMA address to check + */ +static int ipath_mapping_error(struct ib_device *dev, dma_addr_t dma_addr) +{ + return dma_addr == BAD_DMA_ADDRESS; +} + +/** + * ipath_dma_map_single - Map a kernel virtual address to DMA address + * @device: The device for which the dma_addr is to be created + * @cpu_addr: The kernel virtual address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static dma_addr_t ipath_dma_map_single(struct ib_device *dev, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(direction == DMA_NONE); + return (dma_addr_t) cpu_addr; +} + +/** + * ipath_dma_unmap_single - Destroy a mapping created by ipath_dma_map_single() + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static void ipath_dma_unmap_single(struct ib_device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(direction == DMA_NONE); +} + +/** + * ipath_dma_map_page - Map a physical page to DMA address + * @device: The device for which the dma_addr is to be created + * @page: The page to be mapped + * @offset: The offset within the page + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static dma_addr_t ipath_dma_map_page(struct ib_device *dev, + struct page *page, + unsigned long offset, + size_t size, + enum dma_data_direction direction) +{ + dma_addr_t addr; + + BUG_ON(direction == DMA_NONE); + + if (offset + size > PAGE_SIZE) { + addr = BAD_DMA_ADDRESS; + goto done; + } + + addr = (dma_addr_t) page_address(page); + if (addr) + addr += offset; + /* TODO: handle highmem pages */ + +done: + 
return addr; +} + +/** + * ipath_dma_unmap_page - Destroy a mapping created by ipath_dma_map_page() + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @direction: The direction of the DMA + */ +static void ipath_dma_unmap_page(struct ib_device *dev, + dma_addr_t addr, size_t size, + enum dma_data_direction direction) +{ + BUG_ON(direction == DMA_NONE); +} + +/** + * ipath_map_sg - Map a scatter/gather list to DMA addresses + * @device: The device for which the DMA addresses are to be created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + dma_addr_t addr; + int i; + int ret = nents; + + BUG_ON(direction == DMA_NONE); + + for (i = 0; i < nents; i++) { + addr = (dma_addr_t) page_address(sg[i].page); + /* TODO: handle highmem pages */ + if (!addr) { + ret = 0; + break; + } + } + return ret; +} + +/** + * ipath_unmap_sg - Unmap a scatter/gather list of DMA addresses + * @device: The device for which the DMA addresses were created + * @sg: The array of scatter/gather entries + * @nents: The number of scatter/gather entries + * @direction: The direction of the DMA + */ +static void ipath_unmap_sg(struct ib_device *dev, + struct scatterlist *sg, int nents, + enum dma_data_direction direction) +{ + BUG_ON(direction == DMA_NONE); +} + +/** + * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry + * @device: The device for which the DMA addresses were created + * @sg: The scatter/gather entry + */ +static dma_addr_t ipath_sg_dma_address(struct ib_device *dev, + struct scatterlist *sg) +{ + return (dma_addr_t) page_address(sg->page); +} + +/** + * ipath_sg_dma_len - Return the DMA length from a scatter/gather entry + * @device: The device for which the DMA 
addresses were created + * @sg: The scatter/gather entry + */ +static unsigned int ipath_sg_dma_len(struct ib_device *dev, + struct scatterlist *sg) +{ + return sg->length; +} + +/** + * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static void ipath_sync_single_for_cpu(struct ib_device *dev, + dma_addr_t addr, + size_t size, + enum dma_data_direction dir) +{ + dma_sync_single_for_cpu(dev->dma_device, addr, size, dir); +} + +/** + * ipath_sync_single_for_device - Prepare DMA region to be accessed by device + * @device: The device for which the DMA address was created + * @addr: The DMA address + * @size: The size of the region in bytes + * @dir: The direction of the DMA + */ +static void ipath_sync_single_for_device(struct ib_device *dev, + dma_addr_t addr, + size_t size, + enum dma_data_direction dir) +{ + dma_sync_single_for_device(dev->dma_device, addr, size, dir); +} + +struct ib_dma_mapping_ops ipath_dma_mapping_ops = { + ipath_mapping_error, + ipath_dma_map_single, + ipath_dma_unmap_single, + ipath_dma_map_page, + ipath_dma_unmap_page, + ipath_map_sg, + ipath_unmap_sg, + ipath_sg_dma_address, + ipath_sg_dma_len, + ipath_sync_single_for_cpu, + ipath_sync_single_for_device +}; From rdreier at cisco.com Thu Nov 2 14:28:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Nov 2006 14:28:04 -0800 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This includes various fixes for 2.6.19-rc5: Erez Zilber (1): IB/iser: Start connection after enabling iSER Jack Morgenstein (1): IB/uverbs: Return sq_draining 
value in query_qp response Krishna Kumar (1): RDMA/cma: rdma_bind_addr() leaks a cma_dev reference count Michael S. Tsirkin (1): IB/mthca: Fix MAD extended header format for MAD_IFC firmware command Paul Mackerras (1): IB/ehca: Fix eHCA driver compilation for uniprocessor Sean Hefty (1): RDMA/addr: Use client registration to fix module unload race Steve Wise (2): IB/amso1100: Use dma_alloc_coherent() instead of kmalloc/dma_map_single IB/amso1100: Fix incorrect pr_debug() drivers/infiniband/core/addr.c | 28 ++++++++++++++- drivers/infiniband/core/cma.c | 31 +++++++++++----- drivers/infiniband/core/uverbs_cmd.c | 2 +- drivers/infiniband/hw/amso1100/c2_alloc.c | 13 +++---- drivers/infiniband/hw/amso1100/c2_cq.c | 18 +++------ drivers/infiniband/hw/amso1100/c2_rnic.c | 56 ++++++++++++----------------- drivers/infiniband/hw/ehca/ehca_tools.h | 1 + drivers/infiniband/hw/mthca/mthca_cmd.c | 14 ++++---- drivers/infiniband/ulp/iser/iscsi_iser.c | 4 +- include/rdma/ib_addr.h | 20 ++++++++++- include/rdma/ib_user_verbs.h | 2 +- 11 files changed, 114 insertions(+), 75 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 60d3fbd..e11187e 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -47,6 +47,7 @@ struct addr_req { struct sockaddr src_addr; struct sockaddr dst_addr; struct rdma_dev_addr *addr; + struct rdma_addr_client *client; void *context; void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context); @@ -61,6 +62,26 @@ static LIST_HEAD(req_list); static DECLARE_WORK(work, process_req, NULL); static struct workqueue_struct *addr_wq; +void rdma_addr_register_client(struct rdma_addr_client *client) +{ + atomic_set(&client->refcount, 1); + init_completion(&client->comp); +} +EXPORT_SYMBOL(rdma_addr_register_client); + +static inline void put_client(struct rdma_addr_client *client) +{ + if (atomic_dec_and_test(&client->refcount)) + complete(&client->comp); +} + 
+void rdma_addr_unregister_client(struct rdma_addr_client *client) +{ + put_client(client); + wait_for_completion(&client->comp); +} +EXPORT_SYMBOL(rdma_addr_unregister_client); + int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, const unsigned char *dst_dev_addr) { @@ -229,6 +250,7 @@ static void process_req(void *data) list_del(&req->list); req->callback(req->status, &req->src_addr, req->addr, req->context); + put_client(req->client); kfree(req); } } @@ -264,7 +286,8 @@ static int addr_resolve_local(struct soc return ret; } -int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, +int rdma_resolve_ip(struct rdma_addr_client *client, + struct sockaddr *src_addr, struct sockaddr *dst_addr, struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context), @@ -285,6 +308,8 @@ int rdma_resolve_ip(struct sockaddr *src req->addr = addr; req->callback = callback; req->context = context; + req->client = client; + atomic_inc(&client->refcount); src_in = (struct sockaddr_in *) &req->src_addr; dst_in = (struct sockaddr_in *) &req->dst_addr; @@ -305,6 +330,7 @@ int rdma_resolve_ip(struct sockaddr *src break; default: ret = req->status; + atomic_dec(&client->refcount); kfree(req); break; } diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 9ae4f3a..845090b 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -63,6 +63,7 @@ static struct ib_client cma_client = { }; static struct ib_sa_client sa_client; +static struct rdma_addr_client addr_client; static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); @@ -1625,8 +1626,8 @@ int rdma_resolve_addr(struct rdma_cm_id if (cma_any_addr(dst_addr)) ret = cma_resolve_loopback(id_priv); else - ret = rdma_resolve_ip(&id->route.addr.src_addr, dst_addr, - &id->route.addr.dev_addr, + ret = rdma_resolve_ip(&addr_client, 
&id->route.addr.src_addr, + dst_addr, &id->route.addr.dev_addr, timeout_ms, addr_handler, id_priv); if (ret) goto err; @@ -1762,22 +1763,29 @@ int rdma_bind_addr(struct rdma_cm_id *id if (!cma_any_addr(addr)) { ret = rdma_translate_ip(addr, &id->route.addr.dev_addr); - if (!ret) { - mutex_lock(&lock); - ret = cma_acquire_dev(id_priv); - mutex_unlock(&lock); - } if (ret) - goto err; + goto err1; + + mutex_lock(&lock); + ret = cma_acquire_dev(id_priv); + mutex_unlock(&lock); + if (ret) + goto err1; } memcpy(&id->route.addr.src_addr, addr, ip_addr_size(addr)); ret = cma_get_port(id_priv); if (ret) - goto err; + goto err2; return 0; -err: +err2: + if (!cma_any_addr(addr)) { + mutex_lock(&lock); + cma_detach_from_dev(id_priv); + mutex_unlock(&lock); + } +err1: cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_IDLE); return ret; } @@ -2210,6 +2218,7 @@ static int cma_init(void) return -ENOMEM; ib_sa_register_client(&sa_client); + rdma_addr_register_client(&addr_client); ret = ib_register_client(&cma_client); if (ret) @@ -2217,6 +2226,7 @@ static int cma_init(void) return 0; err: + rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); return ret; @@ -2225,6 +2235,7 @@ err: static void cma_cleanup(void) { ib_unregister_client(&cma_client); + rdma_addr_unregister_client(&addr_client); ib_sa_unregister_client(&sa_client); destroy_workqueue(cma_wq); idr_destroy(&sdp_ps); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index b72c7f6..743247e 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -1214,7 +1214,7 @@ ssize_t ib_uverbs_query_qp(struct ib_uve resp.qp_access_flags = attr->qp_access_flags; resp.pkey_index = attr->pkey_index; resp.alt_pkey_index = attr->alt_pkey_index; - resp.en_sqd_async_notify = attr->en_sqd_async_notify; + resp.sq_draining = attr->sq_draining; resp.max_rd_atomic = attr->max_rd_atomic; resp.max_dest_rd_atomic = 
attr->max_dest_rd_atomic; resp.min_rnr_timer = attr->min_rnr_timer; diff --git a/drivers/infiniband/hw/amso1100/c2_alloc.c b/drivers/infiniband/hw/amso1100/c2_alloc.c index 028a60b..0315f99 100644 --- a/drivers/infiniband/hw/amso1100/c2_alloc.c +++ b/drivers/infiniband/hw/amso1100/c2_alloc.c @@ -42,13 +42,14 @@ static int c2_alloc_mqsp_chunk(struct c2 { int i; struct sp_chunk *new_head; + dma_addr_t dma_addr; - new_head = (struct sp_chunk *) __get_free_page(gfp_mask); + new_head = dma_alloc_coherent(&c2dev->pcidev->dev, PAGE_SIZE, + &dma_addr, gfp_mask); if (new_head == NULL) return -ENOMEM; - new_head->dma_addr = dma_map_single(c2dev->ibdev.dma_device, new_head, - PAGE_SIZE, DMA_FROM_DEVICE); + new_head->dma_addr = dma_addr; pci_unmap_addr_set(new_head, mapping, new_head->dma_addr); new_head->next = NULL; @@ -80,10 +81,8 @@ void c2_free_mqsp_pool(struct c2_dev *c2 while (root) { next = root->next; - dma_unmap_single(c2dev->ibdev.dma_device, - pci_unmap_addr(root, mapping), PAGE_SIZE, - DMA_FROM_DEVICE); - __free_page((struct page *) root); + dma_free_coherent(&c2dev->pcidev->dev, PAGE_SIZE, root, + pci_unmap_addr(root, mapping)); root = next; } } diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 9d7bcc5..05c9154 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -246,20 +246,17 @@ int c2_arm_cq(struct ib_cq *ibcq, enum i static void c2_free_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq) { - - dma_unmap_single(c2dev->ibdev.dma_device, pci_unmap_addr(mq, mapping), - mq->q_size * mq->msg_size, DMA_FROM_DEVICE); - free_pages((unsigned long) mq->msg_pool.host, - get_order(mq->q_size * mq->msg_size)); + dma_free_coherent(&c2dev->pcidev->dev, mq->q_size * mq->msg_size, + mq->msg_pool.host, pci_unmap_addr(mq, mapping)); } static int c2_alloc_cq_buf(struct c2_dev *c2dev, struct c2_mq *mq, int q_size, int msg_size) { - unsigned long pool_start; + u8 *pool_start; - pool_start = 
__get_free_pages(GFP_KERNEL, - get_order(q_size * msg_size)); + pool_start = dma_alloc_coherent(&c2dev->pcidev->dev, q_size * msg_size, + &mq->host_dma, GFP_KERNEL); if (!pool_start) return -ENOMEM; @@ -267,13 +264,10 @@ static int c2_alloc_cq_buf(struct c2_dev 0, /* index (currently unknown) */ q_size, msg_size, - (u8 *) pool_start, + pool_start, NULL, /* peer (currently unknown) */ C2_MQ_HOST_TARGET); - mq->host_dma = dma_map_single(c2dev->ibdev.dma_device, - (void *)pool_start, - q_size * msg_size, DMA_FROM_DEVICE); pci_unmap_addr_set(mq, mapping, mq->host_dma); return 0; diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c index 30409e1..21d9612 100644 --- a/drivers/infiniband/hw/amso1100/c2_rnic.c +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -517,14 +517,12 @@ int c2_rnic_init(struct c2_dev *c2dev) /* Initialize the Verbs Reply Queue */ qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_QSIZE)); msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q1_MSGSIZE)); - q1_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + q1_pages = dma_alloc_coherent(&c2dev->pcidev->dev, qsize * msgsize, + &c2dev->rep_vq.host_dma, GFP_KERNEL); if (!q1_pages) { err = -ENOMEM; goto bail1; } - c2dev->rep_vq.host_dma = dma_map_single(c2dev->ibdev.dma_device, - (void *)q1_pages, qsize * msgsize, - DMA_FROM_DEVICE); pci_unmap_addr_set(&c2dev->rep_vq, mapping, c2dev->rep_vq.host_dma); pr_debug("%s rep_vq va %p dma %llx\n", __FUNCTION__, q1_pages, (unsigned long long) c2dev->rep_vq.host_dma); @@ -540,17 +538,15 @@ int c2_rnic_init(struct c2_dev *c2dev) /* Initialize the Asynchronus Event Queue */ qsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_QSIZE)); msgsize = be32_to_cpu(readl(mmio_regs + C2_REGS_Q2_MSGSIZE)); - q2_pages = kmalloc(qsize * msgsize, GFP_KERNEL); + q2_pages = dma_alloc_coherent(&c2dev->pcidev->dev, qsize * msgsize, + &c2dev->aeq.host_dma, GFP_KERNEL); if (!q2_pages) { err = -ENOMEM; goto bail2; } - c2dev->aeq.host_dma = 
dma_map_single(c2dev->ibdev.dma_device, - (void *)q2_pages, qsize * msgsize, - DMA_FROM_DEVICE); pci_unmap_addr_set(&c2dev->aeq, mapping, c2dev->aeq.host_dma); - pr_debug("%s aeq va %p dma %llx\n", __FUNCTION__, q1_pages, - (unsigned long long) c2dev->rep_vq.host_dma); + pr_debug("%s aeq va %p dma %llx\n", __FUNCTION__, q2_pages, + (unsigned long long) c2dev->aeq.host_dma); c2_mq_rep_init(&c2dev->aeq, 2, qsize, @@ -597,17 +593,13 @@ int c2_rnic_init(struct c2_dev *c2dev) bail4: vq_term(c2dev); bail3: - dma_unmap_single(c2dev->ibdev.dma_device, - pci_unmap_addr(&c2dev->aeq, mapping), - c2dev->aeq.q_size * c2dev->aeq.msg_size, - DMA_FROM_DEVICE); - kfree(q2_pages); + dma_free_coherent(&c2dev->pcidev->dev, + c2dev->aeq.q_size * c2dev->aeq.msg_size, + q2_pages, pci_unmap_addr(&c2dev->aeq, mapping)); bail2: - dma_unmap_single(c2dev->ibdev.dma_device, - pci_unmap_addr(&c2dev->rep_vq, mapping), - c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, - DMA_FROM_DEVICE); - kfree(q1_pages); + dma_free_coherent(&c2dev->pcidev->dev, + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + q1_pages, pci_unmap_addr(&c2dev->rep_vq, mapping)); bail1: c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); bail0: @@ -640,19 +632,17 @@ void c2_rnic_term(struct c2_dev *c2dev) /* Free the verbs request allocator */ vq_term(c2dev); - /* Unmap and free the asynchronus event queue */ - dma_unmap_single(c2dev->ibdev.dma_device, - pci_unmap_addr(&c2dev->aeq, mapping), - c2dev->aeq.q_size * c2dev->aeq.msg_size, - DMA_FROM_DEVICE); - kfree(c2dev->aeq.msg_pool.host); - - /* Unmap and free the verbs reply queue */ - dma_unmap_single(c2dev->ibdev.dma_device, - pci_unmap_addr(&c2dev->rep_vq, mapping), - c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, - DMA_FROM_DEVICE); - kfree(c2dev->rep_vq.msg_pool.host); + /* Free the asynchronus event queue */ + dma_free_coherent(&c2dev->pcidev->dev, + c2dev->aeq.q_size * c2dev->aeq.msg_size, + c2dev->aeq.msg_pool.host, + pci_unmap_addr(&c2dev->aeq, mapping)); + + /* Free the 
verbs reply queue */ + dma_free_coherent(&c2dev->pcidev->dev, + c2dev->rep_vq.q_size * c2dev->rep_vq.msg_size, + c2dev->rep_vq.msg_pool.host, + pci_unmap_addr(&c2dev->rep_vq, mapping)); /* Free the MQ shared pointer pool */ c2_free_mqsp_pool(c2dev, c2dev->kern_mqsp_pool); diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 809da3e..973c4b5 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -63,6 +63,7 @@ #include #include #include #include +#include extern int ehca_debug_level; diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index 99a94d7..768df72 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1820,11 +1820,11 @@ int mthca_MAD_IFC(struct mthca_dev *dev, #define MAD_IFC_BOX_SIZE 0x400 #define MAD_IFC_MY_QPN_OFFSET 0x100 -#define MAD_IFC_RQPN_OFFSET 0x104 -#define MAD_IFC_SL_OFFSET 0x108 -#define MAD_IFC_G_PATH_OFFSET 0x109 -#define MAD_IFC_RLID_OFFSET 0x10a -#define MAD_IFC_PKEY_OFFSET 0x10e +#define MAD_IFC_RQPN_OFFSET 0x108 +#define MAD_IFC_SL_OFFSET 0x10c +#define MAD_IFC_G_PATH_OFFSET 0x10d +#define MAD_IFC_RLID_OFFSET 0x10e +#define MAD_IFC_PKEY_OFFSET 0x112 #define MAD_IFC_GRH_OFFSET 0x140 inmailbox = mthca_alloc_mailbox(dev, GFP_KERNEL); @@ -1862,7 +1862,7 @@ #define MAD_IFC_GRH_OFFSET 0x140 val = in_wc->dlid_path_bits | (in_wc->wc_flags & IB_WC_GRH ? 
0x80 : 0); - MTHCA_PUT(inbox, val, MAD_IFC_GRH_OFFSET); + MTHCA_PUT(inbox, val, MAD_IFC_G_PATH_OFFSET); MTHCA_PUT(inbox, in_wc->slid, MAD_IFC_RLID_OFFSET); MTHCA_PUT(inbox, in_wc->pkey_index, MAD_IFC_PKEY_OFFSET); @@ -1870,7 +1870,7 @@ #define MAD_IFC_GRH_OFFSET 0x140 if (in_grh) memcpy(inbox + MAD_IFC_GRH_OFFSET, in_grh, 40); - op_modifier |= 0x10; + op_modifier |= 0x4; in_modifier |= in_wc->slid << 16; } diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index eb6f98d..9b2041e 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -363,11 +363,11 @@ iscsi_iser_conn_start(struct iscsi_cls_c struct iscsi_conn *conn = cls_conn->dd_data; int err; - err = iscsi_conn_start(cls_conn); + err = iser_conn_set_full_featured_mode(conn); if (err) return err; - return iser_conn_set_full_featured_mode(conn); + return iscsi_conn_start(cls_conn); } static struct iscsi_transport iscsi_iser_transport; diff --git a/include/rdma/ib_addr.h b/include/rdma/ib_addr.h index 81b6230..c094e50 100644 --- a/include/rdma/ib_addr.h +++ b/include/rdma/ib_addr.h @@ -36,6 +36,22 @@ #include #include #include +struct rdma_addr_client { + atomic_t refcount; + struct completion comp; +}; + +/** + * rdma_addr_register_client - Register an address client. + */ +void rdma_addr_register_client(struct rdma_addr_client *client); + +/** + * rdma_addr_unregister_client - Deregister an address client. + * @client: Client object to deregister. + */ +void rdma_addr_unregister_client(struct rdma_addr_client *client); + struct rdma_dev_addr { unsigned char src_dev_addr[MAX_ADDR_LEN]; unsigned char dst_dev_addr[MAX_ADDR_LEN]; @@ -52,6 +68,7 @@ int rdma_translate_ip(struct sockaddr *a /** * rdma_resolve_ip - Resolve source and destination IP addresses to * RDMA hardware addresses. + * @client: Address client associated with request. * @src_addr: An optional source address to use in the resolution. 
If a * source address is not provided, a usable address will be returned via * the callback. @@ -64,7 +81,8 @@ int rdma_translate_ip(struct sockaddr *a * or been canceled. A status of 0 indicates success. * @context: User-specified context associated with the call. */ -int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr, +int rdma_resolve_ip(struct rdma_addr_client *client, + struct sockaddr *src_addr, struct sockaddr *dst_addr, struct rdma_dev_addr *addr, int timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context), diff --git a/include/rdma/ib_user_verbs.h b/include/rdma/ib_user_verbs.h index db1b814..64a721f 100644 --- a/include/rdma/ib_user_verbs.h +++ b/include/rdma/ib_user_verbs.h @@ -458,7 +458,7 @@ struct ib_uverbs_query_qp_resp { __u8 cur_qp_state; __u8 path_mtu; __u8 path_mig_state; - __u8 en_sqd_async_notify; + __u8 sq_draining; __u8 max_rd_atomic; __u8 max_dest_rd_atomic; __u8 min_rnr_timer; From ralph.campbell at qlogic.com Thu Nov 2 14:32:56 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 14:32:56 -0800 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions Message-ID: <1162506776.29948.572.camel@brick.pathscale.com> IB/ipoib - Use the new verbs DMA mapping functions This patch converts IPoIB to use the new DMA mapping functions for kernel verbs consumers. 
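For readers following the conversion, here is a minimal user-space sketch of the dispatch pattern the new wrappers appear to rely on. It assumes the wrappers consult a per-device ops table (like the ipath_dma_mapping_ops table posted earlier in this thread) and otherwise fall back to the generic DMA path. Every sketch_* name, the bus_offset field, and the "address 0 means error" rule are hypothetical stand-ins for illustration; this is not the code from the patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are the kernel's definitions. */
typedef uint64_t sketch_dma_addr_t;

struct sketch_ib_device;

/* Per-device table of mapping hooks a driver may supply. */
struct sketch_dma_ops {
	sketch_dma_addr_t (*map_single)(struct sketch_ib_device *dev,
					void *cpu_addr, size_t size);
	int (*mapping_error)(struct sketch_ib_device *dev,
			     sketch_dma_addr_t addr);
};

struct sketch_ib_device {
	const struct sketch_dma_ops *dma_ops;	/* NULL: use the generic path */
	sketch_dma_addr_t bus_offset;		/* pretend bus translation */
};

/* Stand-in for the generic dma_map_single() path (assumption). */
static sketch_dma_addr_t sketch_generic_map_single(struct sketch_ib_device *dev,
						   void *cpu_addr, size_t size)
{
	(void)size;
	return (sketch_dma_addr_t)(uintptr_t)cpu_addr + dev->bus_offset;
}

/* Wrapper: prefer the driver's hook, else fall back to the generic path. */
static sketch_dma_addr_t sketch_ib_dma_map_single(struct sketch_ib_device *dev,
						  void *cpu_addr, size_t size)
{
	assert(dev != NULL);
	if (dev->dma_ops)
		return dev->dma_ops->map_single(dev, cpu_addr, size);
	return sketch_generic_map_single(dev, cpu_addr, size);
}

/* Wrapper for the error check; "0 means failure" is an assumption here. */
static int sketch_ib_dma_mapping_error(struct sketch_ib_device *dev,
				       sketch_dma_addr_t addr)
{
	assert(dev != NULL);
	if (dev->dma_ops)
		return dev->dma_ops->mapping_error(dev, addr);
	return addr == 0;
}

/* A software-only driver can hand back the CPU address unchanged. */
static sketch_dma_addr_t sketch_virt_map_single(struct sketch_ib_device *dev,
						void *cpu_addr, size_t size)
{
	(void)dev;
	(void)size;
	return (sketch_dma_addr_t)(uintptr_t)cpu_addr;
}

static int sketch_virt_mapping_error(struct sketch_ib_device *dev,
				     sketch_dma_addr_t addr)
{
	(void)dev;
	return addr == 0;
}

static const struct sketch_dma_ops sketch_virt_ops = {
	.map_single = sketch_virt_map_single,
	.mapping_error = sketch_virt_mapping_error,
};
```

The point for a ULP such as IPoIB is that it passes the ib_device itself rather than reaching into dma_device, so a driver that does no hardware DMA can substitute its own notion of a "DMA address" without the caller changing.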
From: Ralph Campbell diff -r f37bd0e41fec drivers/infiniband/ulp/ipoib/ipoib_ib.c --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 12:37:09 2006 -0800 @@ -109,9 +109,8 @@ static int ipoib_ib_post_receive(struct ret = ib_post_recv(priv->qp, &param, &bad_wr); if (unlikely(ret)) { ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); - dma_unmap_single(priv->ca->dma_device, - priv->rx_ring[id].mapping, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + ib_dma_unmap_single(priv->ca, priv->rx_ring[id].mapping, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); dev_kfree_skb_any(priv->rx_ring[id].skb); priv->rx_ring[id].skb = NULL; } @@ -136,10 +135,9 @@ static int ipoib_alloc_rx_skb(struct net */ skb_reserve(skb, 4); - addr = dma_map_single(priv->ca->dma_device, - skb->data, IPOIB_BUF_SIZE, - DMA_FROM_DEVICE); - if (unlikely(dma_mapping_error(addr))) { + addr = ib_dma_map_single(priv->ca, skb->data, IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { dev_kfree_skb_any(skb); return -EIO; } @@ -193,8 +191,8 @@ static void ipoib_ib_handle_rx_wc(struct ipoib_warn(priv, "failed recv event " "(status=%d, wrid=%d vend_err %x)\n", wc->status, wr_id, wc->vendor_err); - dma_unmap_single(priv->ca->dma_device, addr, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + ib_dma_unmap_single(priv->ca, addr, + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); dev_kfree_skb_any(skb); priv->rx_ring[wr_id].skb = NULL; return; @@ -212,8 +210,7 @@ static void ipoib_ib_handle_rx_wc(struct ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", wc->byte_len, wc->slid); - dma_unmap_single(priv->ca->dma_device, addr, - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); + ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); skb_put(skb, wc->byte_len); skb_pull(skb, IB_GRH_BYTES); @@ -261,10 +258,8 @@ static void ipoib_ib_handle_tx_wc(struct tx_req = &priv->tx_ring[wr_id]; - dma_unmap_single(priv->ca->dma_device, - 
pci_unmap_addr(tx_req, mapping), - tx_req->skb->len, - DMA_TO_DEVICE); + ib_dma_unmap_single(priv->ca, pci_unmap_addr(tx_req, mapping), + tx_req->skb->len, DMA_TO_DEVICE); ++priv->stats.tx_packets; priv->stats.tx_bytes += tx_req->skb->len; @@ -353,9 +348,9 @@ void ipoib_send(struct net_device *dev, */ tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; tx_req->skb = skb; - addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, - DMA_TO_DEVICE); - if (unlikely(dma_mapping_error(addr))) { + addr = ib_dma_map_single(priv->ca, skb->data, skb->len, + DMA_TO_DEVICE); + if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { ++priv->stats.tx_errors; dev_kfree_skb_any(skb); return; @@ -366,8 +361,7 @@ void ipoib_send(struct net_device *dev, address->ah, qpn, addr, skb->len))) { ipoib_warn(priv, "post_send failed\n"); ++priv->stats.tx_errors; - dma_unmap_single(priv->ca->dma_device, addr, skb->len, - DMA_TO_DEVICE); + ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); dev_kfree_skb_any(skb); } else { dev->trans_start = jiffies; @@ -537,24 +531,28 @@ int ipoib_ib_dev_stop(struct net_device while ((int) priv->tx_tail - (int) priv->tx_head < 0) { tx_req = &priv->tx_ring[priv->tx_tail & (ipoib_sendq_size - 1)]; - dma_unmap_single(priv->ca->dma_device, - pci_unmap_addr(tx_req, mapping), - tx_req->skb->len, - DMA_TO_DEVICE); + ib_dma_unmap_single(priv->ca, + pci_unmap_addr(tx_req, + mapping), + tx_req->skb->len, + DMA_TO_DEVICE); dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; } - for (i = 0; i < ipoib_recvq_size; ++i) - if (priv->rx_ring[i].skb) { - dma_unmap_single(priv->ca->dma_device, - pci_unmap_addr(&priv->rx_ring[i], - mapping), - IPOIB_BUF_SIZE, - DMA_FROM_DEVICE); - dev_kfree_skb_any(priv->rx_ring[i].skb); - priv->rx_ring[i].skb = NULL; - } + for (i = 0; i < ipoib_recvq_size; ++i) { + struct ipoib_rx_buf *rx_req; + + rx_req = &priv->rx_ring[i]; + if (!rx_req->skb) + continue; + ib_dma_unmap_single(priv->ca, + rx_req->mapping, + 
IPOIB_BUF_SIZE, + DMA_FROM_DEVICE); + dev_kfree_skb_any(rx_req->skb); + rx_req->skb = NULL; + } goto timeout; } From ralph.campbell at qlogic.com Thu Nov 2 14:33:49 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 14:33:49 -0800 Subject: [openib-general] [PATCH 4/7] IB/iser - Use the new verbs DMA mapping functions Message-ID: <1162506829.29948.574.camel@brick.pathscale.com> IB/iser - Use the new verbs DMA mapping functions This patch converts iser to use the new verbs DMA mapping functions for kernel verbs consumers. From: Ralph Campbell diff -r f37bd0e41fec drivers/infiniband/ulp/iser/iser_memory.c --- a/drivers/infiniband/ulp/iser/iser_memory.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/ulp/iser/iser_memory.c Thu Oct 26 13:16:33 2006 -0800 @@ -51,7 +51,7 @@ */ int iser_regd_buff_release(struct iser_regd_buf *regd_buf) { - struct device *dma_device; + struct ib_device *dev; if ((atomic_read(&regd_buf->ref_count) == 0) || atomic_dec_and_test(&regd_buf->ref_count)) { @@ -60,8 +60,8 @@ int iser_regd_buff_release(struct iser_r iser_unreg_mem(&regd_buf->reg); if (regd_buf->dma_addr) { - dma_device = regd_buf->device->ib_device->dma_device; - dma_unmap_single(dma_device, + dev = regd_buf->device->ib_device; + ib_dma_unmap_single(dev, regd_buf->dma_addr, regd_buf->data_size, regd_buf->direction); @@ -85,10 +85,10 @@ void iser_reg_single(struct iser_device { dma_addr_t dma_addr; - dma_addr = dma_map_single(device->ib_device->dma_device, - regd_buf->virt_addr, - regd_buf->data_size, direction); - BUG_ON(dma_mapping_error(dma_addr)); + dma_addr = ib_dma_map_single(device->ib_device, + regd_buf->virt_addr, + regd_buf->data_size, direction); + BUG_ON(ib_dma_mapping_error(device->ib_device, dma_addr)); regd_buf->reg.lkey = device->mr->lkey; regd_buf->reg.len = regd_buf->data_size; @@ -106,7 +106,7 @@ int iser_start_rdma_unaligned_sg(struct enum iser_data_dir cmd_dir) { int dma_nents; - struct device *dma_device; + struct ib_device *dev; char 
*mem = NULL; struct iser_data_buf *data = &iser_ctask->data[cmd_dir]; unsigned long cmd_data_len = data->data_len; @@ -146,17 +146,12 @@ int iser_start_rdma_unaligned_sg(struct iser_ctask->data_copy[cmd_dir].copy_buf = mem; - dma_device = iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; - - if (cmd_dir == ISER_DIR_OUT) - dma_nents = dma_map_sg(dma_device, - &iser_ctask->data_copy[cmd_dir].sg_single, - 1, DMA_TO_DEVICE); - else - dma_nents = dma_map_sg(dma_device, - &iser_ctask->data_copy[cmd_dir].sg_single, - 1, DMA_FROM_DEVICE); - + dev = iser_ctask->iser_conn->ib_conn->device->ib_device; + dma_nents = ib_dma_map_sg(dev, + &iser_ctask->data_copy[cmd_dir].sg_single, + 1, + (cmd_dir == ISER_DIR_OUT) ? + DMA_TO_DEVICE : DMA_FROM_DEVICE); BUG_ON(dma_nents == 0); iser_ctask->data_copy[cmd_dir].dma_nents = dma_nents; @@ -169,19 +164,16 @@ void iser_finalize_rdma_unaligned_sg(str void iser_finalize_rdma_unaligned_sg(struct iscsi_iser_cmd_task *iser_ctask, enum iser_data_dir cmd_dir) { - struct device *dma_device; + struct ib_device *dev; struct iser_data_buf *mem_copy; unsigned long cmd_data_len; - dma_device = iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; - mem_copy = &iser_ctask->data_copy[cmd_dir]; - - if (cmd_dir == ISER_DIR_OUT) - dma_unmap_sg(dma_device, &mem_copy->sg_single, 1, - DMA_TO_DEVICE); - else - dma_unmap_sg(dma_device, &mem_copy->sg_single, 1, - DMA_FROM_DEVICE); + dev = iser_ctask->iser_conn->ib_conn->device->ib_device; + mem_copy = &iser_ctask->data_copy[cmd_dir]; + + ib_dma_unmap_sg(dev, &mem_copy->sg_single, 1, + (cmd_dir == ISER_DIR_OUT) ? + DMA_TO_DEVICE : DMA_FROM_DEVICE); if (cmd_dir == ISER_DIR_IN) { char *mem; @@ -230,7 +222,8 @@ void iser_finalize_rdma_unaligned_sg(str * consecutive elements. Also, it handles one entry SG. 
*/ static int iser_sg_to_page_vec(struct iser_data_buf *data, - struct iser_page_vec *page_vec) + struct iser_page_vec *page_vec, + struct ib_device *ibdev) { struct scatterlist *sg = (struct scatterlist *)data->buf; dma_addr_t first_addr, last_addr, page; @@ -243,10 +236,12 @@ static int iser_sg_to_page_vec(struct is page_vec->offset = (u64) sg[0].offset & ~MASK_4K; for (i = 0; i < data->dma_nents; i++) { - total_sz += sg_dma_len(&sg[i]); - - first_addr = sg_dma_address(&sg[i]); - last_addr = first_addr + sg_dma_len(&sg[i]); + unsigned int dma_len = ib_sg_dma_len(ibdev, &sg[i]); + + total_sz += dma_len; + + first_addr = ib_sg_dma_address(ibdev, &sg[i]); + last_addr = first_addr + dma_len; start_aligned = !(first_addr & ~MASK_4K); end_aligned = !(last_addr & ~MASK_4K); @@ -254,8 +249,9 @@ static int iser_sg_to_page_vec(struct is /* continue to collect page fragments till aligned or SG ends */ while (!end_aligned && (i + 1 < data->dma_nents)) { i++; - total_sz += sg_dma_len(&sg[i]); - last_addr = sg_dma_address(&sg[i]) + sg_dma_len(&sg[i]); + dma_len = ib_sg_dma_len(ibdev, &sg[i]); + total_sz += dma_len; + last_addr = ib_sg_dma_address(ibdev, &sg[i]) + dma_len; end_aligned = !(last_addr & ~MASK_4K); } @@ -287,7 +283,8 @@ static int iser_sg_to_page_vec(struct is * the number of entries which are aligned correctly. Supports the case where * consecutive SG elements are actually fragments of the same physcial page. 
*/ -static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data) +static unsigned int iser_data_buf_aligned_len(struct iser_data_buf *data, + struct ib_device *ibdev) { struct scatterlist *sg; dma_addr_t end_addr, next_addr; @@ -302,12 +299,12 @@ static unsigned int iser_data_buf_aligne (unsigned long)page_to_phys(sg[i].page), (unsigned long)sg[i].offset, (unsigned long)sg[i].length); */ - end_addr = sg_dma_address(&sg[i]) + - sg_dma_len(&sg[i]); + end_addr = ib_sg_dma_address(ibdev, &sg[i]) + + ib_sg_dma_len(ibdev, &sg[i]); /* iser_dbg("Checking sg iobuf end address " "0x%08lX\n", end_addr); */ if (i + 1 < data->dma_nents) { - next_addr = sg_dma_address(&sg[i+1]); + next_addr = ib_sg_dma_address(ibdev, &sg[i+1]); /* are i, i+1 fragments of the same page? */ if (end_addr == next_addr) continue; @@ -324,7 +321,8 @@ static unsigned int iser_data_buf_aligne return ret_len; } -static void iser_data_buf_dump(struct iser_data_buf *data) +static void iser_data_buf_dump(struct iser_data_buf *data, + struct ib_device *ibdev) { struct scatterlist *sg = (struct scatterlist *)data->buf; int i; @@ -332,9 +330,9 @@ static void iser_data_buf_dump(struct is for (i = 0; i < data->dma_nents; i++) iser_err("sg[%d] dma_addr:0x%lX page:0x%p " "off:0x%x sz:0x%x dma_len:0x%x\n", - i, (unsigned long)sg_dma_address(&sg[i]), + i, (unsigned long)ib_sg_dma_address(ibdev, &sg[i]), sg[i].page, sg[i].offset, - sg[i].length,sg_dma_len(&sg[i])); + sg[i].length, ib_sg_dma_len(ibdev, &sg[i])); } static void iser_dump_page_vec(struct iser_page_vec *page_vec) @@ -348,7 +346,8 @@ static void iser_dump_page_vec(struct is } static void iser_page_vec_build(struct iser_data_buf *data, - struct iser_page_vec *page_vec) + struct iser_page_vec *page_vec, + struct ib_device *ibdev) { int page_vec_len = 0; @@ -356,14 +355,14 @@ static void iser_page_vec_build(struct i page_vec->offset = 0; iser_dbg("Translating sg sz: %d\n", data->dma_nents); - page_vec_len = iser_sg_to_page_vec(data,page_vec); + 
page_vec_len = iser_sg_to_page_vec(data, page_vec, ibdev); iser_dbg("sg len %d page_vec_len %d\n", data->dma_nents,page_vec_len); page_vec->length = page_vec_len; if (page_vec_len * SIZE_4K < page_vec->data_size) { iser_err("page_vec too short to hold this SG\n"); - iser_data_buf_dump(data); + iser_data_buf_dump(data, ibdev); iser_dump_page_vec(page_vec); BUG(); } @@ -374,13 +373,12 @@ int iser_dma_map_task_data(struct iscsi_ enum iser_data_dir iser_dir, enum dma_data_direction dma_dir) { - struct device *dma_device; + struct ib_device *dev; iser_ctask->dir[iser_dir] = 1; - dma_device = - iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; - - data->dma_nents = dma_map_sg(dma_device, data->buf, data->size, dma_dir); + dev = iser_ctask->iser_conn->ib_conn->device->ib_device; + + data->dma_nents = ib_dma_map_sg(dev, data->buf, data->size, dma_dir); if (data->dma_nents == 0) { iser_err("dma_map_sg failed!!!\n"); return -EINVAL; @@ -390,20 +388,19 @@ int iser_dma_map_task_data(struct iscsi_ void iser_dma_unmap_task_data(struct iscsi_iser_cmd_task *iser_ctask) { - struct device *dma_device; + struct ib_device *dev; struct iser_data_buf *data; - dma_device = - iser_ctask->iser_conn->ib_conn->device->ib_device->dma_device; + dev = iser_ctask->iser_conn->ib_conn->device->ib_device; if (iser_ctask->dir[ISER_DIR_IN]) { data = &iser_ctask->data[ISER_DIR_IN]; - dma_unmap_sg(dma_device, data->buf, data->size, DMA_FROM_DEVICE); + ib_dma_unmap_sg(dev, data->buf, data->size, DMA_FROM_DEVICE); } if (iser_ctask->dir[ISER_DIR_OUT]) { data = &iser_ctask->data[ISER_DIR_OUT]; - dma_unmap_sg(dma_device, data->buf, data->size, DMA_TO_DEVICE); + ib_dma_unmap_sg(dev, data->buf, data->size, DMA_TO_DEVICE); } } @@ -418,6 +415,7 @@ int iser_reg_rdma_mem(struct iscsi_iser_ { struct iser_conn *ib_conn = iser_ctask->iser_conn->ib_conn; struct iser_device *device = ib_conn->device; + struct ib_device *ibdev = device->ib_device; struct iser_data_buf *mem = &iser_ctask->data[cmd_dir]; 
struct iser_regd_buf *regd_buf; int aligned_len; @@ -427,11 +425,11 @@ int iser_reg_rdma_mem(struct iscsi_iser_ regd_buf = &iser_ctask->rdma_regd[cmd_dir]; - aligned_len = iser_data_buf_aligned_len(mem); + aligned_len = iser_data_buf_aligned_len(mem, ibdev); if (aligned_len != mem->dma_nents) { iser_err("rdma alignment violation %d/%d aligned\n", aligned_len, mem->size); - iser_data_buf_dump(mem); + iser_data_buf_dump(mem, ibdev); /* unmap the command data before accessing it */ iser_dma_unmap_task_data(iser_ctask); @@ -449,8 +447,8 @@ int iser_reg_rdma_mem(struct iscsi_iser_ regd_buf->reg.lkey = device->mr->lkey; regd_buf->reg.rkey = device->mr->rkey; - regd_buf->reg.len = sg_dma_len(&sg[0]); - regd_buf->reg.va = sg_dma_address(&sg[0]); + regd_buf->reg.len = ib_sg_dma_len(ibdev, &sg[0]); + regd_buf->reg.va = ib_sg_dma_address(ibdev, &sg[0]); regd_buf->reg.is_fmr = 0; iser_dbg("PHYSICAL Mem.register: lkey: 0x%08X rkey: 0x%08X " @@ -460,10 +458,10 @@ int iser_reg_rdma_mem(struct iscsi_iser_ (unsigned long)regd_buf->reg.va, (unsigned long)regd_buf->reg.len); } else { /* use FMR for multiple dma entries */ - iser_page_vec_build(mem, ib_conn->page_vec); + iser_page_vec_build(mem, ib_conn->page_vec, ibdev); err = iser_reg_page_vec(ib_conn, ib_conn->page_vec, &regd_buf->reg); if (err) { - iser_data_buf_dump(mem); + iser_data_buf_dump(mem, ibdev); iser_err("mem->dma_nents = %d (dlength = 0x%x)\n", mem->dma_nents, ntoh24(iser_ctask->desc.iscsi_header.dlength)); iser_err("page_vec: data_size = 0x%x, length = %d, offset = 0x%x\n", From ralph.campbell at qlogic.com Thu Nov 2 14:35:03 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 14:35:03 -0800 Subject: [openib-general] [PATCH 5/7] IB/rds - Use the new verbs DMA mapping functions Message-ID: <1162506903.29948.577.camel@brick.pathscale.com> IB/rds - Use the new verbs DMA mapping functions This patch converts RDS to use the new DMA mapping functions for kernel verbs consumers. 
From: Ralph Campbell Index: src/linux-kernel/infiniband/ulp/rds/rds_buf.c =================================================================== --- src/linux-kernel/infiniband/ulp/rds/rds_buf.c (revision 9441) +++ src/linux-kernel/infiniband/ulp/rds/rds_buf.c (working copy) @@ -67,10 +67,10 @@ struct rds_buf* rds_alloc_send_buffer(st buf->loopback = FALSE; buf->optype = OP_SEND; buf->sge.length = ep->buffer_size; - buf->sge.addr = dma_map_single(ep->cma_id->device->dma_device, - buf->data, - buf->sge.length, - DMA_TO_DEVICE); + buf->sge.addr = ib_dma_map_single(ep->cma_id->device, + buf->data, + buf->sge.length, + DMA_TO_DEVICE); pci_unmap_addr_set(buf, mapping, buf->sge.addr); @@ -101,7 +101,7 @@ struct rds_buf* rds_alloc_recv_buffer(st buf->loopback = FALSE; buf->optype = OP_RECV; buf->sge.length = ep->buffer_size; - buf->sge.addr = dma_map_single(ep->cma_id->device->dma_device, + buf->sge.addr = ib_dma_map_single(ep->cma_id->device, buf->data, buf->sge.length, DMA_FROM_DEVICE); @@ -126,8 +126,8 @@ void rds_free_buffer(struct rds_buf *buf printk("rds: free buffer, bad ep or ep->kmem_cache!!\n"); return; } - dma_unmap_single( - ((struct rds_ep*)buf->parent_ep)->cma_id->device->dma_device, + ib_dma_unmap_single( + ((struct rds_ep*)buf->parent_ep)->cma_id->device, pci_unmap_addr(buf,mapping), buf->sge.length, DMA_TO_DEVICE); From ralph.campbell at qlogic.com Thu Nov 2 14:36:01 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 14:36:01 -0800 Subject: [openib-general] [PATCH 6/7] IB/sdp - Use the new verbs DMA mapping functions Message-ID: <1162506962.29948.578.camel@brick.pathscale.com> IB/sdp - Use the new verbs DMA mapping functions This patch converts SDP to use the new DMA mapping functions for kernel verbs consumers. 
From: Ralph Campbell Index: src/linux-kernel/infiniband/ulp/sdp/sdp_bcopy.c =================================================================== --- src/linux-kernel/infiniband/ulp/sdp/sdp_bcopy.c (revision 9441) +++ src/linux-kernel/infiniband/ulp/sdp/sdp_bcopy.c (working copy) @@ -67,7 +67,7 @@ void sdp_post_send(struct sdp_sock *ssk, unsigned mseq = ssk->tx_head; int i, rc, frags; dma_addr_t addr; - struct device *hwdev; + struct ib_device *dev; struct ib_sge *sge; struct ib_send_wr *bad_wr; @@ -80,15 +80,14 @@ void sdp_post_send(struct sdp_sock *ssk, tx_req = &ssk->tx_ring[mseq & (SDP_TX_SIZE - 1)]; tx_req->skb = skb; - hwdev = ssk->dma_device; + dev = ssk->mr->device; sge = ssk->ibsge; - addr = dma_map_single(hwdev, - skb->data, skb->len - skb->data_len, - DMA_TO_DEVICE); + addr = ib_dma_map_single(dev, skb->data, skb->len - skb->data_len, + DMA_TO_DEVICE); tx_req->mapping[0] = addr; /* TODO: proper error handling */ - BUG_ON(dma_mapping_error(addr)); + BUG_ON(ib_dma_mapping_error(dev, addr)); sge->addr = (u64)addr; sge->length = skb->len - skb->data_len; @@ -96,11 +95,11 @@ void sdp_post_send(struct sdp_sock *ssk, frags = skb_shinfo(skb)->nr_frags; for (i = 0; i < frags; ++i) { ++sge; - addr = dma_map_page(hwdev, skb_shinfo(skb)->frags[i].page, - skb_shinfo(skb)->frags[i].page_offset, - skb_shinfo(skb)->frags[i].size, - DMA_TO_DEVICE); - BUG_ON(dma_mapping_error(addr)); + addr = ib_dma_map_page(dev, skb_shinfo(skb)->frags[i].page, + skb_shinfo(skb)->frags[i].page_offset, + skb_shinfo(skb)->frags[i].size, + DMA_TO_DEVICE); + BUG_ON(ib_dma_mapping_error(dev, addr)); tx_req->mapping[i + 1] = addr; sge->addr = addr; sge->length = skb_shinfo(skb)->frags[i].size; @@ -124,7 +123,7 @@ void sdp_post_send(struct sdp_sock *ssk, struct sk_buff *sdp_send_completion(struct sdp_sock *ssk, int mseq) { - struct device *hwdev; + struct ib_device *dev; struct sdp_buf *tx_req; struct sk_buff *skb; int i, frags; @@ -135,16 +134,16 @@ struct sk_buff *sdp_send_completion(stru return 
NULL; } - hwdev = ssk->dma_device; + dev = ssk->mr->device; tx_req = &ssk->tx_ring[mseq & (SDP_TX_SIZE - 1)]; skb = tx_req->skb; - dma_unmap_single(hwdev, tx_req->mapping[0], skb->len - skb->data_len, - DMA_TO_DEVICE); + ib_dma_unmap_single(dev, tx_req->mapping[0], skb->len - skb->data_len, + DMA_TO_DEVICE); frags = skb_shinfo(skb)->nr_frags; for (i = 0; i < frags; ++i) { - dma_unmap_page(hwdev, tx_req->mapping[i + 1], - skb_shinfo(skb)->frags[i].size, - DMA_TO_DEVICE); + ib_dma_unmap_page(dev, tx_req->mapping[i + 1], + skb_shinfo(skb)->frags[i].size, + DMA_TO_DEVICE); } ++ssk->tx_tail; @@ -157,7 +156,7 @@ static void sdp_post_recv(struct sdp_soc struct sdp_buf *rx_req; int i, rc, frags; dma_addr_t addr; - struct device *hwdev; + struct ib_device *dev; struct ib_sge *sge; struct ib_recv_wr *bad_wr; struct sk_buff *skb; @@ -188,11 +187,10 @@ static void sdp_post_recv(struct sdp_soc rx_req = ssk->rx_ring + (id & (SDP_RX_SIZE - 1)); rx_req->skb = skb; - hwdev = ssk->dma_device; + dev = ssk->mr->device; sge = ssk->ibsge; - addr = dma_map_single(hwdev, h, skb_headlen(skb), - DMA_FROM_DEVICE); - BUG_ON(dma_mapping_error(addr)); + addr = ib_dma_map_single(dev, h, skb_headlen(skb), DMA_FROM_DEVICE); + BUG_ON(ib_dma_mapping_error(dev, addr)); rx_req->mapping[0] = addr; @@ -203,11 +201,11 @@ static void sdp_post_recv(struct sdp_soc frags = skb_shinfo(skb)->nr_frags; for (i = 0; i < frags; ++i) { ++sge; - addr = dma_map_page(hwdev, skb_shinfo(skb)->frags[i].page, - skb_shinfo(skb)->frags[i].page_offset, - skb_shinfo(skb)->frags[i].size, - DMA_FROM_DEVICE); - BUG_ON(dma_mapping_error(addr)); + addr = ib_dma_map_page(dev, skb_shinfo(skb)->frags[i].page, + skb_shinfo(skb)->frags[i].page_offset, + skb_shinfo(skb)->frags[i].size, + DMA_FROM_DEVICE); + BUG_ON(ib_dma_mapping_error(dev, addr)); rx_req->mapping[i + 1] = addr; sge->addr = addr; sge->length = skb_shinfo(skb)->frags[i].size; @@ -242,7 +240,7 @@ void sdp_post_recvs(struct sdp_sock *ssk struct sk_buff 
*sdp_recv_completion(struct sdp_sock *ssk, int id) { struct sdp_buf *rx_req; - struct device *hwdev; + struct ib_device *dev; struct sk_buff *skb; int i, frags; @@ -252,16 +250,16 @@ struct sk_buff *sdp_recv_completion(stru return NULL; } - hwdev = ssk->dma_device; + dev = ssk->mr->device; rx_req = &ssk->rx_ring[id & (SDP_RX_SIZE - 1)]; skb = rx_req->skb; - dma_unmap_single(hwdev, rx_req->mapping[0], skb_headlen(skb), - DMA_FROM_DEVICE); + ib_dma_unmap_single(dev, rx_req->mapping[0], skb_headlen(skb), + DMA_FROM_DEVICE); frags = skb_shinfo(skb)->nr_frags; for (i = 0; i < frags; ++i) - dma_unmap_page(hwdev, rx_req->mapping[i + 1], - skb_shinfo(skb)->frags[i].size, - DMA_TO_DEVICE); + ib_dma_unmap_page(dev, rx_req->mapping[i + 1], + skb_shinfo(skb)->frags[i].size, + DMA_TO_DEVICE); ++ssk->rx_tail; --ssk->remote_credits; return skb; Index: src/linux-kernel/infiniband/ulp/sdp/sdp_cma.c =================================================================== --- src/linux-kernel/infiniband/ulp/sdp/sdp_cma.c (revision 9441) +++ src/linux-kernel/infiniband/ulp/sdp/sdp_cma.c (working copy) @@ -159,7 +159,6 @@ int sdp_init_qp(struct sock *sk, struct } sdp_sk(sk)->cq = cq; sdp_sk(sk)->qp = id->qp; - sdp_sk(sk)->dma_device = device->dma_device; init_waitqueue_head(&sdp_sk(sk)->wq); Index: src/linux-kernel/infiniband/ulp/sdp/sdp.h =================================================================== --- src/linux-kernel/infiniband/ulp/sdp/sdp.h (revision 9441) +++ src/linux-kernel/infiniband/ulp/sdp/sdp.h (working copy) @@ -79,7 +79,6 @@ struct sdp_sock { struct ib_qp *qp; struct ib_cq *cq; struct ib_mr *mr; - struct device *dma_device; /* Like tcp_sock */ __u16 urg_data; int offset; /* like seq in tcp */ From ralph.campbell at qlogic.com Thu Nov 2 14:37:00 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 14:37:00 -0800 Subject: [openib-general] [PATCH 7/7] IB/srp - Use new verbs IB DMA mapping functions Message-ID: 
<1162507020.29948.580.camel@brick.pathscale.com> IB/srp - Use new verbs IB DMA mapping functions This patch converts SRP to use the new verbs DMA mapping functions for kernel verbs consumers. From: Ralph Campbell diff -r f37bd0e41fec drivers/infiniband/ulp/srp/ib_srp.c --- a/drivers/infiniband/ulp/srp/ib_srp.c Thu Oct 26 21:44:41 2006 +0700 +++ b/drivers/infiniband/ulp/srp/ib_srp.c Thu Oct 26 12:33:28 2006 -0800 @@ -122,9 +122,8 @@ static struct srp_iu *srp_alloc_iu(struc if (!iu->buf) goto out_free_iu; - iu->dma = dma_map_single(host->dev->dev->dma_device, - iu->buf, size, direction); - if (dma_mapping_error(iu->dma)) + iu->dma = ib_dma_map_single(host->dev->dev, iu->buf, size, direction); + if (ib_dma_mapping_error(host->dev->dev, iu->dma)) goto out_free_buf; iu->size = size; @@ -145,8 +144,7 @@ static void srp_free_iu(struct srp_host if (!iu) return; - dma_unmap_single(host->dev->dev->dma_device, - iu->dma, iu->size, iu->direction); + ib_dma_unmap_single(host->dev->dev, iu->dma, iu->size, iu->direction); kfree(iu->buf); kfree(iu); } @@ -481,8 +479,8 @@ static void srp_unmap_data(struct scsi_c scat = &req->fake_sg; } - dma_unmap_sg(target->srp_host->dev->dev->dma_device, scat, nents, - scmnd->sc_data_direction); + ib_dma_unmap_sg(target->srp_host->dev->dev, scat, nents, + scmnd->sc_data_direction); } static void srp_remove_req(struct srp_target_port *target, struct srp_request *req) @@ -594,23 +592,26 @@ static int srp_map_fmr(struct srp_target int i, j; int ret; struct srp_device *dev = target->srp_host->dev; + struct ib_device *ibdev = dev->dev; if (!dev->fmr_pool) return -ENODEV; - if ((sg_dma_address(&scat[0]) & ~dev->fmr_page_mask) && + if ((ib_sg_dma_address(ibdev, &scat[0]) & ~dev->fmr_page_mask) && mellanox_workarounds && !memcmp(&target->ioc_guid, mellanox_oui, 3)) return -EINVAL; len = page_cnt = 0; for (i = 0; i < sg_cnt; ++i) { - if (sg_dma_address(&scat[i]) & ~dev->fmr_page_mask) { + unsigned int dma_len = ib_sg_dma_len(ibdev, &scat[i]); + + if 
(ib_sg_dma_address(ibdev, &scat[i]) & ~dev->fmr_page_mask) { if (i > 0) return -EINVAL; else ++page_cnt; } - if ((sg_dma_address(&scat[i]) + sg_dma_len(&scat[i])) & + if ((ib_sg_dma_address(ibdev, &scat[i]) + dma_len) & ~dev->fmr_page_mask) { if (i < sg_cnt - 1) return -EINVAL; @@ -618,7 +619,7 @@ static int srp_map_fmr(struct srp_target ++page_cnt; } - len += sg_dma_len(&scat[i]); + len += dma_len; } page_cnt += len >> dev->fmr_page_shift; @@ -630,10 +631,14 @@ static int srp_map_fmr(struct srp_target return -ENOMEM; page_cnt = 0; - for (i = 0; i < sg_cnt; ++i) - for (j = 0; j < sg_dma_len(&scat[i]); j += dev->fmr_page_size) + for (i = 0; i < sg_cnt; ++i) { + unsigned int dma_len = ib_sg_dma_len(ibdev, &scat[i]); + + for (j = 0; j < dma_len; j += dev->fmr_page_size) dma_pages[page_cnt++] = - (sg_dma_address(&scat[i]) & dev->fmr_page_mask) + j; + (ib_sg_dma_address(ibdev, &scat[i]) & + dev->fmr_page_mask) + j; + } req->fmr = ib_fmr_pool_map_phys(dev->fmr_pool, dma_pages, page_cnt, io_addr); @@ -643,7 +648,8 @@ static int srp_map_fmr(struct srp_target goto out; } - buf->va = cpu_to_be64(sg_dma_address(&scat[0]) & ~dev->fmr_page_mask); + buf->va = cpu_to_be64(ib_sg_dma_address(ibdev, &scat[0]) & + ~dev->fmr_page_mask); buf->key = cpu_to_be32(req->fmr->fmr->rkey); buf->len = cpu_to_be32(len); @@ -662,6 +668,8 @@ static int srp_map_data(struct scsi_cmnd struct srp_cmd *cmd = req->cmd->buf; int len, nents, count; u8 fmt = SRP_DATA_DESC_DIRECT; + struct srp_device *dev; + struct ib_device *ibdev; if (!scmnd->request_buffer || scmnd->sc_data_direction == DMA_NONE) return sizeof (struct srp_cmd); @@ -686,8 +694,10 @@ static int srp_map_data(struct scsi_cmnd sg_init_one(scat, scmnd->request_buffer, scmnd->request_bufflen); } - count = dma_map_sg(target->srp_host->dev->dev->dma_device, - scat, nents, scmnd->sc_data_direction); + dev = target->srp_host->dev; + ibdev = dev->dev; + + count = ib_dma_map_sg(ibdev, scat, nents, scmnd->sc_data_direction); fmt = 
SRP_DATA_DESC_DIRECT; len = sizeof (struct srp_cmd) + sizeof (struct srp_direct_buf); @@ -701,9 +711,9 @@ static int srp_map_data(struct scsi_cmnd */ struct srp_direct_buf *buf = (void *) cmd->add_data; - buf->va = cpu_to_be64(sg_dma_address(scat)); - buf->key = cpu_to_be32(target->srp_host->dev->mr->rkey); - buf->len = cpu_to_be32(sg_dma_len(scat)); + buf->va = cpu_to_be64(ib_sg_dma_address(ibdev, scat)); + buf->key = cpu_to_be32(dev->mr->rkey); + buf->len = cpu_to_be32(ib_sg_dma_len(ibdev, scat)); } else if (srp_map_fmr(target, scat, count, req, (void *) cmd->add_data)) { /* @@ -721,13 +731,14 @@ static int srp_map_data(struct scsi_cmnd count * sizeof (struct srp_direct_buf); for (i = 0; i < count; ++i) { + unsigned int dma_len = ib_sg_dma_len(ibdev, &scat[i]); + buf->desc_list[i].va = - cpu_to_be64(sg_dma_address(&scat[i])); + cpu_to_be64(ib_sg_dma_address(ibdev, &scat[i])); buf->desc_list[i].key = - cpu_to_be32(target->srp_host->dev->mr->rkey); - buf->desc_list[i].len = - cpu_to_be32(sg_dma_len(&scat[i])); - datalen += sg_dma_len(&scat[i]); + cpu_to_be32(dev->mr->rkey); + buf->desc_list[i].len = cpu_to_be32(dma_len); + datalen += dma_len; } if (scmnd->sc_data_direction == DMA_TO_DEVICE) @@ -807,13 +818,15 @@ static void srp_process_rsp(struct srp_t static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc) { + struct ib_device *dev; struct srp_iu *iu; u8 opcode; iu = target->rx_ring[wc->wr_id & ~SRP_OP_RECV]; - dma_sync_single_for_cpu(target->srp_host->dev->dev->dma_device, iu->dma, - target->max_ti_iu_len, DMA_FROM_DEVICE); + dev = target->srp_host->dev->dev; + ib_dma_sync_single_for_cpu(dev, iu->dma, target->max_ti_iu_len, + DMA_FROM_DEVICE); opcode = *(u8 *) iu->buf; @@ -849,8 +862,8 @@ static void srp_handle_recv(struct srp_t break; } - dma_sync_single_for_device(target->srp_host->dev->dev->dma_device, iu->dma, - target->max_ti_iu_len, DMA_FROM_DEVICE); + ib_dma_sync_single_for_device(dev, iu->dma, target->max_ti_iu_len, + 
DMA_FROM_DEVICE); } static void srp_completion(struct ib_cq *cq, void *target_ptr) @@ -968,6 +981,7 @@ static int srp_queuecommand(struct scsi_ struct srp_request *req; struct srp_iu *iu; struct srp_cmd *cmd; + struct ib_device *dev; int len; if (target->state == SRP_TARGET_CONNECTING) @@ -984,8 +998,9 @@ static int srp_queuecommand(struct scsi_ if (!iu) goto err; - dma_sync_single_for_cpu(target->srp_host->dev->dev->dma_device, iu->dma, - srp_max_iu_len, DMA_TO_DEVICE); + dev = target->srp_host->dev->dev; + ib_dma_sync_single_for_cpu(dev, iu->dma, srp_max_iu_len, + DMA_TO_DEVICE); req = list_entry(target->free_reqs.next, struct srp_request, list); @@ -1017,8 +1032,8 @@ static int srp_queuecommand(struct scsi_ goto err_unmap; } - dma_sync_single_for_device(target->srp_host->dev->dev->dma_device, iu->dma, - srp_max_iu_len, DMA_TO_DEVICE); + ib_dma_sync_single_for_device(dev, iu->dma, srp_max_iu_len, + DMA_TO_DEVICE); if (__srp_post_send(target, iu, len)) { printk(KERN_ERR PFX "Send failed\n"); From mst at mellanox.co.il Thu Nov 2 14:52:22 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 3 Nov 2006 00:52:22 +0200 Subject: [openib-general] [PATCH] use mmiowb after doorbell ring In-Reply-To: References: Message-ID: <20061102225222.GA15403@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] use mmiowb after doorbell ring > > > By the way, what's up with this project? > > It's still planned for libibverbs 1.1, isn't it? > > I'm working on it along with other things. > > Where are your patches for using multiple EQs for CQ events? :) Sorry, that was not an attempt to pressure you. I was just organising my plans for the next month or so and wanted to check whether this needs my attention. I actually forgot about the multiple EQ idea - thanks for the reminder, I need to look into how the API would look. -- MST From mst at mellanox.co.il Thu Nov 2 14:54:32 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Fri, 3 Nov 2006 00:54:32 +0200 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <454A4C0B.1080609@ichips.intel.com> References: <454A4C0B.1080609@ichips.intel.com> Message-ID: <20061102225432.GB15403@mellanox.co.il> Quoting r. Arlin Davis : > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq > > Sean Hefty wrote: > > >One option is having the SA (or ib_umad?) return a busy status in response to a > >MAD, but we'd still have to be able to send this response as quickly as requests > >are being received. We could then limit the number of requests that would be > >queued in the kernel for a user. > > > > > > Another great option would be to have path record caching. Unfortunately > OFED 1.1 did not include ib_local_sa in the release. > This won't help you much. With 256 nodes all to all already gives you 65000 requests which is the same order of magnitude as the reported 130000. -- MST From kliteyn at dev.mellanox.co.il Thu Nov 2 15:00:45 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 03 Nov 2006 01:00:45 +0200 Subject: [openib-general] [PATCH v2] opensm: strict osm_log arguments/format check In-Reply-To: <20061102192039.GI17244@sashak.voltaire.com> References: <20061102105348.GA16559@sashak.voltaire.com> <20061102192039.GI17244@sashak.voltaire.com> Message-ID: <454A789D.3000004@dev.mellanox.co.il> Looks ok, great. -- Yevgeny Sasha Khapyorsky wrote: > This adds gcc attribute to osm_log() which causes the compiler to check > argument types against a format string. And also there are related fixes > in osm_log() usage. 
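For readers who have not used the attribute described above, here is a minimal sketch of the mechanism. It assumes a cut-down logger: demo_log(), demo_log_t, demo_log_level_t and STRICT_DEMO_LOG_FORMAT are hypothetical stand-ins for illustration, not the OpenSM definitions. The indices (3, 4) say that the format string is the third parameter and the variadic arguments start at the fourth, which is the same layout as osm_log().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-ins for the OpenSM log types (illustration only). */
typedef struct demo_log {
	char buf[256];	/* holds the last formatted message */
} demo_log_t;

typedef enum { DEMO_LOG_ERROR = 1, DEMO_LOG_VERBOSE = 4 } demo_log_level_t;

/* Only GCC-compatible compilers understand the attribute. */
#ifdef __GNUC__
#define STRICT_DEMO_LOG_FORMAT __attribute__((format(printf, 3, 4)))
#else
#define STRICT_DEMO_LOG_FORMAT
#endif

/* The attribute goes on the declaration: argument 3 is the format
 * string, argument 4 is the first value checked against it. */
int demo_log(demo_log_t *p_log, demo_log_level_t verbosity,
	     const char *p_str, ...) STRICT_DEMO_LOG_FORMAT;

int demo_log(demo_log_t *p_log, demo_log_level_t verbosity,
	     const char *p_str, ...)
{
	va_list args;
	int n;

	assert(p_log != NULL);
	(void)verbosity;	/* this sketch does not filter by level */

	va_start(args, p_str);
	n = vsnprintf(p_log->buf, sizeof p_log->buf, p_str, args);
	va_end(args);
	return n;	/* number of characters the message needs */
}
```

With the declaration in place, gcc -Wall should warn about a call such as demo_log(&log, DEMO_LOG_ERROR, "file (%s)\n") with its missing argument, or about a stray comma that turns the second half of a message into an unused extra argument. Both kinds of mistake appear among the call sites fixed in this patch.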
> > Signed-off-by: Sasha Khapyorsky > --- > osm/include/opensm/osm_log.h | 8 +++++++- > osm/libvendor/osm_vendor_ibumad_sa.c | 2 +- > osm/opensm/main.c | 3 ++- > osm/opensm/osm_pkey_mgr.c | 1 + > osm/opensm/osm_port_info_rcv.c | 5 +++-- > osm/opensm/osm_sa_informinfo.c | 4 ++-- > osm/opensm/osm_sa_link_record.c | 8 ++++---- > osm/opensm/osm_sa_mad_ctrl.c | 3 ++- > osm/opensm/osm_sa_response.c | 2 +- > osm/opensm/osm_sm_state_mgr.c | 3 ++- > osm/opensm/osm_sminfo_rcv.c | 9 +++++---- > osm/opensm/osm_state_mgr.c | 8 ++++---- > osm/osmtest/osmt_multicast.c | 12 +++++++----- > osm/osmtest/osmt_service.c | 6 +++--- > osm/osmtest/osmtest.c | 8 ++++---- > 15 files changed, 48 insertions(+), 34 deletions(-) > > diff --git a/osm/include/opensm/osm_log.h b/osm/include/opensm/osm_log.h > index 6a1a93f..f51a1c8 100644 > --- a/osm/include/opensm/osm_log.h > +++ b/osm/include/opensm/osm_log.h > @@ -60,6 +60,12 @@ #include > #include > #include > > +#ifdef __GNUC__ > +#define STRICT_OSM_LOG_FORMAT __attribute__((format(printf, 3, 4))) > +#else > +#define STRICT_OSM_LOG_FORMAT > +#endif > + > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > # define END_C_DECLS } > @@ -377,7 +383,7 @@ void > osm_log( > IN osm_log_t* const p_log, > IN const osm_log_level_t verbosity, > - IN const char *p_str, ... ); > + IN const char *p_str, ... 
) STRICT_OSM_LOG_FORMAT; > > void > osm_log_raw( > diff --git a/osm/libvendor/osm_vendor_ibumad_sa.c b/osm/libvendor/osm_vendor_ibumad_sa.c > index 7fd0655..7c4a2f7 100644 > --- a/osm/libvendor/osm_vendor_ibumad_sa.c > +++ b/osm/libvendor/osm_vendor_ibumad_sa.c > @@ -853,7 +853,7 @@ #ifdef DUAL_SIDED_RMPP > if ( p_mpr_req->sgid_count + p_mpr_req->dgid_count > IB_MULTIPATH_MAX_GIDS ) > { > osm_log( p_log, OSM_LOG_ERROR, > - "osmv_query_sa DBG:001 MULTIPATH_REC ", > + "osmv_query_sa DBG:001 MULTIPATH_REC " > "SGID count %d DGID count %d max count %d\n", > p_mpr_req->sgid_count, p_mpr_req->dgid_count, > IB_MULTIPATH_MAX_GIDS ); > diff --git a/osm/opensm/main.c b/osm/opensm/main.c > index 729702a..752b546 100644 > --- a/osm/opensm/main.c > +++ b/osm/opensm/main.c > @@ -460,7 +460,8 @@ parse_ignore_guids_file(IN char *guids_f > { > osm_log( &p_osm->log, OSM_LOG_ERROR, > "parse_ignore_guids_file: ERR 0601: " > - "Unable to open ignore guids file (%s)\n" ); > + "Unable to open ignore guids file (%s)\n", > + guids_file_name ); > status = IB_ERROR; > goto Exit; > } > diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c > index f2cb221..735dc14 100644 > --- a/osm/opensm/osm_pkey_mgr.c > +++ b/osm/opensm/osm_pkey_mgr.c > @@ -139,6 +139,7 @@ pkey_mgr_process_physical_port( > "pkey_mgr_process_physical_port: ERR 0503: " > "Failed to obtain P_Key 0x%04x block and index for node " > "0x%016" PRIx64 " port %u\n", > + ib_pkey_get_base( pkey ), > cl_ntoh64( osm_node_get_node_guid( p_node ) ), > osm_physp_get_port_num( p_physp ) ); > return; > diff --git a/osm/opensm/osm_port_info_rcv.c b/osm/opensm/osm_port_info_rcv.c > index 95112dc..f6d3595 100644 > --- a/osm/opensm/osm_port_info_rcv.c > +++ b/osm/opensm/osm_port_info_rcv.c > @@ -724,8 +724,9 @@ osm_pi_rcv_process( > { > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > "osm_pi_rcv_process: " > - "Got light sweep response from remote port of parent node GUID = 0x%" PRIx64 > - " port = %u, Commencing heavy sweep\n", > + "Got 
light sweep response from remote port of parent node " > + "GUID = 0x%" PRIx64 " port = 0x%016" PRIx64 > + ", Commencing heavy sweep\n", > cl_ntoh64( node_guid ), > cl_ntoh64( port_guid ) ); > osm_state_mgr_process( p_rcv->p_state_mgr, > diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c > index 69dca1d..0cec307 100644 > --- a/osm/opensm/osm_sa_informinfo.c > +++ b/osm/opensm/osm_sa_informinfo.c > @@ -163,8 +163,8 @@ __validate_ports_access_rights( > { > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__validate_ports_access_rights: ERR 4301: " > - "Invalid port guid: 0x%016\n", > - portguid ); > + "Invalid port guid: 0x%016" PRIx64 "\n", > + cl_ntoh64(portguid) ); > valid = FALSE; > goto Exit; > } > diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c > index 751023f..0ca9092 100644 > --- a/osm/opensm/osm_sa_link_record.c > +++ b/osm/opensm/osm_sa_link_record.c > @@ -145,10 +145,10 @@ __osm_lr_rcv_build_physp_link( > osm_log( p_rcv->p_log, OSM_LOG_ERROR, > "__osm_lr_rcv_build_physp_link: ERR 1801: " > "Unable to acquire link record\n" > - "\t\t\t\tFrom port 0x%\n" > - "\t\t\t\tTo port 0x%\n" > - "\t\t\t\tFrom lid 0x%\n" > - "\t\t\t\tTo lid 0x%\n", > + "\t\t\t\tFrom port 0x%u\n" > + "\t\t\t\tTo port 0x%u\n" > + "\t\t\t\tFrom lid 0x%u\n" > + "\t\t\t\tTo lid 0x%u\n", > from_port, to_port, > cl_ntoh16(from_lid), > cl_ntoh16(to_lid) ); > diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c > index cd896b6..208f0d2 100644 > --- a/osm/opensm/osm_sa_mad_ctrl.c > +++ b/osm/opensm/osm_sa_mad_ctrl.c > @@ -132,7 +132,8 @@ __osm_sa_mad_ctrl_process( > "__osm_sa_mad_ctrl_process: " > /* "Responding BUSY status since the dispatcher is already"*/ > "Dropping MAD since the dispatcher is already" > - " overloaded with %u messages and queue time of:%u[msec]\n", > + " overloaded with %u messages and queue time of:" > + "%" PRIu64 "[msec]\n", > num_messages, last_dispatched_msg_queue_time_msec ); > > /* send a busy 
response */ > diff --git a/osm/opensm/osm_sa_response.c b/osm/opensm/osm_sa_response.c > index db36ea2..27f4e9d 100644 > --- a/osm/opensm/osm_sa_response.c > +++ b/osm/opensm/osm_sa_response.c > @@ -117,7 +117,7 @@ osm_sa_send_error( > if (osm_exit_flag) > { > osm_log( p_resp->p_log, OSM_LOG_DEBUG, > - "osm_sa_send_error: ", > + "osm_sa_send_error: " > "Ignoring requested send after exit\n" ); > goto Exit; > } > diff --git a/osm/opensm/osm_sm_state_mgr.c b/osm/opensm/osm_sm_state_mgr.c > index aadc43a..7489c28 100644 > --- a/osm/opensm/osm_sm_state_mgr.c > +++ b/osm/opensm/osm_sm_state_mgr.c > @@ -247,7 +247,8 @@ __osm_sm_state_mgr_send_master_sm_info_r > { > osm_log( p_sm_mgr->p_log, OSM_LOG_ERROR, > "__osm_sm_state_mgr_send_master_sm_info_req: ERR 3203: " > - "No port object for GUID 0x%X\n", p_sm_mgr->master_guid ); > + "No port object for GUID 0x%016" PRIx64 "\n", > + cl_ntoh64(p_sm_mgr->master_guid) ); > goto Exit; > } > > diff --git a/osm/opensm/osm_sminfo_rcv.c b/osm/opensm/osm_sminfo_rcv.c > index 825b18b..2fcd2d4 100644 > --- a/osm/opensm/osm_sminfo_rcv.c > +++ b/osm/opensm/osm_sminfo_rcv.c > @@ -402,8 +402,8 @@ __osm_sminfo_rcv_process_set_request( > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > "__osm_sminfo_rcv_process_set_request: " > "Received a STANDBY signal. Updating " > - "sm_state_mgr master_guid: 0x%X\n", > - p_rcv_smi->guid ); > + "sm_state_mgr master_guid: 0x%016" PRIx64 "\n", > + cl_ntoh64(p_rcv_smi->guid) ); > p_rcv->p_sm_state_mgr->master_guid = p_rcv_smi->guid; > } > > @@ -482,8 +482,9 @@ __osm_sminfo_rcv_process_get_sm( > /* we will poll it - as long as it lives - we should be in Standby. */ > osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, > "__osm_sminfo_rcv_process_get_sm: " > - "Found higher SM. Updating sm_state_mgr master_guid: 0x%X\n", > - p_sm->p_port->guid ); > + "Found higher SM. 
Updating sm_state_mgr master_guid:" > + " 0x%016" PRIx64 "\n", > + cl_ntoh64(p_sm->p_port->guid) ); > p_rcv->p_sm_state_mgr->master_guid = p_sm->p_port->guid; > } > break; > diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c > index 66da6fa..70af836 100644 > --- a/osm/opensm/osm_state_mgr.c > +++ b/osm/opensm/osm_state_mgr.c > @@ -481,7 +481,7 @@ __osm_state_mgr_signal_warning( > { > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > "__osm_state_mgr_signal_warning: " > - "Invalid signal %s(%d) in state %s\n", > + "Invalid signal %s(%lu) in state %s\n", > osm_get_sm_signal_str( signal ), > signal, osm_get_sm_state_str( p_mgr->state ) ); > } > @@ -500,7 +500,7 @@ __osm_state_mgr_signal_error( > else > osm_log( p_mgr->p_log, OSM_LOG_ERROR, > "__osm_state_mgr_signal_error: ERR 3303: " > - "Invalid signal %s(%d) in state %s\n", > + "Invalid signal %s(%lu) in state %s\n", > osm_get_sm_signal_str( signal ), > signal, osm_get_sm_state_str( p_mgr->state ) ); > } > @@ -1480,8 +1480,8 @@ __osm_state_mgr_exists_other_master_sm( > { > osm_log( p_mgr->p_log, OSM_LOG_VERBOSE, > "__osm_state_mgr_exists_other_master_sm: " > - "Found remote master SM with guid:0x%X\n", > - p_sm->smi.guid ); > + "Found remote master SM with guid:0x%016" PRIx64 "\n", > + cl_ntoh64(p_sm->smi.guid) ); > p_sm_res = p_sm; > goto Exit; > } > diff --git a/osm/osmtest/osmt_multicast.c b/osm/osmtest/osmt_multicast.c > index 33a4f47..19f9d37 100644 > --- a/osm/osmtest/osmt_multicast.c > +++ b/osm/osmtest/osmt_multicast.c > @@ -1885,8 +1885,9 @@ #endif > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_mcast_flow: ERR 0209: " > - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", > - p_mc_res->mgid > + "Validating MGID failed. 
MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", > + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), > + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) > ); > status = IB_ERROR; > goto Exit; > @@ -2044,8 +2045,9 @@ #endif > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_mcast_flow: ERR 0212: " > - "Validating MGID failed. MGID:0x%016" PRIx64 "\n", > - p_mc_res->mgid > + "Validating MGID failed. MGID:0x%016" PRIx64 ":%016" PRIx64 "\n", > + cl_ntoh64( p_mc_res->mgid.unicast.prefix ), > + cl_ntoh64( p_mc_res->mgid.unicast.interface_id ) > ); > status = IB_ERROR; > goto Exit; > @@ -3345,7 +3347,7 @@ #endif > /* Delete all MCG that are not of IPoIB */ > osm_log( &p_osmt->log, OSM_LOG_INFO, > "osmt_run_mcast_flow : " > - "Cleanup all MCG that are not IPoIB...\n", cnt ); > + "Cleanup all MCG that are not IPoIB...\n" ); > > p_mgrp_mlid_tbl = &p_osmt->exp_subn.mgrp_mlid_tbl; > p_mgrp = (osmtest_mgrp_t*)cl_qmap_head( p_mgrp_mlid_tbl ); > diff --git a/osm/osmtest/osmt_service.c b/osm/osmtest/osmt_service.c > index ec9a39e..ab95fec 100644 > --- a/osm/osmtest/osmt_service.c > +++ b/osm/osmtest/osmt_service.c > @@ -1559,7 +1559,7 @@ #endif > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_service_records_flow: ERR 4A20: " > - "Found service: id: 0x%016 " PRIx64 > + "Found service: id: 0x%016" PRIx64 " " > "that is invalid\n", > id[7] ); > status = IB_ERROR; > @@ -1573,7 +1573,7 @@ #endif > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_service_records_flow: ERR 4A21: " > - "Fail to find service: id: 0x%016 " PRIx64 > + "Fail to find service: id: 0x%016" PRIx64 " " > "name: %s\n", > id[0], > (char*)service_name[0] ); > @@ -1588,7 +1588,7 @@ #endif > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmt_run_service_records_flow: ERR 4A22: " > - "Fail to find service: id: 0x%016 " PRIx64 > + "Fail to find service: id: 0x%016" PRIx64 " " > "name: %s\n", > id[5], > (char*)service_name[6] ); > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c > index 92a4190..a35e0c5 
100644 > --- a/osm/osmtest/osmtest.c > +++ b/osm/osmtest/osmtest.c > @@ -2787,7 +2787,8 @@ osmtest_create_inventory_file( IN osmtes > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmtest_create_inventory_file: ERR 0079: " > - "Unable to open inventory file (%s)\n" ); > + "Unable to open inventory file (%s)\n", > + p_osmt->opt.file_name ); > status = IB_ERROR; > goto Exit; > } > @@ -3356,7 +3357,7 @@ osmtest_validate_path_data( IN osmtest_t > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmtest_validate_path_data: ERR 0012: " > "PKEY mismatch on path SLID 0x%X to DLID 0x%X\n" > - "\t\t\t\tExpected 0x%X, received 0x%X\n", > + "\t\t\t\tExpected 0x%" PRIx64 ", received 0x%" PRIx64 "\n", > cl_ntoh16( p_path->rec.slid ), > cl_ntoh16( p_path->rec.dlid ), > cl_ntoh64( p_path->rec.pkey ), cl_ntoh64( p_rec->pkey ) ); > @@ -7165,8 +7166,7 @@ osmtest_bind( IN osmtest_t * p_osmt, > { > osm_log( &p_osmt->log, OSM_LOG_ERROR, > "osmtest_bind: ERR 0135: " > - "No local ports. Unable to proceed\n", > - ib_get_err_str( status ) ); > + "No local ports. Unable to proceed\n" ); > goto Exit; > } > guid = attr_array[port_index].port_guid; From halr at voltaire.com Thu Nov 2 15:02:14 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 02 Nov 2006 18:02:14 -0500 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <20061102225432.GB15403@mellanox.co.il> References: <454A4C0B.1080609@ichips.intel.com> <20061102225432.GB15403@mellanox.co.il> Message-ID: <1162508515.15232.19608.camel@hal.voltaire.com> On Thu, 2006-11-02 at 17:54, Michael S. Tsirkin wrote: > Quoting r. Arlin Davis : > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq > > > > Sean Hefty wrote: > > > > >One option is having the SA (or ib_umad?) 
return a busy status in response to a > > >MAD, but we'd still have to be able to send this response as quickly as requests > > >are being received. We could then limit the number of requests that would be > > >queued in the kernel for a user. > > > > > > > > > > Another great option would be to have path record caching. Unfortunately > > OFED 1.1 did not include ib_local_sa in the release. > > > > This won't help you much. > With 256 nodes all to all already gives you 65000 requests > which is the same order of magnitude as the reported 130000. The requests might occur at a different time so they could be spread out rather than synchronized. -- Hal From mst at mellanox.co.il Thu Nov 2 15:14:04 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 3 Nov 2006 01:14:04 +0200 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1162506570.29948.567.camel@brick.pathscale.com> References: <1162506570.29948.567.camel@brick.pathscale.com> Message-ID: <20061102231404.GC15403@mellanox.co.il> Quoting r. Ralph Campbell : > Subject: [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose > > IB/core - Add DMA mapping functions to allow device drivers to interpose > > The QLogic InfiniPath HCAs use programmed I/O instead of HW DMA. > This patch allows a verbs device driver to interpose on DMA mapping > function calls in order to avoid relying on bus_to_virt() and > phys_to_virt() to undo the mappings created by dma_map_single(), > dma_map_sg(), etc. 
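The interposing described above can be pictured with a small userspace sketch. Everything here is an assumption made for illustration: the demo_ names are hypothetical, the "generic" mapper only stands in for dma_map_single(), and the wrapper is a guess at the general shape (use the driver hook if one is registered, otherwise fall back to the generic path), not the code from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_dma_addr_t;	/* stand-in for dma_addr_t */

struct demo_ib_device;

/* Hypothetical, single-entry version of a per-device ops table. */
struct demo_dma_mapping_ops {
	demo_dma_addr_t (*map_single)(struct demo_ib_device *dev,
				      void *ptr, size_t size);
};

struct demo_ib_device {
	struct demo_dma_mapping_ops *dma_ops;	/* NULL: generic path */
};

/* Stand-in for dma_map_single(): pretend the bus address is the
 * virtual address plus a fixed offset. */
demo_dma_addr_t demo_generic_map_single(void *ptr, size_t size)
{
	(void)size;
	return (demo_dma_addr_t)(uintptr_t)ptr + 0x1000;
}

/* What a programmed-I/O driver might register: keep the kernel
 * virtual address, so nothing has to be undone later. */
demo_dma_addr_t demo_pio_map_single(struct demo_ib_device *dev,
				    void *ptr, size_t size)
{
	(void)dev;
	(void)size;
	return (demo_dma_addr_t)(uintptr_t)ptr;
}

/* Wrapper called by consumers: dispatch to the hook if present. */
demo_dma_addr_t demo_ib_dma_map_single(struct demo_ib_device *dev,
				       void *ptr, size_t size)
{
	assert(dev != NULL);
	if (dev->dma_ops && dev->dma_ops->map_single)
		return dev->dma_ops->map_single(dev, ptr, size);
	return demo_generic_map_single(ptr, size);
}
```

The point of the indirection is that a consumer such as SRP or SDP always calls the wrapper and never needs to know which kind of device it is talking to.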
> > From: Ralph Campbell > > diff -r f37bd0e41fec include/rdma/ib_verbs.h > --- a/include/rdma/ib_verbs.h Thu Oct 26 21:44:41 2006 +0700 > +++ b/include/rdma/ib_verbs.h Thu Oct 26 16:10:04 2006 -0800 > @@ -43,6 +43,8 @@ > > #include > #include > +#include > +#include > > #include > #include > @@ -846,6 +848,42 @@ struct ib_cache { > struct ib_pkey_cache **pkey_cache; > struct ib_gid_cache **gid_cache; > u8 *lmc_cache; > +}; > + > +struct ib_dma_mapping_ops { > + int (*mapping_error)(struct ib_device *dev, > + dma_addr_t dma_addr); > + dma_addr_t (*map_single)(struct ib_device *dev, > + void *ptr, size_t size, > + enum dma_data_direction direction); > + void (*unmap_single)(struct ib_device *dev, > + dma_addr_t addr, size_t size, > + enum dma_data_direction direction); > + dma_addr_t (*map_page)(struct ib_device *dev, > + struct page *page, unsigned long offset, > + size_t size, > + enum dma_data_direction direction); > + void (*unmap_page)(struct ib_device *dev, > + dma_addr_t addr, size_t size, > + enum dma_data_direction direction); > + int (*map_sg)(struct ib_device *dev, > + struct scatterlist *sg, int nents, > + enum dma_data_direction direction); > + void (*unmap_sg)(struct ib_device *dev, > + struct scatterlist *sg, int nents, > + enum dma_data_direction direction); > + dma_addr_t (*dma_address)(struct ib_device *dev, > + struct scatterlist *sg); > + unsigned int (*dma_len)(struct ib_device *dev, > + struct scatterlist *sg); > + void (*sync_single_for_cpu)(struct ib_device *dev, > + dma_addr_t dma_handle, > + size_t size, > + enum dma_data_direction dir); > + void (*sync_single_for_device)(struct ib_device *dev, > + dma_addr_t dma_handle, > + size_t size, > + enum dma_data_direction dir); > }; Maybe we should make the API a bit more generic than just matching what ipath needs. Specifically mellanox HCAs (and I expect others) can support *both* dma (in use today) and pushing data "inline" directly into HCA. 
And this actually might make more sense than DMA for small messages. However, this means that the API must give the HCA the choice of what to keep inside the mapping. This could mean, for example, returning a structure that can include dma_addr_t, void*, or both, and a flag to distinguish between the two. Does this make sense? -- MST From mst at mellanox.co.il Thu Nov 2 15:14:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 3 Nov 2006 01:14:54 +0200 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <1162508515.15232.19608.camel@hal.voltaire.com> References: <454A4C0B.1080609@ichips.intel.com> <20061102225432.GB15403@mellanox.co.il> <1162508515.15232.19608.camel@hal.voltaire.com> Message-ID: <20061102231454.GD15403@mellanox.co.il> Quoting r. Hal Rosenstock : > Subject: Re: scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq > > On Thu, 2006-11-02 at 17:54, Michael S. Tsirkin wrote: > > Quoting r. Arlin Davis : > > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq > > > > > > Sean Hefty wrote: > > > > > > >One option is having the SA (or ib_umad?) return a busy status in response to a > > > >MAD, but we'd still have to be able to send this response as quickly as requests > > > >are being received. We could then limit the number of requests that would be > > > >queued in the kernel for a user. > > > > > > > > > > > > > > Another great option would be to have path record caching. Unfortunately > > > OFED 1.1 did not include ib_local_sa in the release. > > > > > > > This won't help you much. > > With 256 nodes all to all already gives you 65000 requests > > which is the same order of magnitude as the reported 130000. 
> > The requests might occur at a different time so they could be spread out > rather than synchronized. I don't see how caching does this. -- MST From ardavis at ichips.intel.com Thu Nov 2 15:17:15 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Thu, 02 Nov 2006 15:17:15 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <20061102225432.GB15403@mellanox.co.il> References: <454A4C0B.1080609@ichips.intel.com> <20061102225432.GB15403@mellanox.co.il> Message-ID: <454A7C7B.5080904@ichips.intel.com> Michael S. Tsirkin wrote: >>Another great option would be to have path record caching. Unfortunately >>OFED 1.1 did not include ib_local_sa in the release. >> >> >> > >This won't help you much. >With 256 nodes all to all already gives you 65000 requests >which is the same order of magnitude as the reported 130000. > > > Am I missing something here? 65,000 requests every 15 minutes (current default) for the entire cluster versus 100-130000 every time I start an application is a big help. Especially on a very large cluster that is batching up smaller independent jobs sharing a single SA and fabric. We either need caching or SA capabilities that can scale up with large clusters. A single service running at 6000 requests/second will not succeed. -arlin. From rdreier at cisco.com Thu Nov 2 15:19:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Nov 2006 15:19:24 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <20061102231454.GD15403@mellanox.co.il> (Michael S. 
Tsirkin's message of "Fri, 3 Nov 2006 01:14:54 +0200") References: <454A4C0B.1080609@ichips.intel.com> <20061102225432.GB15403@mellanox.co.il> <1162508515.15232.19608.camel@hal.voltaire.com> <20061102231454.GD15403@mellanox.co.il> Message-ID: > > With 256 nodes all to all already gives you 65000 requests > > which is the same order of magnitude as the reported 130000. I think the only advantage of caching is if you start your app twice. But maybe you can fill the cache more efficiently by doing a single get table to find all the paths at once, rather than having to let the cma query each path after arping for the GID... - R. From mst at mellanox.co.il Thu Nov 2 15:26:17 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 3 Nov 2006 01:26:17 +0200 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: References: Message-ID: <20061102232617.GE15403@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq > > > > With 256 nodes all to all already gives you 65000 requests > > > which is the same order of magnitude as the reported 130000. > > I think the only advantage of caching is if you start your app twice. > > But maybe you can fill the cache more efficiently by doing a single > get table to find all the paths at once, rather than having to let the > cma query each path after arping for the GID... Actually, this sounds like an excellent idea - this immediately makes the number of queries linear with cluster size. -- MST From rdreier at cisco.com Thu Nov 2 15:25:22 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Nov 2006 15:25:22 -0800 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <20061102231404.GC15403@mellanox.co.il> (Michael S. 
Tsirkin's message of "Fri, 3 Nov 2006 01:14:04 +0200") References: <1162506570.29948.567.camel@brick.pathscale.com> <20061102231404.GC15403@mellanox.co.il> Message-ID: > However, this means that the API must give the HCA the choice of > what to keep inside the mapping. This could mean, for example, returning > a structure that can include dma_addr_t, void*, or both, and a flag to > distinguish between the two. It's an interesting idea. However I think it may be more trouble than it's worth, for at least two reasons. First, the wrapper for dma_map_sg() will probably become really ugly, although maybe there's a clever idea. Second, the consumer right now only gets to pass a 64-bit address into the work request posting functions. I don't think we really want to change that interface, so the driver would have to encode the flag in the address somehow anyway. Also handling highmem is a problem. ipath just depends on 64BIT so it avoids the problem. I guess mthca could only return a kernel virtual address if one exists, and always use DMA for highmem pages. So that isn't really a serious objection. - R. From rdreier at cisco.com Thu Nov 2 15:26:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Nov 2006 15:26:48 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <454A7C7B.5080904@ichips.intel.com> (Arlin Davis's message of "Thu, 02 Nov 2006 15:17:15 -0800") References: <454A4C0B.1080609@ichips.intel.com> <20061102225432.GB15403@mellanox.co.il> <454A7C7B.5080904@ichips.intel.com> Message-ID: > Am I missing something here? 65,000 requests every 15 minutes (current > default) for the entire cluster versus 100-130000 every time I start an > application is a big help. Depends on how long your app takes to run -- if your app only starts once every day or something then the cache refreshing is worse. - R. 
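To put rough numbers on the trade-off argued in this sub-thread, here is a minimal sketch in plain C. The helper names are hypothetical and are not part of any OpenFabrics API; the figures (256 nodes, a 15-minute refresh) are the ones quoted in the messages above. It assumes that a per-pair cache re-fetches every sgid/dgid pair on each refresh, and that a "get table" style fill costs one request per node.

```c
#include <stdint.h>

/* Hypothetical helper: path record queries issued when every node
 * resolves a path to every node, one query per sgid/dgid pair. */
uint64_t all_to_all_queries(uint64_t nodes)
{
    return nodes * nodes;
}

/* Hypothetical helper: queries issued if each node instead asks once
 * for the whole table of paths that start at itself. */
uint64_t get_table_queries(uint64_t nodes)
{
    return nodes;
}

/* Hypothetical helper: queries issued over period_min minutes by a
 * cache that repeats the full per-pair fetch every refresh_min minutes
 * (assumption: no incremental updates). */
uint64_t periodic_refresh_queries(uint64_t nodes,
                                  uint64_t period_min,
                                  uint64_t refresh_min)
{
    return (period_min / refresh_min) * all_to_all_queries(nodes);
}
```

Under these assumptions a single job start on 256 nodes costs 65,536 per-pair queries, a day of 15-minute per-pair refreshes costs 96 times that, and a table-based fill costs 256 requests per pass. Which of the three dominates depends on how often the application is started, which is the point being debated.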
From rdreier at cisco.com Thu Nov 2 15:27:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Nov 2006 15:27:45 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <20061102232617.GE15403@mellanox.co.il> (Michael S. Tsirkin's message of "Fri, 3 Nov 2006 01:26:17 +0200") References: <20061102232617.GE15403@mellanox.co.il> Message-ID: > Actually, this sounds like an excellent idea - this immediately makes the number > of queries linear with cluster size. Maybe -- I wonder how scalable our RMPP implementation is though. What will happen on the SM node with 256 RMPP requests returning 256 paths each? How about 1024 * 1024? - R. From mst at mellanox.co.il Thu Nov 2 15:44:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Fri, 3 Nov 2006 01:44:43 +0200 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: Message-ID: <20061102234442.GF15403@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose > > > However, this means that the API must give the HCA the choice of > > what to keep inside the mapping. This could mean, for example, returning > > a structure that can include dma_addr_t, void*, or both, and a flag to > > distinguish between the two. > > It's an interesting idea. However I think it may be more trouble than it's > worth, for at least two reasons. First, the wrapper for dma_map_sg() will > probably become really ugly, although maybe there's a clever idea. Oh, my guess is s/g is usually for long messages so we can just always do dma in that case. > Second, > the consumer right now only gets to pass a 64-bit address into the work > request posting functions. 
I don't think we really want to change that > interface, so the driver would have to encode the flag in the address somehow > anyway. But how? Wait, work request posting functions actually get a virtual address and a key, not a dma address. Maybe something can be done with this? Say, we have get_dma_mr at the moment - maybe we could have a special mr, and let the dma functions also select which mr to use? > Also handling highmem is a problem. ipath just depends on 64BIT so it > avoids the problem. I guess mthca could only return a kernel virtual > address if one exists, and always use DMA for highmem pages. So that > isn't really a serious objection. Right. -- MST From ralph.campbell at qlogic.com Thu Nov 2 15:50:29 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 02 Nov 2006 15:50:29 -0800 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <20061102234442.GF15403@mellanox.co.il> References: <20061102234442.GF15403@mellanox.co.il> Message-ID: <1162511429.29948.583.camel@brick.pathscale.com> On Fri, 2006-11-03 at 01:44 +0200, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: Re: [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose > > > > > However, this means that the API must give the HCA the choice of > > > what to keep inside the mapping. This could mean, for example, returning > > > a structure that can include dma_addr_t, void*, or both, and a flag to > > > distinguish between the two. > > > > It's an interesting idea. However I think it may be more trouble than it's > > worth, for at least two reasons. First, the wrapper for dma_map_sg() will > > probably become really ugly, although maybe there's a clever idea. > > Oh, my guess is s/g is usually for long messages so we can just always do dma in > that case. 
> > > Second, > > the consumer right now only gets to pass a 64-bit address into the work > > request posting functions. I don't think we really want to change that > > interface, so the driver would have to encode the flag in the address somehow > > anyway. > > But how? > Wait, work request posting functions actually get a virtual > address and a key, not a dma address. Maybe something can be done with this? > Say, we have get_dma_mr at the moment - maybe we could have a special > mr, and let the dma functions also select which mr to use? > > > > Also handling highmem is a problem. ipath just depends on 64BIT so it > > avoids the problem. I guess mthca could only return a kernel virtual > > address if one exists, and always use DMA for highmem pages. So that > > isn't really a serious objection. > > Right. I'm open to suggestions if you have a proposal for making the interface more usable. From mshefty at ichips.intel.com Thu Nov 2 15:50:25 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 15:50:25 -0800 Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support In-Reply-To: <454A672F.5010309@3leafnetworks.com> References: <000001c6feb5$919bcb80$7ffc070a@amr.corp.intel.com> <454A5FE1.4090809@3leafnetworks.com> <454A5BAF.9020309@ichips.intel.com> <454A672F.5010309@3leafnetworks.com> Message-ID: <454A8441.70606@ichips.intel.com> Venkatesh Babu wrote: > I made the change you suggested. > On Active node I got the event IB_EVENT_PATH_MIG and then send failed > with IB_WC_RETRY_EXC_ERR. Ok - I will continue debugging this once I complete a test program. Thanks for the assistance. > On Passive node I got 100 IB_EVENT_PATH_MIG_ERR events. This sounds like a bug in the stack. 
- Sean From todd.rimmer at qlogic.com Thu Nov 2 16:11:30 2006 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Thu, 2 Nov 2006 18:11:30 -0600 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061190BD8B2@EPEXCH2.qlogic.org> > From: Michael S. Tsirkin > Sent: Thursday, November 02, 2006 5:55 PM > To: Arlin Davis > Cc: Or Gerlitz; openib-general; Arlin Davis > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support > for address and route retries, call disconnect when recving dreq > > Quoting r. Arlin Davis : > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add > support for address and route retries, call disconnect when recving dreq > > > > Sean Hefty wrote: > > > > >One option is having the SA (or ib_umad?) return a busy status in > response to a > > >MAD, but we'd still have to be able to send this response as quickly as > requests > > >are being received. We could then limit the number of requests that > would be > > >queued in the kernel for a user. > > > > > > > > > > Another great option would be to have path record caching. Unfortunately > > OFED 1.1 did not include ib_local_sa in the release. > > > > This won't help you much. > With 256 nodes all to all already gives you 65000 requests > which is the same order of magnitude as the reported 130000. We have SA caching working quite well with very large clusters. Here are some techniques which make it much more efficient: 1. A given node only cares about path records relevant to it. So only ask for path records where it is the source. 2. Use SA notices for GID in/out of service to trigger cache updates, and only then for the specific GID which has changed. In the background, refresh all cache entries slowly and infrequently just in case a notice was lost; however, IBTA does allow retries and Acks of notices, so this will be infrequent. 3.
Limit the number of outstanding SA queries from a given node; this avoids one node blasting the SM. There is a little more to it, but those should be the main points relevant to this discussion. Todd Rimmer From todd.rimmer at qlogic.com Thu Nov 2 16:15:28 2006 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Thu, 2 Nov 2006 18:15:28 -0600 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061190BD8B3@EPEXCH2.qlogic.org> > From: Michael S. Tsirkin > Sent: Thursday, November 02, 2006 6:15 PM > To: Hal Rosenstock > Cc: Or Gerlitz; openib-general; Arlin R Davis > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add support > for address and route retries, call disconnect when recving dreq > > Quoting r. Hal Rosenstock : > > Subject: Re: scaling issues, was: uDAPL cma: add support for address and > route retries, call disconnect when recving dreq > > > > On Thu, 2006-11-02 at 17:54, Michael S. Tsirkin wrote: > > > Quoting r. Arlin Davis : > > > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add > support for address and route retries, call disconnect when recving dreq > > > > > > > > Sean Hefty wrote: > > > > > > > > >One option is having the SA (or ib_umad?) return a busy status in > response to a > > > > >MAD, but we'd still have to be able to send this response as > quickly as requests > > > > >are being received. We could then limit the number of requests > that would be > > > > >queued in the kernel for a user. > > > > > > > > > > > > > > > > > > Another great option would be to have path record caching. > Unfortunately > > > > OFED 1.1 did not include ib_local_sa in the release. > > > > > > > > > > This won't help you much. > > > With 256 nodes all to all already gives you 65000 requests > > > which is the same order of magnitude as the reported 130000.
> > > > The requests might occur at a different time so they could be spread out > > rather than synchronized. > > I don't see how caching does this. > If all the queries are made at app startup, there will be one huge batch of queries to the SA, especially for a many-process MPI job. In contrast, if SA caching is building its own replica of the relevant subset of the SA, the pace can be more controlled. It can even be purposely randomized by the SA cache code itself (e.g., don't just do it every 10 minutes, do it every 10 minutes +/- a random number, etc.). This way, if all nodes powered on at a similar time, you won't have a pattern of everyone asking the SM at the same time. Todd Rimmer From mshefty at ichips.intel.com Thu Nov 2 16:22:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 16:22:38 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <20061102225432.GB15403@mellanox.co.il> References: <454A4C0B.1080609@ichips.intel.com> <20061102225432.GB15403@mellanox.co.il> Message-ID: <454A8BCE.2060508@ichips.intel.com> Michael S. Tsirkin wrote: > This won't help you much. > With 256 nodes all to all already gives you 65000 requests > which is the same order of magnitude as the reported 130000. A cache for 256 nodes only generates 256 requests. Each request is a get table from a given sgid. The all to all connection model generates n^2 requests because each request is a get for a given sgid/dgid pair. Additionally, cached requests can be done when the application isn't running, with a fairly long or infinite update time. Arlin and I have discussed some caching options, including having multiple cache service daemons running on the subnet. If more than one service is running, a node can select a particular service to communicate with. Communication can be done using RC to reduce the MAD overhead.
- Sean From mshefty at ichips.intel.com Thu Nov 2 16:51:38 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 02 Nov 2006 16:51:38 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: References: <20061102232617.GE15403@mellanox.co.il> Message-ID: <454A929A.7000406@ichips.intel.com> > Maybe -- I wonder how scalable our RMPP implementation is though. > What will happen on the SM node with 256 RMPP requests returning 256 > paths each? How about 1024 * 1024? The RMPP implementation doesn't expect to receive lots of simultaneous RMPP transactions, and expects them to be reasonably behaved. It uses linear lookups, so there may be room for improvement here. However, I would expect most scalability issues to be on the end receiving an RMPP message, rather than on the send side. - Sean From krkumar2 at in.ibm.com Fri Nov 3 00:29:45 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 3 Nov 2006 13:59:45 +0530 Subject: [openib-general] Question on ucma Message-ID: Hi, I installed the 2.6.19-rc3 bits, and when I try to run perftest/rdma_bw (with '-c' option), I get the error : "librdmacm: Couldnt open rdma_cm ABI version". I found that this is due to ucma not being present in mainline kernel bits (which creates /sys/class/misc/rdma_cm). So how can I resolve this and run these tests ? Thanks, - KK From halr at voltaire.com Fri Nov 3 03:22:29 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 06:22:29 -0500 Subject: [openib-general] [PATCH] opensm: reuse PKey values for "dynamic" partitions. In-Reply-To: <20061101203936.GC9985@sashak.voltaire.com> References: <20061101203936.GC9985@sashak.voltaire.com> Message-ID: <1162552935.15232.49418.camel@hal.voltaire.com> On Wed, 2006-11-01 at 15:39, Sasha Khapyorsky wrote: > When partition is specified in partition configuration file without > desired PKey value OpenSM will generate one dynamically. 
> > The problem is that when the list of such "dynamic" partitions is > edited (some partitions are removed and/or some added), PKey values > will be regenerated again and reassigned. > > This patch fixes this undesired behavior. Now OpenSM will try to reuse > PKey values for such "dynamic" partitions. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From halr at voltaire.com Fri Nov 3 03:55:52 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 06:55:52 -0500 Subject: [openib-general] [PATCH v2] opensm: remove obsolete p_report_buf In-Reply-To: <20061101184559.GC22655@sashak.voltaire.com> References: <20061101184559.GC22655@sashak.voltaire.com> Message-ID: <1162554925.15232.50688.camel@hal.voltaire.com> On Wed, 2006-11-01 at 13:45, Sasha Khapyorsky wrote: > This removes obsolete now shared sm->p_report_buf buffer and cleans > up related code. And also introduces new log function osm_log_printf() > which currently trivially sends formatted output to stdout. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From johnt1johnt2 at gmail.com Fri Nov 3 04:28:45 2006 From: johnt1johnt2 at gmail.com (john t) Date: Fri, 3 Nov 2006 17:58:45 +0530 Subject: [openib-general] broadcast Message-ID: Hi, I haven't tried this, but I want to hear from IB experts whether IB broadcast (where the switch does the real broadcasting task and all the hosts are connected to a single switch) is as fast as point-to-point IB data transfer, and secondly whether data transfer over a UD QP is faster than over an RC QP, just as UDP is faster than TCP. Regards, John T. From swise at opengridcomputing.com Fri Nov 3 06:22:56 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 03 Nov 2006 08:22:56 -0600 Subject: [openib-general] Question on ucma In-Reply-To: References: Message-ID: <1162563776.10969.2.camel@stevo-desktop> Sean posted 7 patches that include the ucma support.
You'll need those + the one librdmacm patch he posted. Steve. On Fri, 2006-11-03 at 13:59 +0530, Krishna Kumar2 wrote: > Hi, > > I installed the 2.6.19-rc3 bits, and when I try to run > perftest/rdma_bw (with '-c' option), I get the error : > "librdmacm: Couldnt open rdma_cm ABI version". > > I found that this is due to ucma not being present in > mainline kernel bits (which creates /sys/class/misc/rdma_cm). > So how can I resolve this and run these tests ? > > Thanks, > > - KK > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Fri Nov 3 07:59:33 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 3 Nov 2006 17:59:33 +0200 Subject: [openib-general] [PATCH] opensm: x86_64 related osm_log() format fixes Message-ID: <20061103155933.GA32197@sashak.voltaire.com> x86_64 related osm_log() format fixes. Found with strict format checking attribute. 
Signed-off-by: Sasha Khapyorsky --- osm/libvendor/osm_vendor_ibumad_sa.c | 2 +- osm/opensm/osm_mcast_mgr.c | 2 +- osm/opensm/osm_sa_link_record.c | 4 ++-- osm/opensm/osm_sa_multipath_record.c | 8 ++++---- osm/opensm/osm_sa_path_record.c | 4 ++-- osm/opensm/osm_state_mgr.c | 6 +++--- osm/opensm/osm_ucast_updn.c | 2 +- osm/osmtest/osmtest.c | 18 +++++++++--------- 8 files changed, 23 insertions(+), 23 deletions(-) diff --git a/osm/libvendor/osm_vendor_ibumad_sa.c b/osm/libvendor/osm_vendor_ibumad_sa.c index 7c4a2f7..7ef77de 100644 --- a/osm/libvendor/osm_vendor_ibumad_sa.c +++ b/osm/libvendor/osm_vendor_ibumad_sa.c @@ -159,7 +159,7 @@ #else ( ( p_madw->mad_size - IB_SA_MAD_HDR_SIZE ) / ib_get_attr_size( p_sa_mad->attr_offset ) ); osm_log( p_bind->p_log, OSM_LOG_DEBUG, - "__osmv_sa_mad_rcv_cb: Count = %u = %u / %u (%u)\n", + "__osmv_sa_mad_rcv_cb: Count = %u = %zu / %u (%zu)\n", query_res.result_cnt, p_madw->mad_size - IB_SA_MAD_HDR_SIZE, ib_get_attr_size( p_sa_mad->attr_offset ), ( p_madw->mad_size - IB_SA_MAD_HDR_SIZE ) % diff --git a/osm/opensm/osm_mcast_mgr.c b/osm/opensm/osm_mcast_mgr.c index 82ef7c3..5cb97be 100644 --- a/osm/opensm/osm_mcast_mgr.c +++ b/osm/opensm/osm_mcast_mgr.c @@ -812,7 +812,7 @@ __osm_mcast_mgr_branch( { osm_log( p_mgr->p_log, OSM_LOG_DEBUG, "__osm_mcast_mgr_branch: " - "Routing %u destination(s) via switch port 0x%X\n", + "Routing %zu destination(s) via switch port 0x%X\n", count, i ); } diff --git a/osm/opensm/osm_sa_link_record.c b/osm/opensm/osm_sa_link_record.c index 2bc16a3..b4e4aef 100644 --- a/osm/opensm/osm_sa_link_record.c +++ b/osm/opensm/osm_sa_link_record.c @@ -568,7 +568,7 @@ #endif (num_rec > 1)) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_lr_rcv_respond: ERR 1806: " - "Got more than one record for SubnAdmGet (%u)\n", + "Got more than one record for SubnAdmGet (%zu)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); @@ -600,7 +600,7 @@ #endif { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, 
"__osm_lr_rcv_respond: " - "Generating response with %u records", num_rec ); + "Generating response with %zu records", num_rec ); } /* diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c index 5f70369..0151374 100644 --- a/osm/opensm/osm_sa_multipath_record.c +++ b/osm/opensm/osm_sa_multipath_record.c @@ -295,7 +295,7 @@ __osm_mpr_rcv_get_path_parms( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_mpr_rcv_get_path_parms: ERR 4518: " "Ports do not share specified PKey 0x%04x\n" - "\t\tsrc %Lx dst %Lx\n", + "\t\tsrc %" PRIx64 " dst %" PRIx64 "\n", cl_ntoh16( required_pkey ), cl_ntoh64( osm_physp_get_port_guid( p_physp ) ), cl_ntoh64( osm_physp_get_port_guid( p_dest_physp ) ) ); @@ -308,7 +308,7 @@ __osm_mpr_rcv_get_path_parms( osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_mpr_rcv_get_path_parms: ERR 4519: " "Ports do not have any shared PKeys\n" - "\t\tsrc %Lx dst %Lx\n", + "\t\tsrc %" PRIx64 " dst %" PRIx64 "\n", cl_ntoh64( osm_physp_get_port_guid( p_physp ) ), cl_ntoh64( osm_physp_get_port_guid( p_dest_physp ) ) ); status = IB_NOT_FOUND; @@ -529,7 +529,7 @@ __osm_mpr_rcv_get_path_parms( if (vl == IB_DROP_VL) { /* discard packet */ osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, "__osm_mpr_rcv_get_path_parms: Path not found for SL %d\n" - "\t\tin_port_num %d port_guid %Lx\n", + "\t\tin_port_num %d port_guid %" PRIx64 "\n", required_sl, in_port_num, cl_ntoh64( osm_physp_get_port_guid( p_physp ) ) ); status = IB_NOT_FOUND; @@ -1467,7 +1467,7 @@ __osm_mpr_rcv_respond( osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_mpr_rcv_respond: " - "Generating response with %u records\n", num_rec ); + "Generating response with %zu records\n", num_rec ); mad_size = IB_SA_MAD_HDR_SIZE + num_rec * sizeof(ib_path_rec_t); diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c index a93ce10..dca3603 100644 --- a/osm/opensm/osm_sa_path_record.c +++ b/osm/opensm/osm_sa_path_record.c @@ -1661,7 +1661,7 @@ #endif { osm_log( p_rcv->p_log, 
OSM_LOG_ERROR, "__osm_pr_rcv_respond: ERR 1F13: " - "Got more than one record for SubnAdmGet (%u)\n", + "Got more than one record for SubnAdmGet (%zu)\n", num_rec ); osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_TOO_MANY_RECORDS ); @@ -1691,7 +1691,7 @@ #endif osm_log( p_rcv->p_log, OSM_LOG_DEBUG, "__osm_pr_rcv_respond: " - "Generating response with %u records\n", num_rec ); + "Generating response with %zu records\n", num_rec ); if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0)) { diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 70af836..993b7eb 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -1801,7 +1801,7 @@ __osm_state_mgr_check_tbl_consistency( * new lid we wanted to give it in our port_lid_tbl. */ osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_check_tbl_consistency: ERR 3322: " - "lid 0x%X is wrongly assigned to port 0x%016" PRIx64 + "lid 0x%zX is wrongly assigned to port 0x%016" PRIx64 " in port_lid_tbl\n", lid, cl_ntoh64( osm_port_get_guid( p_port_stored ) ) ); } @@ -1815,7 +1815,7 @@ __osm_state_mgr_check_tbl_consistency( osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_check_tbl_consistency: ERR 3323: " "port 0x%016" PRIx64 " exists in new port_lid_tbl under " - "lid 0x%X, but missing in subnet port_lid_tbl db\n", + "lid 0x%zX, but missing in subnet port_lid_tbl db\n", cl_ntoh64( osm_port_get_guid( p_port_ref ) ), lid ); } else @@ -1826,7 +1826,7 @@ __osm_state_mgr_check_tbl_consistency( * and p_port_ref also didn't get the lid update. 
*/ osm_log( p_mgr->p_log, OSM_LOG_ERROR, "__osm_state_mgr_check_tbl_consistency: ERR 3324: " - "lid 0x%X has port 0x%016" PRIx64 + "lid 0x%zX has port 0x%016" PRIx64 " in new port_lid_tbl db, " "and port 0x%016" PRIx64 " in subnet port_lid_tbl db\n", lid, cl_ntoh64( osm_port_get_guid( p_port_ref ) ), diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c index b2159d3..8abad21 100644 --- a/osm/opensm/osm_ucast_updn.c +++ b/osm/opensm/osm_ucast_updn.c @@ -360,7 +360,7 @@ __updn_bfs_by_node( { osm_log(&(osm.log), OSM_LOG_DEBUG, "__updn_bfs_by_node:" - "Starting a new iteration with %d elements in current list\n", + "Starting a new iteration with %zu elements in current list\n", cl_list_count(p_currList)); /* Init the switch directed list */ p_nextList = (cl_list_t*)malloc(sizeof(cl_list_t)); diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c index 345a5c3..7dcc6e9 100644 --- a/osm/osmtest/osmtest.c +++ b/osm/osmtest/osmtest.c @@ -2117,7 +2117,7 @@ osmtest_write_all_link_recs( IN osmtest_ { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_write_all_link_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } result = fprintf( fh, "#\n" "# Link Records\n" "#\n" ); @@ -2264,7 +2264,7 @@ osmtest_write_all_node_recs( IN osmtest_ { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_write_all_node_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } result = fprintf( fh, "#\n" "# Node Records\n" "#\n" ); @@ -2339,7 +2339,7 @@ osmtest_write_all_port_recs( IN osmtest_ { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_write_all_port_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } result = fprintf( fh, "#\n" "# PortInfo Records\n" "#\n" ); @@ -2414,7 +2414,7 @@ osmtest_write_all_path_recs( { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_write_all_path_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } 
result = fprintf( fh, "#\n" "# Path Records\n" "#\n" ); @@ -4205,7 +4205,7 @@ osmtest_validate_all_node_recs( IN osmte { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_validate_all_node_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } /* @@ -4290,7 +4290,7 @@ osmtest_validate_all_guidinfo_recs( IN o { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_validate_all_guidinfo_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } /* No validation as yet */ @@ -4345,7 +4345,7 @@ osmtest_validate_all_path_recs( IN osmte { osm_log( &p_osmt->log, OSM_LOG_VERBOSE, "osmtest_validate_all_path_recs: " - "Received %u records\n", num_recs ); + "Received %zu records\n", num_recs ); } /* @@ -4849,7 +4849,7 @@ osmtest_validate_single_path_rec_lid_pai { osm_log( &p_osmt->log, OSM_LOG_ERROR, "osmtest_validate_single_path_rec_lid_pair: ERR 0103: " - "Too many records. Expected 1, received %u\n", num_recs ); + "Too many records. Expected 1, received %zu\n", num_recs ); status = IB_ERROR; } @@ -5118,7 +5118,7 @@ osmtest_validate_single_path_rec_guid_pa num_recs = context.result.result_cnt; osm_log( &p_osmt->log, OSM_LOG_VERBOSE, - "osmtest_validate_single_path_rec_guid_pair: %u records\n", + "osmtest_validate_single_path_rec_guid_pair: %zu records\n", num_recs); for( i = 0; i < num_recs; i++ ) -- 1.4.3.2.g4bf7 From halr at voltaire.com Fri Nov 3 08:35:46 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 11:35:46 -0500 Subject: [openib-general] [PATCH] opensm: x86_64 related osm_log() format fixes In-Reply-To: <20061103155933.GA32197@sashak.voltaire.com> References: <20061103155933.GA32197@sashak.voltaire.com> Message-ID: <1162571651.15232.61401.camel@hal.voltaire.com> On Fri, 2006-11-03 at 10:59, Sasha Khapyorsky wrote: > x86_64 related osm_log() format fixes. Found with strict format checking > attribute. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. 
-- Hal From halr at voltaire.com Fri Nov 3 08:45:41 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 11:45:41 -0500 Subject: [openib-general] [PATCH] opensm: strict osm_log arguments/format check In-Reply-To: <20061102105348.GA16559@sashak.voltaire.com> References: <20061102105348.GA16559@sashak.voltaire.com> Message-ID: <1162572314.15232.61741.camel@hal.voltaire.com> On Thu, 2006-11-02 at 05:53, Sasha Khapyorsky wrote: > This adds gcc attribute to osm_log() which causes the compiler to check > argument types against a format string. And also there are related fixes > in osm_log() usage in opensm and osmtest. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied (in two parts). The strict checking is now enabled. -- Hal From jlentini at netapp.com Fri Nov 3 09:48:20 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 3 Nov 2006 12:48:20 -0500 (EST) Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <20061102234442.GF15403@mellanox.co.il> References: <20061102234442.GF15403@mellanox.co.il> Message-ID: On Fri, 3 Nov 2006, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: Re: [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose > > > > > However, this means that the API must give the HCA the choice of > > > what to keep inside the mapping. This could mean, for example, returning > > > a structure that can include dma_addr_t, void*, or both, and a flag to > > > distinguish between the two. > > > > It's an interesting idea. However I think it may be more trouble than it's > > worth, for at least two reasons. First, the wrapper for dma_map_sg() will > > probably become really ugly, although maybe there's a clever idea. > > Oh, my guess is s/g is usually for long messages so we can just always do dma in > that case. 
> > > Second, > > the consumer right now only gets to pass a 64-bit address into the work > > request posting functions. I don't think we really want to change that > > interface, so the driver would have to encode the flag in the address somehow > > anyway. > > But how? > Wait, work request posting functions actually get a virtual > address and a key, not a dma address. Work requests posted using ib_post_send/recv are specified using a dma address obtained using the appropriate Linux DMA mapping API function. They are not virtual addresses. > Maybe something can be done with this? Say, we have get_dma_mr at > the moment - maybe we could have a special mr, and let the dma > functions also select which mr to use? > > > > Also handling highmem is a problem. ipath just depends on 64BIT so it > > avoids the problem. I guess mthca could only return a kernel virtual > > address if one exists, and always use DMA for highmem pages. So that > > isn't really a serious objection. > > Right. > > -- > MST From jlentini at netapp.com Fri Nov 3 09:54:15 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 3 Nov 2006 12:54:15 -0500 (EST) Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <1162506626.29948.568.camel@brick.pathscale.com> References: <1162506626.29948.568.camel@brick.pathscale.com> Message-ID: Ralph, Could you add documentation to Documentation/infiniband/ explaining when a ULP needs to use the new verbs DMA mapping functions (what structures/function parameters must be specified using this API)? Did you consider creating a new type, ib_dma_addr_t, and using this type for the verb structure members/function parameters that require it?
james From swise at opengridcomputing.com Fri Nov 3 10:08:21 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 03 Nov 2006 12:08:21 -0600 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <20061102234442.GF15403@mellanox.co.il> Message-ID: <1162577301.3328.4.camel@stevo-desktop> > > > > But how? > > Wait, work request posting functions actually get a virtual > > address and a key, not a dma address. > > Work requests posted using ib_post_send/recv are specified using a dma > address obtained using the appropriate Linux DMA mapping API function. > They are not virtual addresses. > This isn't necessarily true. The addr field in the WR is a dma addr when using an MR allocated via ib_get_dma_mr(). For MRs allocated via ib_reg_phys_mr(), the addr field is something relative to the iova_start u64 passed into ib_reg_phys_mr(). Right? From ralph.campbell at qlogic.com Fri Nov 3 10:09:22 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 03 Nov 2006 10:09:22 -0800 Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: References: <1162506626.29948.568.camel@brick.pathscale.com> Message-ID: <1162577362.29948.600.camel@brick.pathscale.com> On Fri, 2006-11-03 at 12:54 -0500, James Lentini wrote: > Ralph, > > Could you add documentation to Documentation/infiniband/ explainin > when a ULP needs to use the new verbs DMA mapping functions (what > structures/function parameters must be specified using this API)? Good idea. I will. > Did you consider creating a new type, ib_dma_addr_t, and using this > type for the verb structure members/function parameters that require > it? I didn't really think about it since I was trying to make identical replacements for dma_map_single(), etc. which return dma_addr_t. That would pretty much force ib_dma_addr_t to be the same as dma_addr_t.
From ralph.campbell at qlogic.com Fri Nov 3 10:20:03 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 03 Nov 2006 10:20:03 -0800 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1162577301.3328.4.camel@stevo-desktop> References: <20061102234442.GF15403@mellanox.co.il> <1162577301.3328.4.camel@stevo-desktop> Message-ID: <1162578003.29948.613.camel@brick.pathscale.com> On Fri, 2006-11-03 at 12:08 -0600, Steve Wise wrote: > > > > > > But how? > > > Wait, work request posting functions actually get a virtual > > > address and a key, not a dma address. > > > > Work requests posted using ib_post_send/recv are specified using a dma > > address obtained using the appropriate Linux DMA mapping API function. > > They are not virtual addresses. > > > > This isn't necessarily true. The addr field in the WR is a dma addr > when using a MR allocated via ib_get_dma_mr(). For MRs allocated via > ib_reg_phys_mem(), then the addr field is something relative to the > iova_start u64 passed into the ib_reg_phys_mem(). Right? The addr fields in the send and receive work requests are actually offsets within the memory region specified by the Lkey. The OpenFabrics verbs layer has chosen to define the Lkey returned from ib_get_dma_mr() as a memory region representing all of physical memory, and the offset as a device-specific bus address, which should be created with the ib_dma_*() functions I just posted. Similar rules apply to ib_map_phys_fmr() and ib_phys_buf. I will write this up, as James has suggested, in a Documentation/infiniband/memory_regions.txt file.
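The rule Ralph describes (the work request address is an offset inside whatever region the Lkey names) can be pictured with a small toy model. This is only a sketch for illustration: the MemoryRegion class, the key values, and sge_is_valid() are hypothetical names, not kernel structures or verbs calls, and the "whole address space" region is an assumption standing in for what ib_get_dma_mr() is said to return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """Toy stand-in for a registered region (hypothetical, not struct ib_mr)."""
    lkey: int
    iova_start: int  # first address the region accepts in a work request
    length: int      # number of bytes the region covers

    def contains(self, addr: int, nbytes: int) -> bool:
        # A gather entry is acceptable only if it lies wholly inside the region.
        return self.iova_start <= addr and addr + nbytes <= self.iova_start + self.length


# Assumed model of the region behind ib_get_dma_mr(): it starts at 0 and spans
# the full 64-bit space, so the "offset" is simply the bus address itself.
DMA_MR = MemoryRegion(lkey=0x100, iova_start=0, length=2**64)

# Assumed model of a region registered over one 8 KB buffer with a
# caller-chosen iova_start.
PHYS_MR = MemoryRegion(lkey=0x200, iova_start=0x10000, length=8192)


def sge_is_valid(mr: MemoryRegion, addr: int, nbytes: int) -> bool:
    """Return True if (addr, nbytes) falls inside the region named by mr.lkey."""
    return mr.contains(addr, nbytes)
```

In this model any bus address is acceptable against DMA_MR, while PHYS_MR only accepts addresses between its iova_start and iova_start + length.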
From jlentini at netapp.com Fri Nov 3 10:52:18 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 3 Nov 2006 13:52:18 -0500 (EST) Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1162577301.3328.4.camel@stevo-desktop> References: <20061102234442.GF15403@mellanox.co.il> <1162577301.3328.4.camel@stevo-desktop> Message-ID: On Fri, 3 Nov 2006, Steve Wise wrote: > > > > > > But how? > > > Wait, work request posting functions actually get a virtual > > > address and a key, not a dma address. > > > > Work requests posted using ib_post_send/recv are specified using a dma > > address obtained using the appropriate Linux DMA mapping API function. > > They are not virtual addresses. > > > > This isn't necessarily true. The addr field in the WR is a dma addr > when using a MR allocated via ib_get_dma_mr(). For MRs allocated via > ib_reg_phys_mem(), then the addr field is something relative to the > iova_start u64 passed into the ib_reg_phys_mem(). Right? The only documentation I've ever seen on this topic is here: http://www.linuxjournal.com/article/8009 It says: "The address in the gather list is a DMA address obtained from dma_map_single() rather than a virtual address" NFS-RDMA is the only ULP I know of that uses ib_reg_phys_mr(). NFS-RDMA uses the DMA mapping API to initialize the iova_start address passed into ib_reg_phys_mr(). That is what the Mellanox driver expected (at the time, that was the best documentation I had). Subsequent work request addresses are calculated based on the iova_start address (as you point out), but the iova_start was generated using the DMA mapping API. 
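The arithmetic James describes, where later work request addresses are derived from iova_start, might look like the sketch below. It is a toy model under stated assumptions: the buffer list is a plain list of (dma_addr, size) pairs loosely modelled on ib_phys_buf, the addresses are made up, and wr_address() and translate() are hypothetical helpers, not part of any verbs API.

```python
def wr_address(iova_start: int, byte_offset: int) -> int:
    """Address to place in a work request for a given offset into the region."""
    return iova_start + byte_offset


def translate(phys_bufs, iova_start: int, wr_addr: int):
    """Map a work request address back to (buffer index, DMA address).

    phys_bufs is a list of (dma_addr, size) pairs in registration order.
    """
    offset = wr_addr - iova_start
    if offset < 0:
        raise ValueError("address is before the start of the region")
    for index, (dma_addr, size) in enumerate(phys_bufs):
        if offset < size:
            return index, dma_addr + offset
        offset -= size
    raise ValueError("address is past the end of the region")


# Two made-up 4 KB buffers; iova_start is taken from the DMA address of the
# first one, as the message says NFS-RDMA did.
bufs = [(0x7F000, 4096), (0x20000, 4096)]
iova_start = bufs[0][0]
```

With these values, an offset of 5000 bytes lands 904 bytes into the second buffer, even though the two buffers are not contiguous in bus address space.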
From jlentini at netapp.com Fri Nov 3 10:57:02 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 3 Nov 2006 13:57:02 -0500 (EST) Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <1162577362.29948.600.camel@brick.pathscale.com> References: <1162506626.29948.568.camel@brick.pathscale.com> <1162577362.29948.600.camel@brick.pathscale.com> Message-ID: On Fri, 3 Nov 2006, Ralph Campbell wrote: > On Fri, 2006-11-03 at 12:54 -0500, James Lentini wrote: > > Ralph, > > > > Could you add documentation to Documentation/infiniband/ explainin > > when a ULP needs to use the new verbs DMA mapping functions (what > > structures/function parameters must be specified using this API)? > > Good idea. I will. Thank you Ralph. This will be a big help to present and future ULP authors. > > Did you consider creating a new type, ib_dma_addr_t, and using this > > type for the verb structure members/function parameters that require > > it? > > I didn't really think about it since I was trying to make identical > replacements for dma_map_single(), etc. which return dma_addr_t. > That would pretty much force ib_dma_addr_t to be the same as > dma_addr_t. I suggested it as a way to self-document the verbs API. It would make clear that for certain variables a call to the IB DMA mapping API was necessary. Of course, that benefit has to be weighed against the obfuscation created by a new typedef.
From mshefty at ichips.intel.com Fri Nov 3 11:41:50 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 03 Nov 2006 11:41:50 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <454A3476.6090402@ichips.intel.com> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> <454A3476.6090402@ichips.intel.com> Message-ID: <454B9B7E.5000403@ichips.intel.com> We were able to get some more test time on the cluster. Our latest findings are below. > The main issue that we saw was that the SA simply doesn't scale. From what we could see, it didn't appear that _any_ path record queries were ever lost, even when scaling up to 500,000+ requests. As long as the query timeouts were large enough (dependent on process count), our tests would finish within a reasonable time, and without retrying queries. If the timeout values were too small, the SA would form a backlog of timed out requests. With 1024 processes trying to establish all to all connections, it would take about 30 seconds for all nodes to complete path record queries. The SA was able to sustain about 17,000 queries per second. >>Was the issue with address resolution being ARP request or reply >>messages getting lost? We only just started looking into this when we were bumped off the cluster. In our initial peek at this, it looked like either the ARP requests or replies were being discarded on transmit. Simply increasing the ARP cache timeout fixed most of the problems for us. > The disconnect delay occurred because of remote nodes being slow to respond to > disconnect requests. We're still investigating this issue. This was a DAPL issue. 
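The figures quoted above (1024 processes, 500,000+ requests, about 17,000 queries per second, about 30 seconds) appear to be mutually consistent, as the back-of-the-envelope check below suggests. It assumes one path record query per unordered pair of processes, which the message does not state; the function names are hypothetical.

```python
def all_to_all_pairs(processes: int) -> int:
    """Number of unordered process pairs in an all-to-all connection pattern."""
    return processes * (processes - 1) // 2


def seconds_to_serve(queries: int, queries_per_second: float) -> float:
    """Time for the SA to answer all queries at a sustained rate."""
    return queries / queries_per_second


pairs = all_to_all_pairs(1024)             # assumed: one query per pair
elapsed = seconds_to_serve(pairs, 17_000)  # rate taken from the message
print(pairs, round(elapsed, 1))
```

Under that assumption the pair count comes to a little over half a million and the drain time to roughly half a minute, in line with what was observed.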
- Sean From fwang2 at ornl.gov Fri Nov 3 12:12:39 2006 From: fwang2 at ornl.gov (Feiyi Wang) Date: Fri, 3 Nov 2006 15:12:39 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <1162482544.15232.585.camel@hal.voltaire.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <1162475595.29957.109003.camel@hal.voltaire.com> <23e627a30611020720i4a268098h3bf1549621e16f0@mail.gmail.com> <1162482544.15232.585.camel@hal.voltaire.com> Message-ID: <9cce1c9a0611031212i395e825ane09e95b816261e06@mail.gmail.com> In our test at the ORNL - it appears you can "turn off" the traffic by giving every VL weight 0. As soon as you assign non-zero VL weight, the traffic starts to flow, however, VL with more weight doesn't have expected preference treatment. In other words, traffic shaping didn't take place. smpquery vlarb verified the mapping table was there. I believe the scenario described below 'should' be able to generate congestion point ... but it would be helpful if someone can elaborate a way to "look into" how/if scheduling/arbitration take place. Best, Feiyi On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock wrote: > Hi Oliver, > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > Hi, Hal - > > > > > How is this being observed/measured ? > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > A single process of ibv_read_bw gives about 1415MB /s average > > bandwidth. Two concurrent process report 714.45 MB/s each, dead even. > > Now if I bump up one process with a different SL, then I expect to see > > shaping to take place. Please let me if the scenario makes sense. > > It makes sense. However, if the higher priority traffic does not fill > the scheduling, the low priority can take up the slack so I'm not sure > if this is what you are seeing or something else. > > It might be interesting to try the same thing at SDR speeds. > > -- Hal > > > > Yes, 8 VLs should be supported in your subnet. 
You can verify this with > > > smpquery portinfo on the HCA port and examine OperVLs assuming the port > > > is ACTIVE. > > > > yes, I verified the data VL support, it is 8. I will poke for more > > info with suggested commands by Sasha. > > > > > > A related question is, if I modify qos setting in SM, do I need to > > > > restart SA on each hosts for it to see the changes? (I am hoping not, > > > > as I tried in the test, it doesn't seem to make a difference) > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you > > > mean SA client ? The client hosts don't need restarting but did you > > > restart OpenSM with your QoS configuration ? > > > > I mean client SA. yes, I understand OpenSM needs to be restarted. > > > > > BTW, which OpenSM are you running ? > > > > OFED 1.1 based. > > > > thanks > > > > - Oliver > > From halr at voltaire.com Fri Nov 3 12:27:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 15:27:28 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <9cce1c9a0611031212i395e825ane09e95b816261e06@mail.gmail.com> References: <23e627a30611011352v68ac518eje9175fc3f4213839@mail.gmail.com> <1162475595.29957.109003.camel@hal.voltaire.com> <23e627a30611020720i4a268098h3bf1549621e16f0@mail.gmail.com> <1162482544.15232.585.camel@hal.voltaire.com> <9cce1c9a0611031212i395e825ane09e95b816261e06@mail.gmail.com> Message-ID: <1162585647.15232.70306.camel@hal.voltaire.com> On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > In our test at the ORNL - it appears you can "turn off" the traffic by > giving every VL weight 0. A weight of 0 indicates to skip that entry. > As soon as you assign non-zero VL weight, > the traffic starts to flow, however, VL with more weight doesn't have > expected preference treatment. In other words, traffic shaping didn't > take place. smpquery vlarb verified the mapping table was there. correctly ? Is it high or low priority or both ? What about SL2VLMapping table ? 
Is it setup correctly ? What's your topology for this ? Can you send your SL2VLMapping and VLarbitration configuration ? > I believe the scenario described below 'should' be able to generate > congestion point ... but it would be helpful if someone can elaborate > a way to "look into" how/if scheduling/arbitration take place. The only ways I know would be to look at either the packets on the wire or what you are doing with multiple streams which seems valid to me. Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to understand how to configure this ? -- Hal > Best, > > Feiyi > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock wrote: > > Hi Oliver, > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > Hi, Hal - > > > > > > > How is this being observed/measured ? > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > A single process of ibv_read_bw gives about 1415MB /s average > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead even. > > > Now if I bump up one process with a different SL, then I expect to see > > > shaping to take place. Please let me if the scenario makes sense. > > > > It makes sense. However, if the higher priority traffic does not fill > > the scheduling, the low priority can take up the slack so I'm not sure > > if this is what you are seeing or something else. > > > > It might be interesting to try the same thing at SDR speeds. > > > > -- Hal > > > > > > Yes, 8 VLs should be supported in your subnet. You can verify this with > > > > smpquery portinfo on the HCA port and examine OperVLs assuming the port > > > > is ACTIVE. > > > > > > yes, I verified the data VL support, it is 8. I will poke for more > > > info with suggested commands by Sasha. > > > > > > > > A related question is, if I modify qos setting in SM, do I need to > > > > > restart SA on each hosts for it to see the changes? 
(I am hoping not, > > > > > as I tried in the test, it doesn't seem to make a difference) > > > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you > > > > mean SA client ? The client hosts don't need restarting but did you > > > > restart OpenSM with your QoS configuration ? > > > > > > I mean client SA. yes, I understand OpenSM needs to be restarted. > > > > > > > BTW, which OpenSM are you running ? > > > > > > OFED 1.1 based. > > > > > > thanks > > > > > > - Oliver > > > >

From fwang2 at ornl.gov Fri Nov 3 12:43:14 2006 From: fwang2 at ornl.gov (Wang, Feiyi) Date: Fri, 03 Nov 2006 15:43:14 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <1162585647.15232.70306.camel@hal.voltaire.com> Message-ID: <537C6C0940C6C143AA46A88946B8541705109AE0@ORNLEXCHANGE.ornl.gov>

The test is done on two hosts, say A and B. A has 4x SDR (runs ib_rdma_bw as server); B has 4x DDR (runs more than one thread of ib_rdma_bw as clients). The sl2vl table reads as:

smpquery sl2vl 7
# SL2VL table: Lid 7
#                SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|

smpquery vlarb 7
# VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 |

Low priority table entries are all zero, so they are skipped. The high priority table gives VL 0 and VL 2 different weights.

The SL is specified on the command line: one thread with SL 0, the other thread with SL 2.

Thanks for looking into this, and let me know if more info is needed.
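One way to reason about what the high priority table above should produce is a toy weighted round-robin model. Everything here is an assumption made for illustration: that a weight counts 64-byte units, that a packet once started is sent whole, that both VLs always have traffic queued, and that flow control and the low priority table play no part. It is not a model of any particular switch or HCA, and IBA section 7.6.9.2 remains the authority on the actual rules.

```python
from fractions import Fraction

WEIGHT_UNIT_BYTES = 64  # assumed size of one weight unit


def bandwidth_shares(table, packet_bytes):
    """Share of bytes per VL after one pass over an arbitration table.

    table is a list of (vl, weight) entries; entries with weight 0 are skipped.
    Every VL is assumed to always have a packet of packet_bytes ready.
    """
    sent = {}
    for vl, weight in table:
        if weight == 0:
            continue
        credit = weight * WEIGHT_UNIT_BYTES
        # Whole packets are sent until the entry's credit is used up.
        while credit > 0:
            sent[vl] = sent.get(vl, 0) + packet_bytes
            credit -= packet_bytes
    total = sum(sent.values())
    return {vl: Fraction(nbytes, total) for vl, nbytes in sent.items()}


# High priority table copied from the smpquery vlarb output above.
HIGH_PRIORITY_TABLE = [(0, 0x1), (1, 0x0), (2, 0x8), (3, 0x0),
                       (4, 0x0), (5, 0x0), (6, 0x0), (7, 0x0)]

print(bandwidth_shares(HIGH_PRIORITY_TABLE, 64))
print(bandwidth_shares(HIGH_PRIORITY_TABLE, 2048))
```

In this model the 1:8 split only appears when packets are small relative to the weights; with 2048-byte packets each entry sends a single packet per turn and the two VLs share evenly. Whether that is what happens on the hardware in question is not something this sketch can establish.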
Feiyi -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Friday, November 03, 2006 3:27 PM To: Wang, Feiyi Cc: openib-general at openib.org Subject: Re: [openib-general] question on QoS support On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > In our test at the ORNL - it appears you can "turn off" the traffic by > giving every VL weight 0. A weight of 0 indicates to skip that entry. > As soon as you assign non-zero VL weight, > the traffic starts to flow, however, VL with more weight doesn't have > expected preference treatment. In other words, traffic shaping didn't > take place. smpquery vlarb verified the mapping table was there. correctly ? Is it high or low priority or both ? What about SL2VLMapping table ? Is it setup correctly ? What's your topology for this ? Can you send your SL2VLMapping and VLarbitration configuration ? > I believe the scenario described below 'should' be able to generate > congestion point ... but it would be helpful if someone can elaborate > a way to "look into" how/if scheduling/arbitration take place. The only ways I know would be to look at either the packets on the wire or what you are doing with multiple streams which seems valid to me. Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to understand how to configure this ? -- Hal > Best, > > Feiyi > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock wrote: > > Hi Oliver, > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > Hi, Hal - > > > > > > > How is this being observed/measured ? > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > A single process of ibv_read_bw gives about 1415MB /s average > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead even. > > > Now if I bump up one process with a different SL, then I expect to see > > > shaping to take place. Please let me if the scenario makes sense. > > > > It makes sense. 
However, if the higher priority traffic does not fill > > the scheduling, the low priority can take up the slack so I'm not sure > > if this is what you are seeing or something else. > > > > It might be interesting to try the same thing at SDR speeds. > > > > -- Hal > > > > > > Yes, 8 VLs should be supported in your subnet. You can verify this with > > > > smpquery portinfo on the HCA port and examine OperVLs assuming the port > > > > is ACTIVE. > > > > > > yes, I verified the data VL support, it is 8. I will poke for more > > > info with suggested commands by Sasha. > > > > > > > > A related question is, if I modify qos setting in SM, do I need to > > > > > restart SA on each hosts for it to see the changes? (I am hoping not, > > > > > as I tried in the test, it doesn't seem to make a difference) > > > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. Do you > > > > mean SA client ? The client hosts don't need restarting but did you > > > > restart OpenSM with your QoS configuration ? > > > > > > I mean client SA. yes, I understand OpenSM needs to be restarted. > > > > > > > BTW, which OpenSM are you running ? > > > > > > OFED 1.1 based. > > > > > > thanks > > > > > > - Oliver > > > > From halr at voltaire.com Fri Nov 3 12:50:57 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 15:50:57 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <537C6C0940C6C143AA46A88946B8541705109AE0@ORNLEXCHANGE.ornl.gov> References: <537C6C0940C6C143AA46A88946B8541705109AE0@ORNLEXCHANGE.ornl.gov> Message-ID: <1162587056.15232.71190.camel@hal.voltaire.com> On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > The test is done on two hosts, say A and B. A has 4x SDR (run ib_rdam_bw > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > clients). 
The sl2vl table read as: > > smpquery sl2vl 7 > # SL2VL table: Lid 7 > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15| > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7| > > smpquery vlarb 7 > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > # Low priority VL Arbitration Table: > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > # High priority VL Arbitration Table: > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > Low priority table entries are all zero to skip. > High priority table give VL 0 and VL 2 different weight. > > The SL is specified on command line, one thread with SL 0, the other > thread with SL 2. > > Thanks for looking into this, and let me know if more info is needed. What's the limit of high priority ? -- Hal > Feiyi > > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, November 03, 2006 3:27 PM > To: Wang, Feiyi > Cc: openib-general at openib.org > Subject: Re: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > > In our test at the ORNL - it appears you can "turn off" the traffic by > > giving every VL weight 0. > > A weight of 0 indicates to skip that entry. > > > As soon as you assign non-zero VL weight, > > the traffic starts to flow, however, VL with more weight doesn't have > > expected preference treatment. In other words, traffic shaping didn't > > take place. smpquery vlarb verified the mapping table was there. > > correctly ? > > Is it high or low priority or both ? > > What about SL2VLMapping table ? Is it setup correctly ? > > What's your topology for this ? > > Can you send your SL2VLMapping and VLarbitration configuration ? > > > I believe the scenario described below 'should' be able to generate > > congestion point ... 
but it would be helpful if someone can elaborate > > a way to "look into" how/if scheduling/arbitration take place. > > The only ways I know would be to look at either the packets on the wire > or what you are doing with multiple streams which seems valid to me. > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > understand how to configure this ? > > -- Hal > > > Best, > > > > Feiyi > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock > wrote: > > > Hi Oliver, > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > Hi, Hal - > > > > > > > > > How is this being observed/measured ? > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead > even. > > > > Now if I bump up one process with a different SL, then I expect to > see > > > > shaping to take place. Please let me if the scenario makes sense. > > > > > > It makes sense. However, if the higher priority traffic does not > fill > > > the scheduling, the low priority can take up the slack so I'm not > sure > > > if this is what you are seeing or something else. > > > > > > It might be interesting to try the same thing at SDR speeds. > > > > > > -- Hal > > > > > > > > Yes, 8 VLs should be supported in your subnet. You can verify > this with > > > > > smpquery portinfo on the HCA port and examine OperVLs assuming > the port > > > > > is ACTIVE. > > > > > > > > yes, I verified the data VL support, it is 8. I will poke for more > > > > info with suggested commands by Sasha. > > > > > > > > > > A related question is, if I modify qos setting in SM, do I > need to > > > > > > restart SA on each hosts for it to see the changes? (I am > hoping not, > > > > > > as I tried in the test, it doesn't seem to make a difference) > > > > > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. 
> Do you > > > > > mean SA client ? The client hosts don't need restarting but did > you > > > > > restart OpenSM with your QoS configuration ? > > > > > > > > I mean client SA. yes, I understand OpenSM needs to be restarted. > > > > > > > > > BTW, which OpenSM are you running ? > > > > > > > > OFED 1.1 based. > > > > > > > > thanks > > > > > > > > - Oliver > > > > > > > From fwang2 at ornl.gov Fri Nov 3 12:56:08 2006 From: fwang2 at ornl.gov (Wang, Feiyi) Date: Fri, 03 Nov 2006 15:56:08 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <1162587056.15232.71190.camel@hal.voltaire.com> Message-ID: <537C6C0940C6C143AA46A88946B8541705109B08@ORNLEXCHANGE.ornl.gov> 255 I think I tested with default 0 before, that is send at most one packet before give low priority table the chance according to IBA. It doesn't seem to make a difference though. Feiyi -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Friday, November 03, 2006 3:51 PM To: Wang, Feiyi Cc: openib-general at openib.org Subject: RE: [openib-general] question on QoS support On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > The test is done on two hosts, say A and B. A has 4x SDR (run ib_rdam_bw > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > clients). The sl2vl table read as: > > smpquery sl2vl 7 > # SL2VL table: Lid 7 > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15| > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7| > > smpquery vlarb 7 > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > # Low priority VL Arbitration Table: > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > # High priority VL Arbitration Table: > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > Low priority table entries are all zero to skip. > High priority table give VL 0 and VL 2 different weight. 
> > The SL is specified on command line, one thread with SL 0, the other > thread with SL 2. > > Thanks for looking into this, and let me know if more info is needed. What's the limit of high priority ? -- Hal > Feiyi > > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, November 03, 2006 3:27 PM > To: Wang, Feiyi > Cc: openib-general at openib.org > Subject: Re: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > > In our test at the ORNL - it appears you can "turn off" the traffic by > > giving every VL weight 0. > > A weight of 0 indicates to skip that entry. > > > As soon as you assign non-zero VL weight, > > the traffic starts to flow, however, VL with more weight doesn't have > > expected preference treatment. In other words, traffic shaping didn't > > take place. smpquery vlarb verified the mapping table was there. > > correctly ? > > Is it high or low priority or both ? > > What about SL2VLMapping table ? Is it setup correctly ? > > What's your topology for this ? > > Can you send your SL2VLMapping and VLarbitration configuration ? > > > I believe the scenario described below 'should' be able to generate > > congestion point ... but it would be helpful if someone can elaborate > > a way to "look into" how/if scheduling/arbitration take place. > > The only ways I know would be to look at either the packets on the wire > or what you are doing with multiple streams which seems valid to me. > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > understand how to configure this ? > > -- Hal > > > Best, > > > > Feiyi > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock > wrote: > > > Hi Oliver, > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > Hi, Hal - > > > > > > > > > How is this being observed/measured ? > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. 
> > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead > even. > > > > Now if I bump up one process with a different SL, then I expect to > see > > > > shaping to take place. Please let me if the scenario makes sense. > > > > > > It makes sense. However, if the higher priority traffic does not > fill > > > the scheduling, the low priority can take up the slack so I'm not > sure > > > if this is what you are seeing or something else. > > > > > > It might be interesting to try the same thing at SDR speeds. > > > > > > -- Hal > > > > > > > > Yes, 8 VLs should be supported in your subnet. You can verify > this with > > > > > smpquery portinfo on the HCA port and examine OperVLs assuming > the port > > > > > is ACTIVE. > > > > > > > > yes, I verified the data VL support, it is 8. I will poke for more > > > > info with suggested commands by Sasha. > > > > > > > > > > A related question is, if I modify qos setting in SM, do I > need to > > > > > > restart SA on each hosts for it to see the changes? (I am > hoping not, > > > > > > as I tried in the test, it doesn't seem to make a difference) > > > > > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. > Do you > > > > > mean SA client ? The client hosts don't need restarting but did > you > > > > > restart OpenSM with your QoS configuration ? > > > > > > > > I mean client SA. yes, I understand OpenSM needs to be restarted. > > > > > > > > > BTW, which OpenSM are you running ? > > > > > > > > OFED 1.1 based. 
> > > > > > > > thanks > > > > > > > > - Oliver > > > > > > > From halr at voltaire.com Fri Nov 3 12:57:43 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 03 Nov 2006 15:57:43 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <537C6C0940C6C143AA46A88946B8541705109B08@ORNLEXCHANGE.ornl.gov> References: <537C6C0940C6C143AA46A88946B8541705109B08@ORNLEXCHANGE.ornl.gov> Message-ID: <1162587457.15232.71447.camel@hal.voltaire.com> On Fri, 2006-11-03 at 15:56, Wang, Feiyi wrote: > 255 > > I think I tested with default 0 before, that is send at most one packet > before give low priority table the chance according to IBA. It doesn't > seem to make a difference though. I was hoping you would say 0 as that means 1 packet before looking at low priority. 255 means unbounded packets on high priority. Can you send me the results of smpquery portinfo on that port to ensure that it is being set properly ? -- Hal > Feiyi > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, November 03, 2006 3:51 PM > To: Wang, Feiyi > Cc: openib-general at openib.org > Subject: RE: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > > The test is done on two hosts, say A and B. A has 4x SDR (run > ib_rdam_bw > > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > > clients). 
The sl2vl table read as: > > > > smpquery sl2vl 7 > > # SL2VL table: Lid 7 > > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15| > > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| > 7| > > > > smpquery vlarb 7 > > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > > # Low priority VL Arbitration Table: > > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > > # High priority VL Arbitration Table: > > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > > Low priority table entries are all zero to skip. > > High priority table give VL 0 and VL 2 different weight. > > > > The SL is specified on command line, one thread with SL 0, the other > > thread with SL 2. > > > > Thanks for looking into this, and let me know if more info is needed. > > What's the limit of high priority ? > > -- Hal > > > Feiyi > > > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Friday, November 03, 2006 3:27 PM > > To: Wang, Feiyi > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] question on QoS support > > > > On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > > > In our test at the ORNL - it appears you can "turn off" the traffic > by > > > giving every VL weight 0. > > > > A weight of 0 indicates to skip that entry. > > > > > As soon as you assign non-zero VL weight, > > > the traffic starts to flow, however, VL with more weight doesn't > have > > > expected preference treatment. In other words, traffic shaping > didn't > > > take place. smpquery vlarb verified the mapping table was there. > > > > correctly ? > > > > Is it high or low priority or both ? > > > > What about SL2VLMapping table ? Is it setup correctly ? > > > > What's your topology for this ? > > > > Can you send your SL2VLMapping and VLarbitration configuration ? 
> > > > > I believe the scenario described below 'should' be able to generate > > > congestion point ... but it would be helpful if someone can > elaborate > > > a way to "look into" how/if scheduling/arbitration take place. > > > > The only ways I know would be to look at either the packets on the > wire > > or what you are doing with multiple streams which seems valid to me. > > > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > > understand how to configure this ? > > > > -- Hal > > > > > Best, > > > > > > Feiyi > > > > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock > > wrote: > > > > Hi Oliver, > > > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > > Hi, Hal - > > > > > > > > > > > How is this being observed/measured ? > > > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead > > even. > > > > > Now if I bump up one process with a different SL, then I expect > to > > see > > > > > shaping to take place. Please let me if the scenario makes > sense. > > > > > > > > It makes sense. However, if the higher priority traffic does not > > fill > > > > the scheduling, the low priority can take up the slack so I'm not > > sure > > > > if this is what you are seeing or something else. > > > > > > > > It might be interesting to try the same thing at SDR speeds. > > > > > > > > -- Hal > > > > > > > > > > Yes, 8 VLs should be supported in your subnet. You can verify > > this with > > > > > > smpquery portinfo on the HCA port and examine OperVLs assuming > > the port > > > > > > is ACTIVE. > > > > > > > > > > yes, I verified the data VL support, it is 8. I will poke for > more > > > > > info with suggested commands by Sasha. 
> > > > > > > > > > > > A related question is, if I modify qos setting in SM, do I > > need to > > > > > > > restart SA on each hosts for it to see the changes? (I am > > hoping not, > > > > > > > as I tried in the test, it doesn't seem to make a > difference) > > > > > > > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. > > Do you > > > > > > mean SA client ? The client hosts don't need restarting but > did > > you > > > > > > restart OpenSM with your QoS configuration ? > > > > > > > > > > I mean client SA. yes, I understand OpenSM needs to be > restarted. > > > > > > > > > > > BTW, which OpenSM are you running ? > > > > > > > > > > OFED 1.1 based. > > > > > > > > > > thanks > > > > > > > > > > - Oliver > > > > > > > > > > > From akepner at sgi.com Fri Nov 3 15:37:36 2006 From: akepner at sgi.com (akepner at sgi.com) Date: Fri, 3 Nov 2006 15:37:36 -0800 (PST) Subject: [openib-general] static ARP entries for IPoIB? Message-ID: I'd like to create static ARP entries for some IPoIB devices. The arp (8) command that I'm using doesn't know about ARPHRD_INFINIBAND (this is arp version 1.88 from net-tools-1.60-583.4.src.rpm, in SLES10.) Is there a version of arp (8) that works with IB? Or some other utility or means to make static ARP entries for IPoIB? -- Arthur From rdreier at cisco.com Fri Nov 3 16:13:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 03 Nov 2006 16:13:56 -0800 Subject: [openib-general] static ARP entries for IPoIB? In-Reply-To: ( akepner@sgi.com's message of "Fri, 3 Nov 2006 15:37:36 -0800 (PST)") References: Message-ID: > I'd like to create static ARP entries for some IPoIB > devices. The arp (8) command that I'm using doesn't > know about ARPHRD_INFINIBAND (this is arp version 1.88 > from net-tools-1.60-583.4.src.rpm, in SLES10.) Is there > a version of arp (8) that works with IB? Or some other > utility or means to make static ARP entries for IPoIB? "ip neigh add ..." - R. 
From bhartner at us.ibm.com Sat Nov 4 06:57:26 2006 From: bhartner at us.ibm.com (Bill Hartner) Date: Sat, 4 Nov 2006 08:57:26 -0600 Subject: [openib-general] cma_tavor_quirk.patch Message-ID: I downloaded OFED 1.1 and tried using tavor_quirk to improve uDAPL performance. I thought tavor_quirk would set the MTU to 1024, but the patch looks like this: # cat kernel_patches/fixes/cma_tavor_quirk.patch [snip] --- linux-2.6.18-rc2-devel.orig/drivers/infiniband/core/cma.c +++ linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c [snip] + if (tavor_quirk) { + path_rec.mtu_selector = IB_SA_LT; + path_rec.mtu = IB_MTU_2048; + } + [snip] Is this correct, or should MTU be set to 1024 when tavor_quirk is used? http://openib.org/pipermail/openib-general/2006-September/026097.html -Bill From rdreier at cisco.com Sat Nov 4 07:15:54 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sat, 04 Nov 2006 07:15:54 -0800 Subject: [openib-general] cma_tavor_quirk.patch In-Reply-To: (Bill Hartner's message of "Sat, 4 Nov 2006 08:57:26 -0600") References: Message-ID: > + if (tavor_quirk) { > + path_rec.mtu_selector = IB_SA_LT; > + path_rec.mtu = IB_MTU_2048; > + } > + > [snip] > > Is this correct, or should MTU be set to 1024 when tavor_quirk is used? The quirk _is_ basically setting the MTU to 1024 (it's asking for a path with an MTU less than 2048). It is just being careful to work in a network where the MTU is already less than 1024. - R.
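The selector semantics described in the reply above can be sketched in plain C. This is a minimal userspace sketch, not kernel code: the names sketch_mtu, sketch_selector, mtu_acceptable and largest_acceptable_mtu are made up for illustration, the MTU codes follow the IBA encoding (1 = 256 bytes up to 5 = 4096 bytes), and the sketch assumes the SA hands back the largest MTU that satisfies the selector, which the thread itself does not state.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel enums; values follow the IBA MTU encoding. */
enum sketch_mtu {
	MTU_256 = 1, MTU_512 = 2, MTU_1024 = 3, MTU_2048 = 4, MTU_4096 = 5
};

/* PathRecord selector: 0 = greater than, 1 = less than, 2 = exactly. */
enum sketch_selector { SEL_GT = 0, SEL_LT = 1, SEL_EQ = 2 };

/* Does a path with MTU `path` satisfy the (selector, wanted) pair? */
static bool mtu_acceptable(enum sketch_selector sel, enum sketch_mtu wanted,
			   enum sketch_mtu path)
{
	switch (sel) {
	case SEL_GT: return path > wanted;
	case SEL_LT: return path < wanted;
	case SEL_EQ: return path == wanted;
	}
	return false;
}

/*
 * Assumption: the largest acceptable MTU that the fabric supports is
 * chosen. Returns 0 when no MTU up to fabric_max qualifies.
 */
static int largest_acceptable_mtu(enum sketch_selector sel,
				  enum sketch_mtu wanted,
				  enum sketch_mtu fabric_max)
{
	for (int m = fabric_max; m >= MTU_256; m--)
		if (mtu_acceptable(sel, wanted, (enum sketch_mtu)m))
			return m;
	return 0;
}
```

Under these assumptions, "less than 2048" on a fabric that supports 2048 yields 1024, and on a fabric limited to 512 the same query still matches a 512 path, whereas an "exactly 1024" query would match nothing there.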
From bhartner at us.ibm.com Sat Nov 4 08:13:48 2006 From: bhartner at us.ibm.com (Bill Hartner) Date: Sat, 4 Nov 2006 10:13:48 -0600 Subject: [openib-general] cma_tavor_quirk.patch In-Reply-To: Message-ID: Roland Dreier wrote on 11/04/2006 09:15:54 AM: > > + if (tavor_quirk) { > > + path_rec.mtu_selector = IB_SA_LT; > > + path_rec.mtu = IB_MTU_2048; > > + } > > + > > [snip] > > > > Is this correct, or should MTU be set to 1024 when tavor_quirk is used? > > The quirk _is_ basically setting the MTU to 1024 (it's asking for a > path with an MTU less than 2048). It is just being careful to work in > a network where the MTU is already less than 1024. > - R. Thanks, I overlooked IB_SA_LT. I will need to check the config again since I was not able to get the expected performance improvement. -Bill From eitan at mellanox.co.il Sat Nov 4 22:27:28 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 05 Nov 2006 08:27:28 +0200 Subject: [openib-general] git: cloning partial tree Message-ID: <454D8450.7040508@mellanox.co.il> A newbie question: What is the easiest/simplest way for me to clone only the infiniband part of the kernel tree? Thanks Eitan From mst at mellanox.co.il Sat Nov 4 22:43:41 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 5 Nov 2006 08:43:41 +0200 Subject: [openib-general] git: cloning partial tree In-Reply-To: <454D8450.7040508@mellanox.co.il> References: <454D8450.7040508@mellanox.co.il> Message-ID: <20061105064341.GA12241@mellanox.co.il> Quoting r. Eitan Zahavi : > Subject: git: cloning partial tree > > A newbie question: > What is the easiest/simplest way for me to clone only the infiniband > part of the kernel tree? > > Thanks > > Eitan You can't do that with git.
But you can check out just the infiniband part of the tree: git checkout master `git-ls-tree -r --name-only master \ include/rdma include/scsi/srp.h drivers/infiniband \ Documentation/infiniband ofed_scripts kernel_patches` echo 'ref: refs/heads/master' > .git/HEAD -- MST From eitan at mellanox.co.il Sat Nov 4 23:42:09 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 05 Nov 2006 09:42:09 +0200 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: <1162506776.29948.572.camel@brick.pathscale.com> References: <1162506776.29948.572.camel@brick.pathscale.com> Message-ID: <454D95D1.9050207@mellanox.co.il> Hi Ralph, Is there any performance penalty for using the IB version of the DMA mapping functions? Thanks Eitan Ralph Campbell wrote: > IB/ipoib - Use the new verbs DMA mapping functions > > This patch converts IPoIB to use the new DMA mapping functions > for kernel verbs consumers. > > From: Ralph Campbell > > diff -r f37bd0e41fec drivers/infiniband/ulp/ipoib/ipoib_ib.c > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 21:44:41 2006 +0700 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 12:37:09 2006 -0800 > @@ -109,9 +109,8 @@ static int ipoib_ib_post_receive(struct > ret = ib_post_recv(priv->qp, &param, &bad_wr); > if (unlikely(ret)) { > ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); > - dma_unmap_single(priv->ca->dma_device, > - priv->rx_ring[id].mapping, > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > + ib_dma_unmap_single(priv->ca, priv->rx_ring[id].mapping, > + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > dev_kfree_skb_any(priv->rx_ring[id].skb); > priv->rx_ring[id].skb = NULL; > } > @@ -136,10 +135,9 @@ static int ipoib_alloc_rx_skb(struct net > */ > skb_reserve(skb, 4); > > - addr = dma_map_single(priv->ca->dma_device, > - skb->data, IPOIB_BUF_SIZE, > - DMA_FROM_DEVICE); > - if (unlikely(dma_mapping_error(addr))) { > + addr = ib_dma_map_single(priv->ca, skb->data, IPOIB_BUF_SIZE, > + 
DMA_FROM_DEVICE); > + if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { > dev_kfree_skb_any(skb); > return -EIO; > } > @@ -193,8 +191,8 @@ static void ipoib_ib_handle_rx_wc(struct > ipoib_warn(priv, "failed recv event " > "(status=%d, wrid=%d vend_err %x)\n", > wc->status, wr_id, wc->vendor_err); > - dma_unmap_single(priv->ca->dma_device, addr, > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > + ib_dma_unmap_single(priv->ca, addr, > + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > dev_kfree_skb_any(skb); > priv->rx_ring[wr_id].skb = NULL; > return; > @@ -212,8 +210,7 @@ static void ipoib_ib_handle_rx_wc(struct > ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > wc->byte_len, wc->slid); > > - dma_unmap_single(priv->ca->dma_device, addr, > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > + ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > skb_put(skb, wc->byte_len); > skb_pull(skb, IB_GRH_BYTES); > @@ -261,10 +258,8 @@ static void ipoib_ib_handle_tx_wc(struct > > tx_req = &priv->tx_ring[wr_id]; > > - dma_unmap_single(priv->ca->dma_device, > - pci_unmap_addr(tx_req, mapping), > - tx_req->skb->len, > - DMA_TO_DEVICE); > + ib_dma_unmap_single(priv->ca, pci_unmap_addr(tx_req, mapping), > + tx_req->skb->len, DMA_TO_DEVICE); > > ++priv->stats.tx_packets; > priv->stats.tx_bytes += tx_req->skb->len; > @@ -353,9 +348,9 @@ void ipoib_send(struct net_device *dev, > */ > tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; > tx_req->skb = skb; > - addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, > - DMA_TO_DEVICE); > - if (unlikely(dma_mapping_error(addr))) { > + addr = ib_dma_map_single(priv->ca, skb->data, skb->len, > + DMA_TO_DEVICE); > + if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { > ++priv->stats.tx_errors; > dev_kfree_skb_any(skb); > return; > @@ -366,8 +361,7 @@ void ipoib_send(struct net_device *dev, > address->ah, qpn, addr, skb->len))) { > ipoib_warn(priv, "post_send failed\n"); > ++priv->stats.tx_errors; > - 
dma_unmap_single(priv->ca->dma_device, addr, skb->len, > - DMA_TO_DEVICE); > + ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); > dev_kfree_skb_any(skb); > } else { > dev->trans_start = jiffies; > @@ -537,24 +531,28 @@ int ipoib_ib_dev_stop(struct net_device > while ((int) priv->tx_tail - (int) priv->tx_head < 0) { > tx_req = &priv->tx_ring[priv->tx_tail & > (ipoib_sendq_size - 1)]; > - dma_unmap_single(priv->ca->dma_device, > - pci_unmap_addr(tx_req, mapping), > - tx_req->skb->len, > - DMA_TO_DEVICE); > + ib_dma_unmap_single(priv->ca, > + pci_unmap_addr(tx_req, > + mapping), > + tx_req->skb->len, > + DMA_TO_DEVICE); > dev_kfree_skb_any(tx_req->skb); > ++priv->tx_tail; > } > > - for (i = 0; i < ipoib_recvq_size; ++i) > - if (priv->rx_ring[i].skb) { > - dma_unmap_single(priv->ca->dma_device, > - pci_unmap_addr(&priv->rx_ring[i], > - mapping), > - IPOIB_BUF_SIZE, > - DMA_FROM_DEVICE); > - dev_kfree_skb_any(priv->rx_ring[i].skb); > - priv->rx_ring[i].skb = NULL; > - } > + for (i = 0; i < ipoib_recvq_size; ++i) { > + struct ipoib_rx_buf *rx_req; > + > + rx_req = &priv->rx_ring[i]; > + if (!rx_req->skb) > + continue; > + ib_dma_unmap_single(priv->ca, > + rx_req->mapping, > + IPOIB_BUF_SIZE, > + DMA_FROM_DEVICE); > + dev_kfree_skb_any(rx_req->skb); > + rx_req->skb = NULL; > + } > > goto timeout; > } > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sat Nov 4 23:46:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sun, 05 Nov 2006 09:46:26 +0200 Subject: [openib-general] git: cloning partial tree In-Reply-To: <20061105064341.GA12241@mellanox.co.il> References: <454D8450.7040508@mellanox.co.il> <20061105064341.GA12241@mellanox.co.il> Message-ID: <454D96D2.8090100@mellanox.co.il> Michael S. Tsirkin wrote: > Quoting r. 
Eitan Zahavi : > >> Subject: git: cloning partial tree >> >> A newbie question: >> What is the easiest/simplest way for me to clone only the infiniband >> part of the kernel tree? >> >> Thanks >> >> Eitan >> > > You can't do that with git. > Are there other SVN features not available under git ? > But you can check out just the infiniband part of the tree: > git checkout master `git-ls-tree -r --name-only master \ > include/rdma include/scsi/srp.h drivers/infiniband \ > Documentation/infiniband ofed_scripts kernel_patches` > echo 'ref: refs/heads/master' > .git/HEAD > > > Thanks, but I have tried this and got back: fatal: Not a git repository fatal: Not a git repository From mst at mellanox.co.il Sun Nov 5 00:08:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 5 Nov 2006 10:08:43 +0200 Subject: [openib-general] git: cloning partial tree In-Reply-To: <454D96D2.8090100@mellanox.co.il> References: <454D96D2.8090100@mellanox.co.il> Message-ID: <20061105080843.GA12706@mellanox.co.il> Quoting r. Eitan Zahavi : > > But you can check out just the infiniband part of the tree: > > git checkout master `git-ls-tree -r --name-only master \ > > include/rdma include/scsi/srp.h drivers/infiniband \ > > Documentation/infiniband ofed_scripts kernel_patches` > > echo 'ref: refs/heads/master' > .git/HEAD > > > > > > > Thanks, but I have tried this and got back: > fatal: Not a git repository > fatal: Not a git repository You should clone with the -n (don't checkout) flag first. On a shared filesystem, you can also use -s to save disk space and make the clone go faster.
-- MST From ogerlitz at voltaire.com Sun Nov 5 05:05:07 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 05 Nov 2006 15:05:07 +0200 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1162506570.29948.567.camel@brick.pathscale.com> <20061102231404.GC15403@mellanox.co.il> Message-ID: <454DE183.80405@voltaire.com> Roland Dreier wrote: > Also handling highmem is a problem. ipath just depends on 64BIT so it > avoids the problem. I guess mthca could only return a kernel virtual > address if one exists, and always use DMA for highmem pages. So that > isn't really a serious objection. Roland, I have mentioned this to Ralph in the past, just want to get ack/nak on that from you: also on 64bit arch a block driver (eg SCSI LLD eg SRP/iSER/etc) might get from higher level an SG whose pages are **not** mapped into the kernel virtual address space. For example this can happen with Direct I/O. Or. From tziporet at dev.mellanox.co.il Sun Nov 5 05:20:42 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 05 Nov 2006 15:20:42 +0200 Subject: [openib-general] Kernel.org kernel In-Reply-To: References: Message-ID: <454DE52A.1060507@dev.mellanox.co.il> Roland Dreier wrote: > If you are able to build your own kernel and like to use up-to-date > kernels, I would recommend just using the drivers that are in the > mainline kernel and not worrying about OFED. > > However if you wish to use SDP you need to work with OFED. Tziporet From vlad at dev.mellanox.co.il Sun Nov 5 05:26:34 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 05 Nov 2006 15:26:34 +0200 Subject: [openib-general] Error inserting ib_umad In-Reply-To: <1162418001.5609.17.camel@julia.et.endace.com> References: <1162418001.5609.17.camel@julia.et.endace.com> Message-ID: <454DE68A.4010605@dev.mellanox.co.il> Hi Vishal, Try to run '/etc/init.d/openibd restart' or reboot this server . 
I guess, that you have IB modules which came with SUSE 10.1 loaded. Regards, Vladimir vishal wrote: > Hi, > > I have installed OFED-1.1 rc7. Got the following error on trying > 'modprobe ib_umad':- > > FATAL: Error inserting ib_umad > (/lib/modules/2.6.16.13-4-smp/kernel/drivers/infiniband/core/ib_umad.ko): Unknown symbol in module, or unknown parameter (see dmesg) > > dmesg output:- > > ib_umad: module not supported by Novell, setting U taint flag. > ib_umad: disagrees about version of symbol ib_unregister_client > ib_umad: Unknown symbol ib_unregister_client > ib_umad: Unknown symbol ib_get_mad_data_offset > ib_umad: disagrees about version of symbol ib_modify_port > ib_umad: Unknown symbol ib_modify_port > ib_umad: disagrees about version of symbol ib_create_ah > ib_umad: Unknown symbol ib_create_ah > ib_umad: disagrees about version of symbol ib_register_client > ib_umad: Unknown symbol ib_register_client > ib_umad: disagrees about version of symbol ib_unregister_mad_agent > ib_umad: Unknown symbol ib_unregister_mad_agent > ib_umad: Unknown symbol ib_response_mad > ib_umad: disagrees about version of symbol ib_post_send_mad > ib_umad: Unknown symbol ib_post_send_mad > ib_umad: disagrees about version of symbol ib_create_send_mad > ib_umad: Unknown symbol ib_create_send_mad > ib_umad: disagrees about version of symbol ib_set_client_data > ib_umad: Unknown symbol ib_set_client_data > ib_umad: disagrees about version of symbol ib_get_client_data > ib_umad: Unknown symbol ib_get_client_data > ib_umad: Unknown symbol ib_is_mad_class_rmpp > ib_umad: disagrees about version of symbol ib_free_send_mad > ib_umad: Unknown symbol ib_free_send_mad > ib_umad: disagrees about version of symbol ib_destroy_ah > ib_umad: Unknown symbol ib_destroy_ah > ib_umad: Unknown symbol ib_get_rmpp_segment > ib_umad: disagrees about version of symbol ib_register_mad_agent > ib_umad: Unknown symbol ib_register_mad_agent > > > I am using SUSE 10.1 Enterprise x86_64. > > Thanks! 
> > Vishal > > From michael.arndt at informatik.tu-chemnitz.de Sun Nov 5 07:12:26 2006 From: michael.arndt at informatik.tu-chemnitz.de (Michael Arndt) Date: Sun, 5 Nov 2006 16:12:26 +0100 Subject: [openib-general] osm_opensm_t Struct Message-ID: <000301c700ec$cd6ccd10$21606d86@one7> Hi, the osm_opensm_t struct is carrying a pointer named "p_updn_ucast_routing". What's the use of this pointer? Thanks Michael From mst at mellanox.co.il Sun Nov 5 07:22:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 5 Nov 2006 17:22:44 +0200 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: <1162506776.29948.572.camel@brick.pathscale.com> References: <1162506776.29948.572.camel@brick.pathscale.com> Message-ID: <20061105152244.GC14245@mellanox.co.il> Quoting r. Ralph Campbell : > diff -r f37bd0e41fec drivers/infiniband/ulp/ipoib/ipoib_ib.c > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 21:44:41 2006 +0700 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 12:37:09 2006 -0800 > @@ -109,9 +109,8 @@ static int ipoib_ib_post_receive(struct > ret = ib_post_recv(priv->qp, &param, &bad_wr); > if (unlikely(ret)) { > ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); > - dma_unmap_single(priv->ca->dma_device, > - priv->rx_ring[id].mapping, > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > + ib_dma_unmap_single(priv->ca, priv->rx_ring[id].mapping, > + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > dev_kfree_skb_any(priv->rx_ring[id].skb); > priv->rx_ring[id].skb = NULL; > } Hmm, since ib_dma_unmap_single calls a function through a pointer, this seems to introduce overhead on data path operations in ipoib.
For ULPs like ipoib that always work with low memory, I think it is important to avoid the overhead of extra indirect function calls, at least on systems without an IO MMU, where e.g. dma_unmap_single is empty. This probably means you need some architecture-dependent code, but it should be possible: look at how the DMA API is implemented for an example. And this applies to all ULPs on systems without high memory. -- MST From akepner at sgi.com Sun Nov 5 09:22:25 2006 From: akepner at sgi.com (akepner at sgi.com) Date: Sun, 5 Nov 2006 09:22:25 -0800 (PST) Subject: [openib-general] static ARP entries for IPoIB? In-Reply-To: References: Message-ID: On Fri, 3 Nov 2006, Roland Dreier wrote: > I wrote: > > > I'd like to create static ARP entries for some IPoIB > > devices. The arp (8) command that I'm using doesn't > > know about ARPHRD_INFINIBAND (this is arp version 1.88 > > from net-tools-1.60-583.4.src.rpm, in SLES10.) Is there > > a version of arp (8) that works with IB? Or some other > > utility or means to make static ARP entries for IPoIB? > > "ip neigh add ..." > Thanks. I had to patch it with: http://openib.org/pipermail/openib-general/2006-March/018270.html but now it does the trick! -- Arthur From sashak at voltaire.com Sun Nov 5 10:44:46 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 5 Nov 2006 20:44:46 +0200 Subject: [openib-general] osm_opensm_t Struct In-Reply-To: <000301c700ec$cd6ccd10$21606d86@one7> References: <000301c700ec$cd6ccd10$21606d86@one7> Message-ID: <20061105184446.GB26615@sashak.voltaire.com> On 16:12 Sun 05 Nov , Michael Arndt wrote: > Hi, > > the osm_opensm_t struct is carrying a pointer named "p_updn_ucast_routing". > What's the use of this pointer? There is no such pointer in recent versions. It was once used to keep the address of the up/down routing engine, but was replaced by the more "generic" routing_engine struct.
Sasha From bunk at stusta.de Sat Nov 4 22:48:01 2006 From: bunk at stusta.de (Adrian Bunk) Date: Sun, 5 Nov 2006 07:48:01 +0100 Subject: [openib-general] 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: Message-ID: <20061105064801.GV13381@stusta.de> This email lists some known regressions in 2.6.19-rc4 compared to 2.6.18 that are not yet fixed in Linus' tree. If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affectected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues. Due to the huge amount of recipients, please trim the Cc when answering. Subject : ipath driver MCEs system on load when HT chip present References : http://bugzilla.kernel.org/show_bug.cgi?id=7455 Submitter : Bryan O'Sullivan Caused-By : Eric W. Biederman Status : unknown Subject : i386: more DWARFs and strange messages References : http://lkml.org/lkml/2006/10/29/127 Submitter : Martin Lorenz Status : unknown Subject : x86_64: oprofile doesn't work References : http://lkml.org/lkml/2006/10/27/3 Submitter : Prakash Punnoor Status : unknown Subject : sata-via doesn't detect anymore disks attached to VIA vt6421 References : http://bugzilla.kernel.org/show_bug.cgi?id=7255 Submitter : Thierry Vignaud Status : unknown Subject : unable to rip cd References : http://lkml.org/lkml/2006/10/13/100 Submitter : Alex Romosan Status : unknown Subject : SMP kernel can not generate ISA irq properly References : http://lkml.org/lkml/2006/10/22/15 Submitter : Komuro Status : unknown Subject : nfs-kernel-server does not start References : http://bugzilla.kernel.org/show_bug.cgi?id=7457 Submitter : Elimar Riesebieter Status : problem is being debugged Subject : ThinkPad T60/X60: lose ACPI events after suspend/resume References : http://lkml.org/lkml/2006/10/10/39 http://lkml.org/lkml/2006/10/4/425 http://lkml.org/lkml/2006/10/16/262 
http://bugzilla.kernel.org/show_bug.cgi?id=7408 http://lkml.org/lkml/2006/10/30/251 http://lkml.org/lkml/2006/11/3/244 Submitter : Martin Lorenz "Michael S. Tsirkin" Status : problem is being debugged Subject : ThinkPad R50p: boot fail with (lapic && on_battery) References : http://lkml.org/lkml/2006/10/31/333 Submitter : Ernst Herzberg Status : problem is being debugged Subject : x86_64: NR_IRQ increase causes 11.5% slowdown in lmbench's fork benchmark References : http://lkml.org/lkml/2006/11/2/192 Submitter : Tim Chen Caused-By : Eric W. Biederman commit 550f2299ac8ffaba943cf211380d3a8d3fa75301 Handled-By : Eric W. Biederman Andi Kleen Status : problem is being debugged Subject : cpufreq not working on AMD K8 References : http://lkml.org/lkml/2006/10/10/114 http://lkml.org/lkml/2006/11/3/26 Submitter : Christian Handled-By : Dave Jones Status : Dave is investigating Subject : PCI: MMCONFIG breakage References : http://lkml.org/lkml/2006/10/27/251 Submitter : Jeff Chua Status : unknown, both BIOS and Direct work Greg said it should already be fixed From jeff.chua.linux at gmail.com Sun Nov 5 01:30:27 2006 From: jeff.chua.linux at gmail.com (Jeff Chua) Date: Sun, 5 Nov 2006 17:30:27 +0800 Subject: [openib-general] 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <20061105064801.GV13381@stusta.de> References: <20061105064801.GV13381@stusta.de> Message-ID: On 11/5/06, Adrian Bunk wrote: > Subject : PCI: MMCONFIG breakage > References : http://lkml.org/lkml/2006/10/27/251 > Submitter : Jeff Chua > Status : unknown, both BIOS and Direct work > Greg said it should already be fixed Here's results with vanilla 2.6.19-rc4 (gcc version 3.4.5) ... 1) PCI access mode (Any) ... FAILED 2) PCI access mode (MMConfig) ... FAILED 3) PCI access mode (Direct) ... PASSED 4) PCI access mode (BIOS) ... PASSED Looks like it's still having problem with Dell Optiplex GX620. Thanks, Jeff. From mst at mellanox.co.il Sun Nov 5 12:19:32 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Sun, 5 Nov 2006 22:19:32 +0200 Subject: [openib-general] static ARP entries for IPoIB? In-Reply-To: References: Message-ID: <20061105201932.GA24900@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: static ARP entries for IPoIB? > > > I'd like to create static ARP entries for some IPoIB > > devices. The arp (8) command that I'm using doesn't > > know about ARPHRD_INFINIBAND (this is arp version 1.88 > > from net-tools-1.60-583.4.src.rpm, in SLES10.) Is there > > a version of arp (8) that works with IB? Or some other > > utility or means to make static ARP entries for IPoIB? > > "ip neigh add ..." BTW, can't/shouldn't we fix arp? -- MST From mst at mellanox.co.il Sun Nov 5 12:28:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 5 Nov 2006 22:28:30 +0200 Subject: [openib-general] Fwd: [ANNOUNCE] GIT 1.4.3.4 Message-ID: <20061105202830.GA25632@mellanox.co.il> FYI. I don't see anything significant here. ----- Forwarded message from Junio C Hamano ----- Subject: [ANNOUNCE] GIT 1.4.3.4 Date: Sun, 5 Nov 2006 11:01:45 +0200 From: "Junio C Hamano" The latest maintenance release GIT 1.4.3.4 is available at the usual places: http://www.kernel.org/pub/software/scm/git/ git-1.4.3.4.tar.{gz,bz2} (tarball) git-htmldocs-1.4.3.4.tar.{gz,bz2} (preformatted docs) git-manpages-1.4.3.4.tar.{gz,bz2} (preformatted docs) RPMS/$arch/git-*-1.4.3.4-1.$arch.rpm (RPM) Among many minor fixes and documentation updates, this contains these fixes: - revision traversal now treats --unpacked as commit filter, not traversal limiter. If you have unpacked commits that are parents of packed ones which are in turn parents of commits that are unpacked, running rev-list starting at the latest unpacked commits used to _stop_ at the first packed commit and older unpacked commits were not shown. With this update, the traversal does not stop at packed commits, and shows the older unpacked commits. The updated semantics is easier to use with git-repack --unpacked. 
- In a repository configured for shared access, if the permission bits of existing directories are misconfigured (e.g. running repository commands as root by mistake), a codepath to create a new object failed with incorrect error message. Fixed. - An earlier fix to cope with traditional-style patches that were generated with --unified=0 broke handling of creation and deletion diffs in git-apply. Fixed. ---------------------------------------------------------------- Andy Parkins (2): Minor grammar fixes for git-diff-index.txt git-clone documentation didn't mention --origin as equivalent of -o Christian Couder (3): Remove --syslog in git-daemon inetd documentation examples. Documentation: add upload-archive service to git-daemon. Documentation: add git in /etc/services. Edgar Toernig (1): Use memmove instead of memcpy for overlapping areas J. Bruce Fields (1): Documentation: updates to "Everyday GIT" Jakub Narebski (3): diff-format.txt: Combined diff format documentation supplement diff-format.txt: Correct information about pathnames quoting in patch format gitweb: Check git base URLs before generating URL from it Jan Harkes (1): Continue traversal when rev-list --unpacked finds a packed commit. Johannes Schindelin (1): link_temp_to_file: call adjust_shared_perm() only when we created the directory Junio C Hamano (9): Documentation: clarify refname disambiguation rules. combine-diff: a few more finishing touches. combine-diff: fix hunk_comment_line logic. combine-diff: honour --no-commit-id Surround "#define DEBUG 0" with "#ifndef DEBUG..#endif" quote.c: ensure the same quoting across platforms. revision traversal: --unpacked does not limit commit list anymore. link_temp_to_file: don't leave the path truncated on adjust_shared_perm failure apply: handle "traditional" creation/deletion diff correctly. Nicolas Pitre (1): pack-objects doesn't create random pack names Rene Scharfe (1): git-cherry: document limit and add diagram Shawn O. 
Pearce (3): Use ULONG_MAX rather than implicit cast of -1. Remove SIMPLE_PROGRAMS and make git-daemon a normal program. Remove unsupported C99 style struct initializers in git-archive. ----- End forwarded message ----- -- MST From hnguyen at de.ibm.com Sun Nov 5 12:40:38 2006 From: hnguyen at de.ibm.com (Hoang-Nam Nguyen) Date: Sun, 5 Nov 2006 21:40:38 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode Message-ID: <200611052140.38445.hnguyen@de.ibm.com> Hello Roland! This is a patch for ehca that ensures 4k alignment for firmware control blocks in 64k page mode, because the address returned by kzalloc() might not be 4k aligned when the kernel's 64k page size is enabled. We therefore introduced wrappers called ehca_alloc/free_fw_ctrlblock(), which use a slab cache for objects with 4k length and 4k alignment to alloc/free firmware control blocks in 64k page mode. In 4k page mode those wrappers are just defines for get_zeroed_page() and free_page(). Thanks!
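The alignment requirement described above (a zeroed block that is 4k large and 4k aligned) can be illustrated with a userspace analogue. This is only a sketch under stated assumptions: it uses C11 aligned_alloc() in place of the kernel slab cache, and the names sketch_alloc_ctrlblock, sketch_free_ctrlblock and is_4k_aligned are invented here; it is not the ehca implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CTRLBLOCK_SIZE 4096u	/* 4k size and 4k alignment, as the patch description requires */

/* Userspace analogue: return a zeroed block aligned to its own size, or NULL. */
static void *sketch_alloc_ctrlblock(void)
{
	void *p = aligned_alloc(CTRLBLOCK_SIZE, CTRLBLOCK_SIZE);

	if (p)
		memset(p, 0, CTRLBLOCK_SIZE);
	return p;
}

/* Memory from aligned_alloc() is released with plain free(). */
static void sketch_free_ctrlblock(void *p)
{
	free(p);
}

/* True when the low 12 address bits are clear, i.e. the block starts on a 4k boundary. */
static int is_4k_aligned(const void *p)
{
	return ((uintptr_t)p & (CTRLBLOCK_SIZE - 1)) == 0;
}
```

A general-purpose allocator only promises alignment suitable for ordinary objects, which is the userspace counterpart of the kzalloc() concern raised above; requesting the alignment explicitly makes the guarantee independent of the page size.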
Nam Signed-off-by: Hoang-Nam Nguyen --- ehca_hca.c | 17 ++++++++------- ehca_irq.c | 17 +++++++-------- ehca_iverbs.h | 8 +++++++ ehca_main.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++--------- ehca_mrmw.c | 8 +++---- ehca_qp.c | 12 ++++++----- 6 files changed, 89 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 5eae6ac..f77e626 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -40,6 +40,7 @@ */ #include "ehca_tools.h" +#include "ehca_iverbs.h" #include "hcp_if.h" int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) @@ -49,7 +50,7 @@ int ehca_query_device(struct ib_device * ib_device); struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -96,7 +97,7 @@ int ehca_query_device(struct ib_device * = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); query_device1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -109,7 +110,7 @@ int ehca_query_port(struct ib_device *ib ib_device); struct hipz_query_port *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -162,7 +163,7 @@ int ehca_query_port(struct ib_device *ib props->active_speed = 0x1; query_port1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -178,7 +179,7 @@ int ehca_query_pkey(struct ib_device *ib return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -193,7 +194,7 @@ int ehca_query_pkey(struct ib_device 
*ib memcpy(pkey, &rblock->pkey_entries + index, sizeof(u16)); query_pkey1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -211,7 +212,7 @@ int ehca_query_gid(struct ib_device *ibd return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -227,7 +228,7 @@ int ehca_query_gid(struct ib_device *ibd memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); query_gid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 048cc44..01b66d7 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -45,6 +45,7 @@ #include "ehca_iverbs.h" #include "ehca_tools.h" #include "hcp_if.h" #include "hipz_fns.h" +#include "ipz_pt_fn.h" #define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) #define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) @@ -137,38 +138,36 @@ int ehca_error_data(struct ehca_shca *sh u64 *rblock; unsigned long block_count; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (u64*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Cannot allocate rblock memory."); ret = -ENOMEM; goto error_data1; } + /* rblock must be 4K aligned and should be 4K large */ ret = hipz_h_error_data(shca->ipz_hca_handle, resource, rblock, &block_count); - if (ret == H_R_STATE) { + if (ret == H_R_STATE) ehca_err(&shca->ib_device, "No error data is available: %lx.", resource); - } else if (ret == H_SUCCESS) { int length; length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); - if (length > PAGE_SIZE) - length = PAGE_SIZE; + if (length > EHCA_PAGESIZE) + length = EHCA_PAGESIZE; print_error_data(shca, data, rblock, length); - } - else { + } else ehca_err(&shca->ib_device, "Error data could not be fetched: %lx", resource); - } - kfree(rblock); 
+ ehca_free_fw_ctrlblock(rblock); error_data1: return ret; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 319c39d..73e5f3f 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -179,4 +179,12 @@ int ehca_mmap_register(u64 physical,void int ehca_munmap(unsigned long addr, size_t len); +#ifdef CONFIG_PPC_64K_PAGES +void *ehca_alloc_fw_ctrlblock(void); +void ehca_free_fw_ctrlblock(void *ptr); +#else +#define ehca_alloc_fw_ctrlblock() get_zeroed_page(GFP_KERNEL) +#define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) +#endif + #endif diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 024d511..d66d66b 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -40,6 +40,9 @@ * POSSIBILITY OF SUCH DAMAGE. */ +#ifdef CONFIG_PPC_64K_PAGES +#include +#endif #include "ehca_classes.h" #include "ehca_iverbs.h" #include "ehca_mrmw.h" @@ -49,7 +52,7 @@ #include "hcp_if.h" MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0017"); +MODULE_VERSION("SVNEHCA_0018"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -94,11 +97,37 @@ spinlock_t ehca_cq_idr_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); + static struct list_head shca_list; /* list of all registered ehcas */ static spinlock_t shca_list_lock; static struct timer_list poll_eqs_timer; +#ifdef CONFIG_PPC_64K_PAGES +static struct kmem_cache *ctblk_cache = NULL; + +/* constructor ctblk_cache */ +void ehca_ctblk_ctor(void *ptr, kmem_cache_t *cache, unsigned long flags) +{ + memset(ptr, 0, EHCA_PAGESIZE); +} + +void *ehca_alloc_fw_ctrlblock(void) +{ + void *ret = kmem_cache_alloc(ctblk_cache, SLAB_KERNEL); + if (!ret) + ehca_gen_err("Out of memory for ctblk"); + return ret; +} + +void 
ehca_free_fw_ctrlblock(void *ptr) +{ + if (ptr) + kmem_cache_free(ctblk_cache, ptr); + +} +#endif + static int ehca_create_slab_caches(void) { int ret; @@ -133,6 +162,17 @@ static int ehca_create_slab_caches(void) goto create_slab_caches5; } +#ifdef CONFIG_PPC_64K_PAGES + ctblk_cache = kmem_cache_create("ehca_cache_ctblk", + EHCA_PAGESIZE, H_CB_ALIGNMENT, + SLAB_HWCACHE_ALIGN, + ehca_ctblk_ctor, NULL); + if (!ctblk_cache) { + ehca_gen_err("Cannot create ctblk SLAB cache."); + ehca_cleanup_mrmw_cache(); + goto create_slab_caches5; + } +#endif return 0; create_slab_caches5: @@ -157,6 +197,10 @@ static void ehca_destroy_slab_caches(voi ehca_cleanup_qp_cache(); ehca_cleanup_cq_cache(); ehca_cleanup_pd_cache(); +#ifdef CONFIG_PPC_64K_PAGES + if (ctblk_cache) + kmem_cache_destroy(ctblk_cache); +#endif } #define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) @@ -168,7 +212,7 @@ int ehca_sense_attributes(struct ehca_sh u64 h_ret; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_gen_err("Cannot allocate rblock memory."); return -ENOMEM; @@ -211,7 +255,7 @@ int ehca_sense_attributes(struct ehca_sh shca->sport[1].rate = IB_RATE_30_GBPS; num_ports1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -220,7 +264,7 @@ static int init_node_guid(struct ehca_sh int ret = 0; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -235,7 +279,7 @@ static int init_node_guid(struct ehca_sh memcpy(&shca->ib_device.node_guid, &rblock->node_guid, sizeof(u64)); init_node_guid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -431,7 +475,7 @@ static ssize_t ehca_show_##name(struct \ shca = dev->driver_data; \ \ - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); \ + rblock = (struct 
hipz_query_hca*)ehca_alloc_fw_ctrlblock(); \ if (!rblock) { \ dev_err(dev, "Can't allocate rblock memory."); \ return 0; \ @@ -439,12 +483,12 @@ static ssize_t ehca_show_##name(struct \ if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { \ dev_err(dev, "Can't query device properties"); \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ return 0; \ } \ \ data = rblock->name; \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ \ if ((strcmp(#name, "num_ports") == 0) && (ehca_nr_ports == 1)) \ return snprintf(buf, 256, "1\n"); \ @@ -752,7 +796,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0017)\n"); + "(Rel.: SVNEHCA_0018)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 5ca6544..1b77ac7 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -1013,7 +1013,7 @@ int ehca_reg_mr_rpages(struct ehca_shca u32 i; u64 *kpage; - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = (u64*)ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1092,7 +1092,7 @@ int ehca_reg_mr_rpages(struct ehca_shca ehca_reg_mr_rpages_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p " @@ -1124,7 +1124,7 @@ inline int ehca_rereg_mr_rereg1(struct e ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = (u64*)ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1181,7 +1181,7 @@ inline int ehca_rereg_mr_rereg1(struct e } ehca_rereg_mr_rereg1_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_rereg_mr_rereg1_exit0: if ( ret 
&& (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x " diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 4394123..3768d8d 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -811,8 +811,9 @@ static int internal_modify_qp(struct ib_ unsigned long spl_flags = 0; /* do query_qp to obtain current attr values */ - mqpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); - if (mqpcb == NULL) { + mqpcb = (struct hcp_modify_qp_control_block*) + ehca_alloc_fw_ctrlblock(); + if (!mqpcb) { ehca_err(ibqp->device, "Could not get zeroed page for mqpcb " "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); return -ENOMEM; @@ -1225,7 +1226,7 @@ modify_qp_exit2: } modify_qp_exit1: - kfree(mqpcb); + ehca_free_fw_ctrlblock(mqpcb); return ret; } @@ -1277,7 +1278,8 @@ int ehca_query_qp(struct ib_qp *qp, return -EINVAL; } - qpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL ); + qpcb = (struct hcp_modify_qp_control_block*) + ehca_alloc_fw_ctrlblock(); if (!qpcb) { ehca_err(qp->device,"Out of memory for qpcb " "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); @@ -1401,7 +1403,7 @@ int ehca_query_qp(struct ib_qp *qp, ehca_dmp(qpcb, 4*70, "qp_num=%x", qp->qp_num); query_qp_exit1: - kfree(qpcb); + ehca_free_fw_ctrlblock(qpcb); return ret; }
From hnguyen at de.ibm.com Sun Nov 5 12:41:28 2006 From: hnguyen at de.ibm.com (Hoang-Nam Nguyen) Date: Sun, 5 Nov 2006 21:41:28 +0100 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode Message-ID: <200611052141.29030.hnguyen@de.ibm.com> Hello Roland! This is another ehca patch for 64k page support. It fixes a bug where 4k-aligned addresses are mapped incorrectly in 64k page mode. Thanks! Nam Signed-off-by: Hoang-Nam Nguyen --- hcp_phyp.c | 5 +++--0018_64kpage_ioremap.patch 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.c b/drivers/infiniband/hw/ehca/hcp_phyp.c index 0b1a477..6237252 100644 --- a/drivers/infiniband/hw/ehca/hcp_phyp.c +++ b/drivers/infiniband/hw/ehca/hcp_phyp.c @@ -44,13 +44,14 @@ #include "hipz_hw.h" int hcall_map_page(u64 physaddr, u64 *mapaddr) { - *mapaddr = (u64)(ioremap(physaddr, EHCA_PAGESIZE)); + *mapaddr = (u64)ioremap((physaddr & PAGE_MASK), PAGE_SIZE) + + (physaddr & (~PAGE_MASK)); return 0; } int hcall_unmap_page(u64 mapaddr) { - iounmap((volatile void __iomem*)mapaddr); + iounmap((void __iomem*)(mapaddr & PAGE_MASK)); return 0; }
From hnguyen at de.ibm.com Sun Nov 5 12:42:20 2006 From: hnguyen at de.ibm.com (Hoang-Nam Nguyen) Date: Sun, 5 Nov 2006 21:42:20 +0100 Subject: [openib-general] [PATCH 2.6.19 3/4] ehca: Kconfig: activate scaling code as default to prevent drop packets (UD) Message-ID: <200611052142.20577.hnguyen@de.ibm.com> Hello Roland! Here is a small patch to ehca's Kconfig that enables the scaling code by default. After several measurements we saw that this feature prevents dropped packets (UD) under stress. Thus, enabling it helps to improve ehca's bandwidth through ipoib. Thanks! Nam Signed-off-by: Hoang-Nam Nguyen --- Kconfig | 1 + 1 files changed, 1 insertion(+) diff --git a/drivers/infiniband/hw/ehca/Kconfig b/drivers/infiniband/hw/ehca/Kconfig index 922389b..727b10d 100644 --- a/drivers/infiniband/hw/ehca/Kconfig +++ b/drivers/infiniband/hw/ehca/Kconfig @@ -10,6 +10,7 @@ config INFINIBAND_EHCA config INFINIBAND_EHCA_SCALING bool "Scaling support (EXPERIMENTAL)" depends on IBMEBUS && INFINIBAND_EHCA && HOTPLUG_CPU && EXPERIMENTAL + default y ---help--- eHCA scaling support schedules the CQ callbacks to different CPUs.
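The address arithmetic behind the hcall_map_page()/hcall_unmap_page() change in patch 2/4 can be sketched in plain user-space C. This is only an illustration under stated assumptions: the SKETCH_* macros and sketch_* helpers are hypothetical names, a 64k page size is hard-coded, and the "mapping base" is just a number standing in for whatever ioremap() would return. The point is that the containing 64k page is mapped, the 4k-aligned offset within that page is added back, and the unmap side masks the offset off again.

```c
#include <stdint.h>

/* Hypothetical stand-ins for PAGE_SIZE / PAGE_MASK with 64k pages. */
#define SKETCH_PAGE_SIZE 0x10000ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Start of the 64k page that contains addr (what gets mapped). */
static uint64_t sketch_page_base(uint64_t addr)
{
	return addr & SKETCH_PAGE_MASK;
}

/* Offset of addr inside its 64k page (non-zero for 4k-aligned addresses). */
static uint64_t sketch_page_offset(uint64_t addr)
{
	return addr & ~SKETCH_PAGE_MASK;
}

/*
 * Address handed back to the caller: the base of the mapping plus the
 * offset of the original physical address within its page.
 */
static uint64_t sketch_mapped_addr(uint64_t map_base, uint64_t physaddr)
{
	return map_base + sketch_page_offset(physaddr);
}
```

Applying sketch_page_base() to the value returned by sketch_mapped_addr() recovers the mapping base, which mirrors why the unmap path masks with PAGE_MASK before calling iounmap().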
From hnguyen at de.ibm.com Sun Nov 5 12:42:56 2006 From: hnguyen at de.ibm.com (Hoang-Nam Nguyen) Date: Sun, 5 Nov 2006 21:42:56 +0100 Subject: [openib-general] [PATCH 2.6.19 4/4] ehca: ehca_av.c use constant for max mtu Message-ID: <200611052142.56722.hnguyen@de.ibm.com> Hello Roland! This is a patch for ehca, mainly a code change to adhere to kernel coding style. It defines and uses a constant EHCA_MAX_MTU instead of a hardcoded value. Thanks! Nam Signed-off-by: Hoang-Nam Nguyen --- ehca_av.c | 5 ++--- hipz_hw.h | 2 ++ 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c index 3bac197..214e2fd 100644 --- a/drivers/infiniband/hw/ehca/ehca_av.c +++ b/drivers/infiniband/hw/ehca/ehca_av.c @@ -118,8 +118,7 @@ struct ib_ah *ehca_create_ah(struct ib_p } memcpy(&av->av.grh.word_1, &gid, sizeof(gid)); } - /* for the time being we use a hard coded PMTU of 2048 Bytes */ - av->av.pmtu = 4; + av->av.pmtu = EHCA_MAX_MTU; /* dgid comes in grh.word_3 */ memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid, @@ -193,7 +192,7 @@ int ehca_modify_ah(struct ib_ah *ah, str memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid)); } - new_ehca_av.pmtu = 4; /* see also comment in create_ah() */ + new_ehca_av.pmtu = EHCA_MAX_MTU; memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid, sizeof(ah_attr->grh.dgid)); diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h index 3fc92b0..fad9136 100644 --- a/drivers/infiniband/hw/ehca/hipz_hw.h +++ b/drivers/infiniband/hw/ehca/hipz_hw.h @@ -45,6 +45,8 @@ #define __HIPZ_HW_H__ #include "ehca_tools.h" +#define EHCA_MAX_MTU 4 + /* QP Table Entry Memory Map */ struct hipz_qptemm { u64 qpx_hcr;
From rdreier at cisco.com Sun Nov 5 13:19:55 2006 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 05 Nov 2006 13:19:55 -0800 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose References: <1162506570.29948.567.camel@brick.pathscale.com> <20061102231404.GC15403@mellanox.co.il> <454DE183.80405@voltaire.com> Message-ID: > I have mentioned this to Ralph in the past, just want to get ack/nak > on that from you: also on 64bit arch a block driver (eg SCSI LLD eg > SRP/iSER/etc) might get from higher level an SG whose pages are > **not** mapped into the kernel virtual address space. For example this > can happen with Direct I/O. No, I don't see how that could happen. Aren't all pages always mapped by the kernel direct mapping on 64-bit architectures? - R.
From or.gerlitz at gmail.com Sun Nov 5 13:25:27 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Sun, 5 Nov 2006 23:25:27 +0200 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: References: <1162506570.29948.567.camel@brick.pathscale.com> <20061102231404.GC15403@mellanox.co.il> <454DE183.80405@voltaire.com> Message-ID: <15ddcffd0611051325p7546ef75qe5a6af74fe97b56f@mail.gmail.com> On 11/5/06, Roland Dreier wrote: > > I have mentioned this to Ralph in the past, just want to get ack/nak > > on that from you: also on 64bit arch a block driver (eg SCSI LLD eg > > SRP/iSER/etc) might get from higher level an SG whose pages are > > **not** mapped into the kernel virtual address space. For example this > > can happen with Direct I/O. > > No, I don't see how that could happen. Aren't all pages always mapped > by the kernel direct mapping on 64-bit architectures? I don't know exactly how this happens, but one of the comments I got from Christoph on the iser code is that one can't assume page_address(sg[i].page) is non-NULL for an SG passed to a SCSI LLD. I think Direct I/O is one flow where this might happen. Or.
From or.gerlitz at gmail.com Sun Nov 5 13:34:35 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Sun, 5 Nov 2006 23:34:35 +0200 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <454A2C38.1040805@ichips.intel.com> References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com> <4540CA0E.9020807@voltaire.com> <4540DE9B.7070900@ichips.intel.com> <454A0CD6.3090703@voltaire.com> <454A2C38.1040805@ichips.intel.com> Message-ID: <15ddcffd0611051334w7ca0ea1ax6bf73fe25098aa61@mail.gmail.com> On 11/2/06, Sean Hefty wrote: > > Have you looked on that? from the compilation failure against > > libibverbs-1.0 the gap seem pretty small. If indeed this is the case, > > since libibverbs-1.1 is in development lets check with Roland if it > > makes sense for him to support these small-gap-features in > > libibverbs-1.0.X, i guess what matters here is ABI versions... > > I have not had time to look into this yet. I see. This is of great importance, but I think it comes second to the kernel push (we are now in the post-rc4 time frame, which means 2.6.20 is going to open within a few weeks...). However, if user/kernel ABI issues are involved here, we might get into trouble... anyway, whenever you have some time... > > I think we do want it. The rdma cm provide the means to offload ip > > multicast to ib multicast though registration (join/leave etc) with the > > ib_sa module. IP Multicast does use the send-only feature and hence IP > > Multicast offloading apps need it as well. The rdma cm framework fits > > very well for such apps and the ib_usa (which does not exist now, and i > > am not sure needs to exist... it was a project of a summer student with > > open-mpi that required that...) not. > Are you wanting the rdma cm to join the same multicast groups that ipoib does? > (This is simple to change, but it does not join the same groups today.) Actually, yes, we strongly prefer that you remove the rdma cm signature byte from MGIDs generated by the cma. > I will likely need to spin these patches again to incorporate the changes for > path failover, so adding in join options wouldn't be difficult. Are you just > wanting to see them added the rdma_join_multicast directly? Yes, I'd like the rdma cm consumer to be able to specify whether it wants to join as a full member or as send-only. This would also affect the rdma cm: it would call the verb that attaches the QP to the MGID only in the first case. Or.
From benh at kernel.crashing.org Sun Nov 5 14:48:40 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 06 Nov 2006 09:48:40 +1100 Subject: [openib-general] [PATCH 2.6.19 4/4] ehca: ehca_av.c use constant for max mtu In-Reply-To: <200611052142.56722.hnguyen@de.ibm.com> References: <200611052142.56722.hnguyen@de.ibm.com> Message-ID: <1162766921.28571.251.camel@localhost.localdomain> Can you fix your patch sending technique ? A mangled patch inline with a non-mangled one in attachment... that's a bit gross. Just get a proper one inline and be done with it. If your mailer can't be coerced into not damaging patches, then use another one for sending them. Cheers, Ben From arnd at arndb.de Sun Nov 5 15:45:58 2006 From: arnd at arndb.de (Arnd Bergmann) Date: Mon, 6 Nov 2006 00:45:58 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: <200611052140.38445.hnguyen@de.ibm.com> References: <200611052140.38445.hnguyen@de.ibm.com> Message-ID: <200611060045.59074.arnd@arndb.de> On Sunday 05 November 2006 21:40, Hoang-Nam Nguyen wrote: > +/* constructor ctblk_cache */ > +void ehca_ctblk_ctor(void *ptr, kmem_cache_t *cache, unsigned long flags) > +{ > + memset(ptr, 0, EHCA_PAGESIZE); > +} > + > +void *ehca_alloc_fw_ctrlblock(void) > +{ > + void *ret = kmem_cache_alloc(ctblk_cache, SLAB_KERNEL); > + if (!ret) > +  ehca_gen_err("Out of memory for ctblk"); > + return ret; > +} > + > +void ehca_free_fw_ctrlblock(void *ptr) > +{ > + if (ptr) > +  kmem_cache_free(ctblk_cache, ptr); > + This seems broken. You have a constructor for newly allocated objects, but there is no destructor and it seems that objects passed to ehca_free_fw_ctrlblock are not guaranteed to be initialized either. I'd simply move the memset into the alloc function and get rid of the constructor here. 
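The suggestion above (zero the block in the alloc function instead of relying on a slab constructor) can be sketched as follows. This is a user-space analogue only, not code from the patch or from this thread: aligned_alloc() and free() stand in for the kmem_cache calls, and SKETCH_CTRLBLOCK_SIZE and the sketch_* names are hypothetical. In the driver, the equivalent would presumably be a memset() right after the cache allocation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the 4K control block size and alignment. */
#define SKETCH_CTRLBLOCK_SIZE 4096

/*
 * Allocate one 4K-aligned control block and zero it on every call, so a
 * block that was dirtied by a previous user and then freed can never be
 * handed out with stale contents.
 */
static void *sketch_alloc_fw_ctrlblock(void)
{
	void *blk = aligned_alloc(SKETCH_CTRLBLOCK_SIZE, SKETCH_CTRLBLOCK_SIZE);

	if (blk)
		memset(blk, 0, SKETCH_CTRLBLOCK_SIZE);
	return blk;
}

/* Freeing needs no cleanup of the contents; free(NULL) is a no-op. */
static void sketch_free_fw_ctrlblock(void *blk)
{
	free(blk);
}
```

With the zeroing done at allocation time, the free path does not have to restore any "constructed" state, which is the property a constructor-only scheme would otherwise have to guarantee.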
Arnd <>< From kliteyn at dev.mellanox.co.il Sun Nov 5 23:00:39 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 06 Nov 2006 09:00:39 +0200 Subject: [openib-general] [PATCH] osm: comparing InformInfo records Message-ID: <454EDD97.5060000@dev.mellanox.co.il> Hi Hal [From Vu Pham] 1. sending InformInfo set subscribe for trap 64,65,144 - this works; however, osm.log outputs wrong value for "subscribe" field 2. sending InformInfo set *unsubscribe* for trap 64,65,144 - I'm using/formating the same mad as (1) except the "subscribe" field is zero; however, opensm response with status 0x200 [/From Vu Pham] 1. The received InformInfo struct was modified before dumping it. This was fixed as part of the second issue. 2. The function that compares InformInfo structures was just comparing the whole memory allocated for it, including reserved fields. Fixed to compare more selectively. Yevgeny Signed-off-by: Yevgeny Kliteynik Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 10064) +++ opensm/osm_sa_informinfo.c (working copy) @@ -345,7 +345,6 @@ osm_infr_rcv_process_set_method( ib_inform_info_t *p_recvd_inform_info; osm_infr_t inform_info_rec; /* actual inform record to be stored for reports */ osm_infr_t *p_infr; - uint8_t subscribe; ib_net32_t qpn; uint8_t resp_time_val; ib_api_status_t res; @@ -403,19 +402,11 @@ osm_infr_rcv_process_set_method( * * QPN: * internally we keep the QPN field of the InformInfo updated - * so we can simply compare the entire record - when finding such. - * IBA spec only requires the QPN field to be filled when an unsubscribe - * Set(InformInfo) is done. See table 119 p 740 QPN field - * - * SUBSCRIBE: - * For similar reasons we change the subscribe to 0 on the - * inserted/searched data + * so we can simply compare it in the record - when finding such. 
*/ - subscribe = p_recvd_inform_info->subscribe; - if (subscribe) + if (p_recvd_inform_info->subscribe) { - inform_info_rec.inform_record.inform_info.subscribe = 0; ib_inform_info_set_qpn( &inform_info_rec.inform_record.inform_info, inform_info_rec.report_addr.addr_type.gsi.remote_qp ); @@ -443,7 +434,7 @@ osm_infr_rcv_process_set_method( p_infr = osm_infr_get_by_rec( p_rcv->p_subn, p_rcv->p_log, &inform_info_rec ); /* check to see if the request was for subscribe = 1 */ - if (subscribe) + if (p_recvd_inform_info->subscribe) { /* validate the request for a new or update InformInfo */ if (__validate_infr( p_rcv, &inform_info_rec ) != TRUE) @@ -480,6 +471,8 @@ osm_infr_rcv_process_set_method( goto Exit; } + /* set the subscribe bit to 0 before adding the record */ + p_infr->inform_record.inform_info.subscribe = 0; /* Add this new osm_infr_t object to subnet object */ osm_infr_insert_to_db( p_rcv->p_subn, p_rcv->p_log, p_infr ); @@ -488,6 +481,8 @@ osm_infr_rcv_process_set_method( { /* Update the old instance of the osm_infr_t object */ p_infr->inform_record = inform_info_rec.inform_record; + /* set the subscribe bit to 0 after updating the record */ + p_infr->inform_record.inform_info.subscribe = 0; } } else Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 10064) +++ opensm/osm_inform.c (working copy) @@ -206,30 +206,133 @@ __match_inf_rec( osm_infr_t* p_infr_rec = (osm_infr_t *)context; osm_infr_t* p_infr = (osm_infr_t*)p_list_item; osm_log_t *p_log = p_infr_rec->p_infr_rcv->p_log; - cl_status_t status; - int32_t count1, count2; + cl_status_t status = CL_NOT_FOUND; + ib_gid_t all_zero_gid; + OSM_LOG_ENTER( p_log, __match_inf_rec); - count1 = memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, - sizeof(p_infr_rec->report_addr)); - if (count1) - osm_log( p_log, OSM_LOG_DEBUG, - "__match_inf_rec: " - "Differ by Address\n" ); - count2 = memcmp( - &p_infr->inform_record.inform_info, - 
&p_infr_rec->inform_record.inform_info, - sizeof(p_infr->inform_record.inform_info) ); - if (count2) - osm_log( p_log, OSM_LOG_DEBUG, - "__match_inf_rec: " - "Differ by InformInfo\n" ); - if ((count1 == 0) && (count2 == 0)) - status = CL_SUCCESS; + if ( memcmp(&p_infr->report_addr, + &p_infr_rec->report_addr, + sizeof(p_infr_rec->report_addr)) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by Address\n" ); + goto Exit; + } + + memset(&all_zero_gid, 0, sizeof(ib_gid_t)); + + /* if inform_info.gid is not zero, ignoring lid range */ + if ( memcmp(&p_infr_rec->inform_record.inform_info.gid, + &all_zero_gid, + sizeof(p_infr_rec->inform_record.inform_info.gid)) ) + { + if ( memcmp(&p_infr->inform_record.inform_info.gid, + &p_infr_rec->inform_record.inform_info.gid, + sizeof(p_infr->inform_record.inform_info.gid)) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.gid\n" ); + goto Exit; + } + } else - status = CL_NOT_FOUND; + { + if ( (p_infr->inform_record.inform_info.lid_range_begin != + p_infr_rec->inform_record.inform_info.lid_range_begin) || + (p_infr->inform_record.inform_info.lid_range_end != + p_infr_rec->inform_record.inform_info.lid_range_end) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.LIDRange\n" ); + goto Exit; + } + } + + if ( p_infr->inform_record.inform_info.is_generic != + p_infr_rec->inform_record.inform_info.is_generic ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.IsGeneric\n" ); + goto Exit; + } + if ( p_infr->inform_record.inform_info.trap_type != + p_infr_rec->inform_record.inform_info.trap_type ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.TrapType\n" ); + goto Exit; + } + + if ( p_infr->inform_record.inform_info.is_generic != + p_infr_rec->inform_record.inform_info.is_generic ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.IsGeneric\n" 
); + } + else if (p_infr->inform_record.inform_info.is_generic) + { + if ( p_infr->inform_record.inform_info.g_or_v.generic.trap_num != + p_infr_rec->inform_record.inform_info.g_or_v.generic.trap_num ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.TrapNumber\n" ); + goto Exit; + } + else if ( p_infr->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val != + p_infr_rec->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.QPNRespTimeVal\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_msb != + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_msb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.NodeTypeMSB\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_lsb != + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_lsb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.NodeTypeLSB\n" ); + else + status = CL_SUCCESS; + } + else + { + if ( p_infr->inform_record.inform_info.g_or_v.vend.dev_id != + p_infr_rec->inform_record.inform_info.g_or_v.vend.dev_id ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.DeviceID\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val != + p_infr_rec->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.QPNRespTimeVal\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_msb != + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_msb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.VendorIdMSB\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_lsb != + 
p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_lsb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.VendorIdLSB\n" ); + else + status = CL_SUCCESS; + } + + Exit: OSM_LOG_EXIT( p_log ); return status; } From michael at ellerman.id.au Sun Nov 5 23:33:18 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 06 Nov 2006 18:33:18 +1100 Subject: [openib-general] Can't build drivers/infiniband/hw/ipath/ipath_keys.c on arch/powerpc Message-ID: <1162798399.8175.24.camel@localhost.localdomain> Hi, Just a heads-up, drivers/infiniband/hw/ipath/ipath_keys.c doesn't build on powerpc because we have don't have bus_to_virt(). (Actually you can't select CONFIG_INFINIBAND_IPATH on mainline because powerpc doesn't enable PCI_MSI, but we will real soon now, and when we do this will start breaking people's all modconfig builds). cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rjwalsh at pathscale.com Sun Nov 5 23:42:48 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Sun, 05 Nov 2006 23:42:48 -0800 Subject: [openib-general] Can't build drivers/infiniband/hw/ipath/ipath_keys.c on arch/powerpc In-Reply-To: <1162798399.8175.24.camel@localhost.localdomain> References: <1162798399.8175.24.camel@localhost.localdomain> Message-ID: <454EE778.2030109@pathscale.com> Michael Ellerman wrote: > Hi, > > Just a heads-up, drivers/infiniband/hw/ipath/ipath_keys.c doesn't build > on powerpc because we have don't have bus_to_virt(). 
> > (Actually you can't select CONFIG_INFINIBAND_IPATH on mainline because > powerpc doesn't enable PCI_MSI, but we will real soon now, and when we > do this will start breaking people's all modconfig builds). Thanks for the heads-up. We're working to remove this requirement as we speak. Regards, Robert. From sean.hefty at intel.com Mon Nov 6 00:02:37 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 6 Nov 2006 00:02:37 -0800 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <15ddcffd0611051334w7ca0ea1ax6bf73fe25098aa61@mail.gmail.com> Message-ID: <000001c70179$ecf9c150$9fd4180a@amr.corp.intel.com> >> Are you wanting the rdma cm to join the same multicast groups that ipoib >does? >> (This is simple to change, but it does not join the same groups today.) > >Actually, yes, we strogly prefer that you will remove the rdma cm >signature byte from MGIDs generated by the cma. Would there be any issues that result from allowing this that requires more restrictive access to joining the groups? - Sean From benh at kernel.crashing.org Mon Nov 6 00:06:48 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 06 Nov 2006 19:06:48 +1100 Subject: [openib-general] Can't build drivers/infiniband/hw/ipath/ipath_keys.c on arch/powerpc In-Reply-To: <1162798399.8175.24.camel@localhost.localdomain> References: <1162798399.8175.24.camel@localhost.localdomain> Message-ID: <1162800409.28571.298.camel@localhost.localdomain> On Mon, 2006-11-06 at 18:33 +1100, Michael Ellerman wrote: > Hi, > > Just a heads-up, drivers/infiniband/hw/ipath/ipath_keys.c doesn't build > on powerpc because we have don't have bus_to_virt(). > > (Actually you can't select CONFIG_INFINIBAND_IPATH on mainline because > powerpc doesn't enable PCI_MSI, but we will real soon now, and when we > do this will start breaking people's all modconfig builds). 
I'm surprised that something as recent as infiniband requires a long-deprecated function bus_to_virt(). What is it trying to do there that needs that call ? Ben. From michael at ellerman.id.au Mon Nov 6 00:31:22 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 06 Nov 2006 19:31:22 +1100 Subject: [openib-general] Can't build drivers/infiniband/hw/ipath/ipath_keys.c on arch/powerpc In-Reply-To: <454EE778.2030109@pathscale.com> References: <1162798399.8175.24.camel@localhost.localdomain> <454EE778.2030109@pathscale.com> Message-ID: <1162801882.8175.42.camel@localhost.localdomain> On Sun, 2006-11-05 at 23:42 -0800, Robert Walsh wrote: > Michael Ellerman wrote: > > Hi, > > > > Just a heads-up, drivers/infiniband/hw/ipath/ipath_keys.c doesn't build > > on powerpc because we have don't have bus_to_virt(). > > > > (Actually you can't select CONFIG_INFINIBAND_IPATH on mainline because > > powerpc doesn't enable PCI_MSI, but we will real soon now, and when we > > do this will start breaking people's all modconfig builds). > > Thanks for the heads-up. We're working to remove this requirement as we > speak. Sweet. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person From HNGUYEN at de.ibm.com Mon Nov 6 02:00:00 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Mon, 6 Nov 2006 11:00:00 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: <200611060045.59074.arnd@arndb.de> Message-ID: Hi Arnd, > This seems broken. 
You have a constructor for newly allocated objects, but > there is no destructor and it seems that objects passed to > ehca_free_fw_ctrlblock are not guaranteed to be initialized either. > I'd simply move the memset into the alloc function and get rid of the > constructor here. Yep, I was not aware that ctor is not called for every kmem_cache_alloc(). Thx for pointing this out. Nam From ogerlitz at voltaire.com Mon Nov 6 02:23:20 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 06 Nov 2006 12:23:20 +0200 Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work with proposed 2.6.20 kernel CMA In-Reply-To: <000001c70179$ecf9c150$9fd4180a@amr.corp.intel.com> References: <000001c70179$ecf9c150$9fd4180a@amr.corp.intel.com> Message-ID: <454F0D18.601@voltaire.com> Sean Hefty wrote: >>> Are you wanting the rdma cm to join the same multicast groups that ipoib >> does? >>> (This is simple to change, but it does not join the same groups today.) >> Actually, yes, we strogly prefer that you will remove the rdma cm >> signature byte from MGIDs generated by the cma. > > Would there be any issues that result from allowing this that requires more > restrictive access to joining the groups? I don't think so, generally speaking, using multicast has its security drawbacks eg as was mentioned by Roland earlier on this thread, however I don't see how the rdma cm MGID signature byte provides meaningful help against them. Or. 
From jackm at dev.mellanox.co.il Mon Nov 6 02:25:18 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 6 Nov 2006 12:25:18 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> Message-ID: <200611061225.19077.jackm@dev.mellanox.co.il> On Wednesday 25 October 2006 00:25, Sean Hefty wrote: > The following set of patches expand the rdma_cm support to include > UD and multicast, and expose the rdma_cm to userspace. I would like to > target the 2.6.20 kernel, but at least getting them into one or more > branches would be helpful for other developers to test against these > changes. > I have incorporated your rdma patches for 2.6.20 (1-7 v2) into our driver, and am experiencing problems with multicast.c. When I unload the Infiniband driver I am getting a kernel Oops (consistently at the same location, with the same stack output). I am doing the driver unload immediately after reboot (the boot process loads the infiniband driver). I am not running opensm. Looks to me like a reference counting problem. Below is the relevant data. Jack ================================================ Console output: # /etc/init.d/openibd stop Shutting down interface ib0: [ OK ] Shutting down interface ib1: [ OK ] Message from syslogd at Mon Nov 6 12:00:06 2006 ... kernel: BUG: spinlock bad magic on CPU#1, ib_mad2/1570 Message from syslogd at Mon Nov 6 12:00:06 2006 ... 
kernel: general protection fault: 0000 [1] SMP ================================================ lsmod output (infiniband modules only): ib_mthca 123972 0 ib_umad 18736 0 ib_sa 25920 0 ib_mad 39864 3 ib_mthca,ib_umad,ib_sa ib_core 56448 4 ib_mthca,ib_umad,ib_sa,ib_mad ================================= ps -ef shows that the following command has hung: /sbin/modprobe -r ib_ipoib =========================================== /var/log/messages: BUG: spinlock bad magic on CPU#1, ib_mad2/1570 general protection fault: 0000 [1] SMP CPU 1 Modules linked in: nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc vfat fat dm_mirror dm_mod button battery ac ohci_hcd ehci_hcd i2c_nforce2 i2c_core ib_mthca ib_umad ib_sa ib_mad ib_core tg3 ext3 jbd sata_nv libata mptsas scsi_transport_sas sd_mod Pid: 1570, comm: ib_mad2 Not tainted 2.6.17.7 #3 RIP: 0010:[] {spin_bug+116} RSP: 0018:ffff81013bb95ca8 EFLAGS: 00010002 RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffffffff8044c057 RDX: ffffffff804a7f18 RSI: 0000000000000046 RDI: ffffffff804a7f00 RBP: ffff81013ba36668 R08: 00000000ffffffff R09: 0000000000000003 R10: 0000000100000000 R11: 0000000000000000 R12: ffff81013ba36668 R13: 0000000000000283 R14: 0000000000000000 R15: ffffffff8808f2cf FS: 00002b86142cdb00(0000) GS:ffff81013fc616d0(0000) knlGS:00000000f7f038e0 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000003b5a131eb0 CR3: 000000013aa5d000 CR4: 00000000000006e0 Process ib_mad2 (pid: 1570, threadinfo ffff81013bb94000, task ffff81013f2768b0) Stack: 0000000000000003 ffff81013ba36668 ffff81013ba36660 ffffffff802ddc8d ffff81013c88e8b8 ffff81013ba36660 ffff81013ba36668 ffffffff80428b2b 0000000000000246 ffffffff88097efe Call Trace: {_raw_spin_lock+28} {_spin_lock_irqsave+11} {:ib_sa:release_group+26} {:ib_sa:mcast_work_handler+1280} {_spin_unlock_irq+7} {:ib_mad:timeout_sends+0} {:ib_sa:ib_sa_mcmember_rec_callback+64} {_spin_unlock_irq+7} {:ib_sa:send_handler+74} {:ib_mad:timeout_sends+397} 
{run_workqueue+161} {worker_thread+0} {keventd_create_kthread+0} {worker_thread+261} {default_wake_function+0} {keventd_create_kthread+0} {default_wake_function+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} Code: 44 8b 83 04 01 00 00 48 8d 8b a0 02 00 00 8b 55 04 41 89 c1 From mst at mellanox.co.il Mon Nov 6 02:35:44 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Nov 2006 12:35:44 +0200 Subject: [openib-general] [PATCH 1/7 v2] for 2.6.20 ib/ib_sa: add tracking of multicast join / leave requests In-Reply-To: <000101c6f7bc$c0a46770$a6d4180a@amr.corp.intel.com> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <000101c6f7bc$c0a46770$a6d4180a@amr.corp.intel.com> Message-ID: <20061106103544.GC29344@mellanox.co.il> Quoting Sean Hefty : > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > index 3faa182..d90f804 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c > @@ -60,14 +60,11 @@ static DEFINE_MUTEX(mcast_mutex); > /* Used for all multicast joins (broadcast, IPv4 mcast and IPv6 mcast) */ > struct ipoib_mcast { > struct ib_sa_mcmember_rec mcmember; > + struct ib_sa_multicast *mc; > struct ipoib_ah *ah; > > struct rb_node rb_node; > struct list_head list; > - struct completion done; > - > - int query_id; > - struct ib_sa_query *query; > > unsigned long created; > unsigned long backoff; > @@ -299,18 +296,22 @@ static int ipoib_mcast_join_finish(struc > return 0; > } > > -static void > +static int > ipoib_mcast_sendonly_join_complete(int status, > - struct ib_sa_mcmember_rec *mcmember, > - void *mcast_ptr) > + struct ib_sa_multicast *multicast) > { > - struct ipoib_mcast *mcast = mcast_ptr; > + struct ipoib_mcast *mcast = multicast->context; > struct net_device *dev = mcast->dev; > struct ipoib_dev_priv *priv = netdev_priv(dev); > > + /* We trap for port events ourselves. 
*/ > + if (status == -ENETRESET) > + return 0; > + > if (!status) > - ipoib_mcast_join_finish(mcast, mcmember); > - else { > + status = ipoib_mcast_join_finish(mcast, &multicast->rec); > + > + if (status) { > if (mcast->logcount++ < 20) > ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for " > IPOIB_GID_FMT ", status %d\n", > @@ -325,11 +326,10 @@ ipoib_mcast_sendonly_join_complete(int s > spin_unlock_irq(&priv->tx_lock); > > /* Clear the busy flag so we try again */ > - clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > - mcast->query = NULL; > + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, > + &mcast->flags); > } > - > - complete(&mcast->done); > + return status; > } > > static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) > @@ -359,35 +359,33 @@ #endif > rec.port_gid = priv->local_gid; > rec.pkey = cpu_to_be16(priv->pkey); > > - init_completion(&mcast->done); > - > - ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, &rec, > - IB_SA_MCMEMBER_REC_MGID | > - IB_SA_MCMEMBER_REC_PORT_GID | > - IB_SA_MCMEMBER_REC_PKEY | > - IB_SA_MCMEMBER_REC_JOIN_STATE, > - 1000, GFP_ATOMIC, > - ipoib_mcast_sendonly_join_complete, > - mcast, &mcast->query); > - if (ret < 0) { > - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed (ret = %d)\n", > + mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, > + priv->port, &rec, > + IB_SA_MCMEMBER_REC_MGID | > + IB_SA_MCMEMBER_REC_PORT_GID | > + IB_SA_MCMEMBER_REC_PKEY | > + IB_SA_MCMEMBER_REC_JOIN_STATE, > + GFP_ATOMIC, > + ipoib_mcast_sendonly_join_complete, > + mcast); > + if (IS_ERR(mcast->mc)) { > + ret = PTR_ERR(mcast->mc); > + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n", > ret); > } else { > ipoib_dbg_mcast(priv, "no multicast record for " IPOIB_GID_FMT > ", starting join\n", > IPOIB_GID_ARG(mcast->mcmember.mgid)); > - > - mcast->query_id = ret; > } > > return ret; > } > > -static void ipoib_mcast_join_complete(int 
status, > - struct ib_sa_mcmember_rec *mcmember, > - void *mcast_ptr) > +static int ipoib_mcast_join_complete(int status, > + struct ib_sa_multicast *multicast) > { > - struct ipoib_mcast *mcast = mcast_ptr; > + struct ipoib_mcast *mcast = multicast->context; > struct net_device *dev = mcast->dev; > struct ipoib_dev_priv *priv = netdev_priv(dev); > > @@ -395,23 +393,24 @@ static void ipoib_mcast_join_complete(in > " (status %d)\n", > IPOIB_GID_ARG(mcast->mcmember.mgid), status); > > - if (!status && !ipoib_mcast_join_finish(mcast, mcmember)) { > + /* We trap for port events ourselves. */ > + if (status == -ENETRESET) > + return 0; > + > + if (!status) > + status = ipoib_mcast_join_finish(mcast, &multicast->rec); > + > + if (!status) { > mcast->backoff = 1; > mutex_lock(&mcast_mutex); > if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > queue_work(ipoib_workqueue, &priv->mcast_task); > mutex_unlock(&mcast_mutex); > - complete(&mcast->done); > - return; > - } > - > - if (status == -EINTR) { > - complete(&mcast->done); > - return; > + return 0; > } > > - if (status && mcast->logcount++ < 20) { > - if (status == -ETIMEDOUT || status == -EINTR) { > + if (mcast->logcount++ < 20) { > + if (status == -ETIMEDOUT) { > ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT > ", status %d\n", > IPOIB_GID_ARG(mcast->mcmember.mgid), > @@ -428,23 +427,18 @@ static void ipoib_mcast_join_complete(in > if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) > mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS; > > - mutex_lock(&mcast_mutex); > + /* Clear the busy flag so we try again */ > + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > > + mutex_lock(&mcast_mutex); > spin_lock_irq(&priv->lock); > - mcast->query = NULL; > - > - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) { > - if (status == -ETIMEDOUT) > - queue_work(ipoib_workqueue, &priv->mcast_task); > - else > - queue_delayed_work(ipoib_workqueue, &priv->mcast_task, > - mcast->backoff * HZ); > - } else > - 
complete(&mcast->done); > + if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) > + queue_delayed_work(ipoib_workqueue, &priv->mcast_task, > + mcast->backoff * HZ); > spin_unlock_irq(&priv->lock); > mutex_unlock(&mcast_mutex); > > - return; > + return status; > } > > static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, > @@ -493,15 +487,14 @@ static void ipoib_mcast_join(struct net_ > rec.hop_limit = priv->broadcast->mcmember.hop_limit; > } > > - init_completion(&mcast->done); > - > - ret = ib_sa_mcmember_rec_set(&ipoib_sa_client, priv->ca, priv->port, > - &rec, comp_mask, mcast->backoff * 1000, > - GFP_ATOMIC, ipoib_mcast_join_complete, > - mcast, &mcast->query); > - > - if (ret < 0) { > - ipoib_warn(priv, "ib_sa_mcmember_rec_set failed, status %d\n", ret); > + set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port, > + &rec, comp_mask, GFP_KERNEL, > + ipoib_mcast_join_complete, mcast); > + if (IS_ERR(mcast->mc)) { > + clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags); > + ret = PTR_ERR(mcast->mc); > + ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret); > > mcast->backoff *= 2; > if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS) > @@ -513,8 +506,7 @@ static void ipoib_mcast_join(struct net_ > &priv->mcast_task, > mcast->backoff * HZ); > mutex_unlock(&mcast_mutex); > - } else > - mcast->query_id = ret; > + } > } > > void ipoib_mcast_join_task(void *dev_ptr) > @@ -538,7 +530,7 @@ void ipoib_mcast_join_task(void *dev_ptr > priv->local_rate = attr.active_speed * > ib_width_enum_to_int(attr.active_width); > } else > - ipoib_warn(priv, "ib_query_port failed\n"); > + ipoib_warn(priv, "ib_query_port failed\n"); > } > > if (!priv->broadcast) { > @@ -565,7 +557,8 @@ void ipoib_mcast_join_task(void *dev_ptr > } > > if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > - ipoib_mcast_join(dev, priv->broadcast, 0); > + if (!test_bit(IPOIB_MCAST_FLAG_BUSY, 
&priv->broadcast->flags)) > + ipoib_mcast_join(dev, priv->broadcast, 0); > return; > } > > @@ -620,26 +613,9 @@ int ipoib_mcast_start_thread(struct net_ > return 0; > } > > -static void wait_for_mcast_join(struct ipoib_dev_priv *priv, > - struct ipoib_mcast *mcast) > -{ > - spin_lock_irq(&priv->lock); > - if (mcast && mcast->query) { > - ib_sa_cancel_query(mcast->query_id, mcast->query); > - mcast->query = NULL; > - spin_unlock_irq(&priv->lock); > - ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n", > - IPOIB_GID_ARG(mcast->mcmember.mgid)); > - wait_for_completion(&mcast->done); > - } > - else > - spin_unlock_irq(&priv->lock); > -} > - > int ipoib_mcast_stop_thread(struct net_device *dev, int flush) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - struct ipoib_mcast *mcast; > > ipoib_dbg_mcast(priv, "stopping multicast thread\n"); > > @@ -655,52 +631,27 @@ int ipoib_mcast_stop_thread(struct net_d > if (flush) > flush_workqueue(ipoib_workqueue); > > - wait_for_mcast_join(priv, priv->broadcast); > - > - list_for_each_entry(mcast, &priv->multicast_list, list) > - wait_for_mcast_join(priv, mcast); > - > return 0; > } > > static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > - struct ib_sa_mcmember_rec rec = { > - .join_state = 1 > - }; > int ret = 0; > > - if (!test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) > - return 0; > - > - ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", > - IPOIB_GID_ARG(mcast->mcmember.mgid)); > - > - rec.mgid = mcast->mcmember.mgid; > - rec.port_gid = priv->local_gid; > - rec.pkey = cpu_to_be16(priv->pkey); > + if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) { > + ipoib_dbg_mcast(priv, "leaving MGID " IPOIB_GID_FMT "\n", > + IPOIB_GID_ARG(mcast->mcmember.mgid)); > > - /* Remove ourselves from the multicast group */ > - ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), > - 
&mcast->mcmember.mgid); > - if (ret) > - ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); > + /* Remove ourselves from the multicast group */ > + ret = ipoib_mcast_detach(dev, be16_to_cpu(mcast->mcmember.mlid), > + &mcast->mcmember.mgid); > + if (ret) > + ipoib_warn(priv, "ipoib_mcast_detach failed (result = %d)\n", ret); > + } > > - /* > - * Just make one shot at leaving and don't wait for a reply; > - * if we fail, too bad. > - */ > - ret = ib_sa_mcmember_rec_delete(&ipoib_sa_client, priv->ca, priv->port, &rec, > - IB_SA_MCMEMBER_REC_MGID | > - IB_SA_MCMEMBER_REC_PORT_GID | > - IB_SA_MCMEMBER_REC_PKEY | > - IB_SA_MCMEMBER_REC_JOIN_STATE, > - 0, GFP_ATOMIC, NULL, > - mcast, &mcast->query); > - if (ret < 0) > - ipoib_warn(priv, "ib_sa_mcmember_rec_delete failed " > - "for leave (result = %d)\n", ret); > + if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > + ib_sa_free_multicast(mcast->mc); > > return 0; > } > @@ -753,7 +704,7 @@ void ipoib_mcast_send(struct net_device > dev_kfree_skb_any(skb); > } > > - if (mcast->query) > + if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) > ipoib_dbg_mcast(priv, "no address vector, " > "but multicast join already started\n"); > else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) > @@ -910,7 +861,6 @@ void ipoib_mcast_restart_task(void *dev_ > > /* We have to cancel outside of the spinlock */ > list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { > - wait_for_mcast_join(priv, mcast); > ipoib_mcast_leave(mcast->dev, mcast); > ipoib_mcast_free(mcast); > } > diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h > index 97715b0..3b957e5 100644 > --- a/include/rdma/ib_sa.h > +++ b/include/rdma/ib_sa.h > @@ -285,18 +285,6 @@ int ib_sa_path_rec_get(struct ib_sa_clie > void *context, > struct ib_sa_query **query); > > -int ib_sa_mcmember_rec_query(struct ib_sa_client *client, > - struct ib_device *device, u8 port_num, > - u8 method, > - struct ib_sa_mcmember_rec *rec, > - 
ib_sa_comp_mask comp_mask, > - int timeout_ms, gfp_t gfp_mask, > - void (*callback)(int status, > - struct ib_sa_mcmember_rec *resp, > - void *context), > - void *context, > - struct ib_sa_query **query); > - > int ib_sa_service_rec_query(struct ib_sa_client *client, > struct ib_device *device, u8 port_num, > u8 method, > @@ -309,93 +297,87 @@ int ib_sa_service_rec_query(struct ib_sa > void *context, > struct ib_sa_query **sa_query); > > +struct ib_sa_multicast { > + struct ib_sa_mcmember_rec rec; > + ib_sa_comp_mask comp_mask; > + int (*callback)(int status, > + struct ib_sa_multicast *multicast); > + void *context; > +}; > + > /** > - * ib_sa_mcmember_rec_set - Start an MCMember set query > - * @client:SA client > - * @device:device to send query on > - * @port_num: port number to send query on > - * @rec:MCMember Record to send in query > - * @comp_mask:component mask to send in query > - * @timeout_ms:time to wait for response > - * @gfp_mask:GFP mask to use for internal allocations > - * @callback:function called when query completes, times out or is > - * canceled > - * @context:opaque user context passed to callback > - * @sa_query:query context, used to cancel query > + * ib_sa_join_multicast - Initiates a join request to the specified multicast > + * group. > + * @client: SA client > + * @device: Device associated with the multicast group. > + * @port_num: Port on the specified device to associate with the multicast > + * group. > + * @rec: SA multicast member record specifying group attributes. > + * @comp_mask: Component mask indicating which group attributes of %rec are > + * valid. > + * @gfp_mask: GFP mask for memory allocations. > + * @callback: User callback invoked once the join operation completes. > + * @context: User specified context stored with the ib_sa_multicast structure. > * > - * Send an MCMember Set query to the SA (eg to join a multicast > - * group). 
The callback function will be called when the query > - * completes (or fails); status is 0 for a successful response, -EINTR > - * if the query is canceled, -ETIMEDOUT is the query timed out, or > - * -EIO if an error occurred sending the query. The resp parameter of > - * the callback is only valid if status is 0. > + * This call initiates a multicast join request with the SA for the specified > + * multicast group. If the join operation is started successfully, it returns > + * an ib_sa_multicast structure that is used to track the multicast operation. > + * Users must free this structure by calling ib_free_multicast, even if the > + * join operation later fails. (The callback status is non-zero.) > * > - * If the return value of ib_sa_mcmember_rec_set() is negative, it is > - * an error code. Otherwise it is a query ID that can be used to > - * cancel the query. > + * If the join operation fails; status will be non-zero, with the following > + * failures possible: > + * -ETIMEDOUT: The request timed out. > + * -EIO: An error occurred sending the query. > + * -EINVAL: The MCMemberRecord values differed from the existing group's. > + * -ENETRESET: Indicates that an fatal error has occurred on the multicast > + * group, and the user must rejoin the group to continue using it. 
> */ > -static inline int > -ib_sa_mcmember_rec_set(struct ib_sa_client *client, > - struct ib_device *device, u8 port_num, > - struct ib_sa_mcmember_rec *rec, > - ib_sa_comp_mask comp_mask, > - int timeout_ms, gfp_t gfp_mask, > - void (*callback)(int status, > - struct ib_sa_mcmember_rec *resp, > - void *context), > - void *context, > - struct ib_sa_query **query) > -{ > - return ib_sa_mcmember_rec_query(client, device, port_num, > - IB_MGMT_METHOD_SET, > - rec, comp_mask, > - timeout_ms, gfp_mask, callback, > - context, query); > -} > +struct ib_sa_multicast *ib_sa_join_multicast(struct ib_sa_client *client, > + struct ib_device *device, u8 port_num, > + struct ib_sa_mcmember_rec *rec, > + ib_sa_comp_mask comp_mask, gfp_t gfp_mask, > + int (*callback)(int status, > + struct ib_sa_multicast > + *multicast), > + void *context); > > /** > - * ib_sa_mcmember_rec_delete - Start an MCMember delete query > - * @client:SA client > - * @device:device to send query on > - * @port_num: port number to send query on > - * @rec:MCMember Record to send in query > - * @comp_mask:component mask to send in query > - * @timeout_ms:time to wait for response > - * @gfp_mask:GFP mask to use for internal allocations > - * @callback:function called when query completes, times out or is > - * canceled > - * @context:opaque user context passed to callback > - * @sa_query:query context, used to cancel query > + * ib_free_multicast - Frees the multicast tracking structure, and releases > + * any reference on the multicast group. > + * @multicast: Multicast tracking structure allocated by ib_join_multicast. > * > - * Send an MCMember Delete query to the SA (eg to leave a multicast > - * group). The callback function will be called when the query > - * completes (or fails); status is 0 for a successful response, -EINTR > - * if the query is canceled, -ETIMEDOUT is the query timed out, or > - * -EIO if an error occurred sending the query. 
The resp parameter of > - * the callback is only valid if status is 0. > + * This call blocks until the multicast identifier is destroyed. It may > + * not be called from within the multicast callback; however, returning a non- > + * zero value from the callback will result in destroying the multicast > + * tracking structure. > + */ > +void ib_sa_free_multicast(struct ib_sa_multicast *multicast); > + > +/** > + * ib_get_mcmember_rec - Looks up a multicast member record by its MGID and > + * returns it if found. > + * @device: Device associated with the multicast group. > + * @port_num: Port on the specified device to associate with the multicast > + * group. > + * @mgid: optional MGID of multicast group. > + * @rec: Location to copy SA multicast member record. > * > - * If the return value of ib_sa_mcmember_rec_delete() is negative, it > - * is an error code. Otherwise it is a query ID that can be used to > - * cancel the query. > + * If an MGID is specified, returns an existing multicast member record if > + * one is found for the local port. If no MGID is specified, or the specified > + * MGID is 0, returns a multicast member record filled in with default values > + * that may be used to create a new multicast group. 
> */ > -static inline int > -ib_sa_mcmember_rec_delete(struct ib_sa_client *client, > - struct ib_device *device, u8 port_num, > - struct ib_sa_mcmember_rec *rec, > - ib_sa_comp_mask comp_mask, > - int timeout_ms, gfp_t gfp_mask, > - void (*callback)(int status, > - struct ib_sa_mcmember_rec *resp, > - void *context), > - void *context, > - struct ib_sa_query **query) > -{ > - return ib_sa_mcmember_rec_query(client, device, port_num, > - IB_SA_METHOD_DELETE, > - rec, comp_mask, > - timeout_ms, gfp_mask, callback, > - context, query); > -} > +int ib_sa_get_mcmember_rec(struct ib_device *device, u8 port_num, > + union ib_gid *mgid, struct ib_sa_mcmember_rec *rec); > + > +/** > + * ib_init_ah_from_mcmember - Initialize address handle attributes based on > + * an SA multicast member record. > + */ > +int ib_init_ah_from_mcmember(struct ib_device *device, u8 port_num, > + struct ib_sa_mcmember_rec *rec, > + struct ib_ah_attr *ah_attr); > > /** > * ib_init_ah_from_path - Initialize address handle attributes based on an SA > OK, I went over this again. Some questions: So instead of ib_sa_mcmember_rec_set which returned the query by pointer we now have ib_sa_join_multicast which returns the pointer. This part looks OK I guess, but I still do not understand why does the patch tinker with logic (e.g. setting/clearing IPOIB_MCAST_FLAG_BUSY) in the IPoIB code. *All* the new API was supposed to do is add reference counting on top of join/leave queries. So why the need to rework the logic? Is it possible that what is missing is the analog of ib_sa_cancel_query - a non-blocking call which would guarantee that join callback is invoked soon? If yes, I think we should just add that to the new API. Sean, it seems like you are trying to push some unrelated re-factoring in the IPoIB code, which might be fine, but should be done separately from the update to the new API - could be before, or after the multicast change. 
-- MST From HNGUYEN at de.ibm.com Mon Nov 6 02:39:09 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Mon, 6 Nov 2006 11:39:09 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: <200611060045.59074.arnd@arndb.de> Message-ID: Hi Roland! > Arnd wrote: > This seems broken. You have a constructor for newly allocated objects, but > there is no destructor and it seems that objects passed to > ehca_free_fw_ctrlblock are not guaranteed to be initialized either. > I'd simply move the memset into the alloc function and get rid of the > constructor here. As Arnd stated I need to fix this ctor issue. Do you prefer me to resend all patches in proper format (non-mangled inline) or just this one bug fix? Thanks! Nam From ogerlitz at voltaire.com Mon Nov 6 04:40:37 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 06 Nov 2006 14:40:37 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <200611061225.19077.jackm@dev.mellanox.co.il> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <200611061225.19077.jackm@dev.mellanox.co.il> Message-ID: <454F2D45.6060903@voltaire.com> Jack Morgenstein wrote: > On Wednesday 25 October 2006 00:25, Sean Hefty wrote: >> The following set of patches expand the rdma_cm support to include >> UD and multicast, and expose the rdma_cm to userspace. I would like to >> target the 2.6.20 kernel, but at least getting them into one or more >> branches would be helpful for other developers to test against these >> changes. > I have incorporated your rdma patches for 2.6.20 (1-7 v2) into > our driver, and am experiencing problems with multicast.c. By "our driver" do you mean the OFED 1.1 IB kernel drivers (eg ib_sa rdma_cm rdma_ucm etc)? I am using this patch series on top of Roland git tree of few weeks ago (eg more or less as 2.6.19-rc3) and have not got this crash. 
> When I unload the Infiniband driver I am getting a
> kernel Oops (consistently at the same location, with the same
> stack output).

By "unload the driver" you mean modprobe -r to which module?

If you know which modprobe -r causes this, you might be able to isolate the problem by running rmmod on the affected modules one by one from a script; the offending one will probably crash in this scheme as well.

Or.

From S.Linev at gsi.de Mon Nov 6 04:47:47 2006
From: S.Linev at gsi.de (Linev Sergei)
Date: Mon, 6 Nov 2006 13:47:47 +0100
Subject: [openib-general] problem with libibverbs and ib_rdma_bw test
Message-ID: <5AF41FB491229F4687F4DAA4456D0BD5588E87@W2K3MAILSV.gsi.de>

Hi

I was trying to install OFED 1.1 on SuSE 9.3 Linux (2.6.11.4-20a-smp).
We are using Opterons with Mellanox MHES18-XT PCIe host adapters.
Previously we were using IB Gold 1.8.0 and mostly working with uDAPL.

Now I am trying uDAPL with OFED and found that it is not working for me.
I see the same problem that was reported here:

http://openib.org/pipermail/openib-general/2006-October/028077.html

I tried to find the place where the problem is reported and was able to trace it down to the
dapls_ib_mr_register() call, where ibv_reg_mr() returns a zero handle (openib_cma/dapl_ib_mem.c, line 197).

Following the recommendation in this mail,
http://openib.org/pipermail/openib-general/2006-October/028107.html

I tried to check whether the ibverbs interface is working.
I found that the basic latency/bandwidth tests work, but when I try to run
any RDMA-based test, I immediately see a problem. For instance, when I run
ib_rdma_bw on node01 and "ib_rdma_bw node01" on node02, I see the same error message on both nodes:

node01> ib_rdma_bw
29808: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0
29808:pp_init_ctx: Couldn't allocate MR

Is this perhaps a well-known problem that I can solve just by upgrading to a newer Linux?

Any help is appreciated.
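A side note on the "Couldn't allocate MR" error: ibv_reg_mr() has to pin the memory it registers, so one thing worth ruling out is a low locked-memory limit (RLIMIT_MEMLOCK, the value "ulimit -l" prints). The following is only a minimal sketch of such a check. It uses the POSIX getrlimit() call and nothing from libibverbs; the helper name memlock_limit_ok is hypothetical, and it cannot tell whether the limit is the actual cause on a given system.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of libibverbs): returns 1 if the soft
 * RLIMIT_MEMLOCK limit allows locking at least `needed` bytes, 0 if it
 * does not, and -1 if the limit could not be read.
 */
static int memlock_limit_ok(unsigned long long needed)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;                      /* could not query the limit */
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                       /* "unlimited" */
    return (unsigned long long)rl.rlim_cur >= needed ? 1 : 0;
}
```

Calling memlock_limit_ok(65536) would compare the limit against the 64 KB buffer size shown in the ib_rdma_bw output above; real programs also need room for QPs, CQs and other internal resources, so a passing check is a lower bound only.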
Sergey Linev ########################################## Experiment Data Processing (EE) Gesellschaft für Schwerionenforschung (GSI) Planckstr. 1 D-64291 Darmstadt, Germany ########################################## From jsquyres at cisco.com Mon Nov 6 05:11:01 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 6 Nov 2006 08:11:01 -0500 Subject: [openib-general] This is the last time I'm asking... In-Reply-To: References: Message-ID: Having received no replies for 2 weeks as to why it is useful to have MVAPICH in the OpenFabrics SVN, I can only conclude that no one cares. If someone does care, please respond to my original questions included below ASAP (originally posted 23 Oct, 27 Oct, 1 Nov). I therefore make the motion to remove MVAPICH from the OpenFabrics SVN (all the source is still available via the OSU SVN and other distribution points). Specifically, I motion to do the following around COB tomorrow (7 Nov 2006): svn rm https://openib.org/svn/gen2/trunk/src/userspace/mpi Any objections? On Nov 1, 2006, at 10:53 AM, Jeff Squyres wrote: > Forwarding this to the mvapich-discuss list because it has gotten > zero replies on the openib-general list. If someone from OSU could > reply, it would be most helpful. Thanks. > > > Begin forwarded message: > >> From: Jeff Squyres >> Date: October 27, 2006 11:05:17 AM EDT >> To: openib >> Subject: Re: [mvapich] Announcing the release of MVAPICH2 0.9.6 >> with on-demand connection management, multi-core optimized shared >> memory communication and memory hook support >> >> Any response from the OSU crew? >> >> Can someone provide a reason why MVAPICH is still in OpenIB's >> Subversion repository? Please see my original mail, below, for >> more detailed questions. >> >> Thanks. >> >> >> On Oct 23, 2006, at 7:36 AM, Jeff Squyres wrote: >> >>> On Oct 22, 2006, at 11:53 PM, Dhabaleswar Panda wrote: >>> >>>> A stripped down version of this release is also available at the >>>> OpenIB SVN. 
>>> >>> I see this statement in every MVAPICH release notice and it >>> continues to puzzle me. >>> >>> I understand that there was a use for an alternate distribution >>> source before MVAPICH became open source. But now that the >>> MVAPICH code bases are freely available from OSU via multiple >>> mechanisms (anonymous SVN, tarball download, etc.), why is a >>> "stripped down version" maintained in the OpenIB SVN? >>> >>> 1. What, exactly, is the difference between the MVAPICH available >>> from OSU and the "stripped down version" in the OpenIB SVN? >>> >>> 2. Why would someone choose to download the "stripped down >>> version" from the OpenIB SVN? Have any real users/customers done >>> so? >>> >>> 3. What is the point of maintaining yet more flavors of MVAPICH >>> -- aren't there enough already (multiple versions from OSU, more >>> versions available from each IB vendor)? >>> >>> DK -- can you please explain? Thanks. >>> >>> -- >>> Jeff Squyres >>> Server Virtualization Business Unit >>> Cisco Systems >>> >>> >> >> >> -- >> Jeff Squyres >> Server Virtualization Business Unit >> Cisco Systems >> >> > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From jackm at dev.mellanox.co.il Mon Nov 6 06:16:29 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 6 Nov 2006 16:16:29 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <454F2D45.6060903@voltaire.com> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <200611061225.19077.jackm@dev.mellanox.co.il> <454F2D45.6060903@voltaire.com> Message-ID: <200611061616.30434.jackm@dev.mellanox.co.il> On Monday 06 November 2006 14:40, Or Gerlitz wrote: > By "unload the driver" you mean modprobe -r to which module? > As indicated in my mail, the active modprobe was to module ib_ipoib (as part of the "/etc/init.d/openibd stop" script). 
>By "our driver" do you mean the OFED 1.1 IB kernel drivers (eg ib_sa
>rdma_cm rdma_ucm etc)? I am using this patch series on top of Roland git
>tree of few weeks ago (eg more or less as 2.6.19-rc3) and have not got
>this crash.

The InfiniBand sources are based on the 2.6.19-rc4 kernel sources, with Sean's patches added in.
Userspace was taken from the current svn trunk userspace code. I also took Sean's librdmacm patch and Stephen Wise's perftest patch so that userspace would work with Sean's modified cm event structure.

All this was run on top of kernel 2.6.17.7 (with the help of some backport patches, attached).

I tried the driver out on kernels 2.6.18 and SLES 10.0 (2.6.16.21), and the reported failure has not occurred so far.

Will keep you posted.

- Jack
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2_6_17_patches.tgz
Type: application/x-tgz
Size: 5848 bytes
Desc: not available
URL:

From tziporet at dev.mellanox.co.il Mon Nov 6 06:42:40 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 06 Nov 2006 16:42:40 +0200
Subject: [openib-general] problem with libibverbs and ib_rdma_bw test
In-Reply-To: <5AF41FB491229F4687F4DAA4456D0BD5588E87@W2K3MAILSV.gsi.de>
References: <5AF41FB491229F4687F4DAA4456D0BD5588E87@W2K3MAILSV.gsi.de>
Message-ID: <454F49E0.5040309@dev.mellanox.co.il>

Linev Sergei wrote:
> Hi
>
> I was trying to install OFED 1.1 on SuSE 9.3 Linux (2.6.11.4-20a-smp).
> We are using Opterons with Mellanox MHES18-XT PCIe host adapters.
> Previously we were using IB Gold 1.8.0 and mostly working with uDAPL.
>

OFED 1.1 does not support SuSE 9.3.
It would be best to move to a later OS that was qualified (the list is available in the release notes).

Tziporet

From erezz at voltaire.com Mon Nov 6 06:45:12 2006
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 06 Nov 2006 16:45:12 +0200
Subject: [openib-general] [PATCH 4/7] IB/iser - Use the new verbs DMA mapping functions
In-Reply-To: <1162506829.29948.574.camel@brick.pathscale.com>
References: <1162506829.29948.574.camel@brick.pathscale.com>
Message-ID: <454F4A78.4060606@voltaire.com>

Ralph Campbell wrote:
> IB/iser - Use the new verbs DMA mapping functions
>
> This patch converts iser to use the new verbs DMA mapping functions
> for kernel verbs consumers.
>
>

I tested it and ran some sanity checks and it looks ok.

--
____________________________________________________________
Erez Zilber | 972-9-971-7689
Software Engineer, Storage Team
Voltaire – The Grid Backbone
www.voltaire.com

From panda at cse.ohio-state.edu Mon Nov 6 06:53:09 2006
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Mon, 6 Nov 2006 09:53:09 -0500 (EST)
Subject: [openib-general] [mvapich-discuss] This is the last time I'm asking...
In-Reply-To: from "Jeff Squyres" at Nov 06, 2006 08:11:01 AM
Message-ID: <200611061453.kA6Er91h001869@xi.cse.ohio-state.edu>

Jeff:

May I know by what `right' you are making this motion to remove the code?

Having the code there was decided by the OpenIB community and the organizers. It needs to be decided by the community, not by an individual person.

Let me suggest that we discuss this at the Developers Summit at SC '06. If the Open Fabrics community no longer wants the code to be there and prefers to download it from the OSU SVN site, we can proceed accordingly.

Thanks,

DK

> Having received no replies for 2 weeks as to why it is useful to have
> MVAPICH in the OpenFabrics SVN, I can only conclude that no one
> cares.
If someone does care, please respond to my original questions > included below ASAP (originally posted 23 Oct, 27 Oct, 1 Nov). > > I therefore make the motion to remove MVAPICH from the OpenFabrics > SVN (all the source is still available via the OSU SVN and other > distribution points). Specifically, I motion to do the following > around COB tomorrow (7 Nov 2006): > > svn rm https://openib.org/svn/gen2/trunk/src/userspace/mpi > > Any objections? > > > > On Nov 1, 2006, at 10:53 AM, Jeff Squyres wrote: > > > Forwarding this to the mvapich-discuss list because it has gotten > > zero replies on the openib-general list. If someone from OSU could > > reply, it would be most helpful. Thanks. > > > > > > Begin forwarded message: > > > >> From: Jeff Squyres > >> Date: October 27, 2006 11:05:17 AM EDT > >> To: openib > >> Subject: Re: [mvapich] Announcing the release of MVAPICH2 0.9.6 > >> with on-demand connection management, multi-core optimized shared > >> memory communication and memory hook support > >> > >> Any response from the OSU crew? > >> > >> Can someone provide a reason why MVAPICH is still in OpenIB's > >> Subversion repository? Please see my original mail, below, for > >> more detailed questions. > >> > >> Thanks. > >> > >> > >> On Oct 23, 2006, at 7:36 AM, Jeff Squyres wrote: > >> > >>> On Oct 22, 2006, at 11:53 PM, Dhabaleswar Panda wrote: > >>> > >>>> A stripped down version of this release is also available at the > >>>> OpenIB SVN. > >>> > >>> I see this statement in every MVAPICH release notice and it > >>> continues to puzzle me. > >>> > >>> I understand that there was a use for an alternate distribution > >>> source before MVAPICH became open source. But now that the > >>> MVAPICH code bases are freely available from OSU via multiple > >>> mechanisms (anonymous SVN, tarball download, etc.), why is a > >>> "stripped down version" maintained in the OpenIB SVN? > >>> > >>> 1. 
What, exactly, is the difference between the MVAPICH available > >>> from OSU and the "stripped down version" in the OpenIB SVN? > >>> > >>> 2. Why would someone choose to download the "stripped down > >>> version" from the OpenIB SVN? Have any real users/customers done > >>> so? > >>> > >>> 3. What is the point of maintaining yet more flavors of MVAPICH > >>> -- aren't there enough already (multiple versions from OSU, more > >>> versions available from each IB vendor)? > >>> > >>> DK -- can you please explain? Thanks. > >>> > >>> -- > >>> Jeff Squyres > >>> Server Virtualization Business Unit > >>> Cisco Systems > >>> > >>> > >> > >> > >> -- > >> Jeff Squyres > >> Server Virtualization Business Unit > >> Cisco Systems > >> > >> > > > > > > -- > > Jeff Squyres > > Server Virtualization Business Unit > > Cisco Systems > > > > > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss at cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From heiko.carstens at de.ibm.com Mon Nov 6 06:58:39 2006 From: heiko.carstens at de.ibm.com (Heiko Carstens) Date: Mon, 6 Nov 2006 15:58:39 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: <200611052140.38445.hnguyen@de.ibm.com> References: <200611052140.38445.hnguyen@de.ibm.com> Message-ID: <20061106145839.GA9387@osiris.boeblingen.de.ibm.com> > +#ifdef CONFIG_PPC_64K_PAGES > +void *ehca_alloc_fw_ctrlblock(void); > +void ehca_free_fw_ctrlblock(void *ptr); > +#else > +#define ehca_alloc_fw_ctrlblock() get_zeroed_page(GFP_KERNEL) > +#define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) > +#endif Maybe you want to make sure that ehca_alloc_fw_ctrlblock() always returns a void pointer, so you can avoid all the casts in your code? 

static inline void *ehca_alloc_fw_ctrlblock(void)
{
	return (void *)get_zeroed_page(GFP_KERNEL);
}

From S.Linev at gsi.de Mon Nov 6 07:13:57 2006
From: S.Linev at gsi.de (Linev Sergei)
Date: Mon, 6 Nov 2006 16:13:57 +0100
Subject: [openib-general] problem with libibverbs and ib_rdma_bw test
Message-ID: <5AF41FB491229F4687F4DAA4456D0BD5588E89@W2K3MAILSV.gsi.de>

Hi

I found that only SuSE Enterprise Linux 10 is supported.
Can I try regular SuSE Linux 10.1, or does it have to be the Enterprise edition?

And another question: why do the Mellanox release notes for OFED 1.1 state that uDAPL is not supported?
Is it not working, or does it just have some bugs/unattended features?

Regards,
Sergey

> -----Original Message-----
> From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
> Sent: Monday, 6 November 2006 15:43
> To: Linev Sergei
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] problem with libibverbs and ib_rdma_bw test
>
>
> Linev Sergei wrote:
> > Hi
> >
> > I was trying to install OFED 1.1 on SuSE 9.3 Linux (2.6.11.4-20a-smp).
> > We are using Opterons with Mellanox MHES18-XT PCIe host adapters.
> > Previously we were using IB Gold 1.8.0 and mostly working with uDAPL.
> >
>
> OFED 1.1 does not support SuSE 9.3.
> Best if you move to some later OS that was qualified (list is available
> in the release notes)
>
> Tziporet
>
>
>

From jlentini at netapp.com Mon Nov 6 07:23:03 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 6 Nov 2006 10:23:03 -0500 (EST)
Subject: [openib-general] problem with libibverbs and ib_rdma_bw test
In-Reply-To: <5AF41FB491229F4687F4DAA4456D0BD5588E87@W2K3MAILSV.gsi.de>
References: <5AF41FB491229F4687F4DAA4456D0BD5588E87@W2K3MAILSV.gsi.de>
Message-ID:

Did you follow these directions from the libibverbs README?
-- https://openfabrics.org/svn/gen2/trunk/src/userspace/libibverbs/README To use IB verbs from userspace, a process must also have permission to tell the kernel to lock sufficient memory for all of your registered memory regions as well as the memory used internally by IB resources such as queue pairs (QPs) and completion queues (CQs). To check your resource limits, use the command ulimit -l (or "limit memorylocked" for csh-like shells). If you see a small number such as 32 (the units are KB) then you will need to increase this limit. This is usually done for ordinary users via the file /etc/security/limits.conf. More configuration may be necessary if you are logging in via OpenSSH and your sshd is configured to use privilege separation. On Mon, 6 Nov 2006, Linev Sergei wrote: > Hi > > I was trying to install OFED 1.1 on SuSE 9.3 Linux (2.6.11.4-20a-smp). > We are using Opterons with Mellanox MHES18-XT PCIe host adapters. > Previousely we were using IB Gold 1.8.0 and mostly working with uDAPL. > > Now I trying uDAPL with OFED and find out, that it is not working for me. > Actually, I see the same problem as it was reported here: > > http://openib.org/pipermail/openib-general/2006-October/028077.html > > I was trying to find out a place, where it reports a problem and was able to trace down to > dapls_ib_mr_register() call, where ibv_reg_mr() returns zero handle (openib_cma/dapl_ib_mem.c, line 197) > > According to recomendation in following mail, > http://openib.org/pipermail/openib-general/2006-October/028107.html > > I was trying to trace if ibverbs interface is working. > And I find out that basic latency/bandwidth tests are working, but when I try to run > and rdma-based tests, I immidiately see a problem. 
For instance, when I call on node01 > ib_rdma_bw, and on node02 "ib_rdma_bw node01", I see on both nodes same error message: > > node01> ib_rdma_bw > 29808: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 > 29808:pp_init_ctx: Couldn't allocate MR > > Probably, it is well known problem and I can solve it just by upgrading to the newest Linux? > > Any help is appreciated. > > Sergey Linev > > > ########################################## > Experiment Data Processing (EE) > Gesellschaft für Schwerionenforschung (GSI) > Planckstr. 1 > D-64291 Darmstadt, Germany > ########################################## > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jlentini at netapp.com Mon Nov 6 07:52:19 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 6 Nov 2006 10:52:19 -0500 (EST) Subject: [openib-general] static ARP entries for IPoIB? In-Reply-To: <20061105201932.GA24900@mellanox.co.il> References: <20061105201932.GA24900@mellanox.co.il> Message-ID: On Sun, 5 Nov 2006, Michael S. Tsirkin wrote: > Quoting r. Roland Dreier : > > Subject: Re: static ARP entries for IPoIB? > > > > > I'd like to create static ARP entries for some IPoIB > > > devices. The arp (8) command that I'm using doesn't > > > know about ARPHRD_INFINIBAND (this is arp version 1.88 > > > from net-tools-1.60-583.4.src.rpm, in SLES10.) Is there > > > a version of arp (8) that works with IB? Or some other > > > utility or means to make static ARP entries for IPoIB? > > > > "ip neigh add ..." > > BTW, can't/shouldn't we fix arp? When I looked at the issue a while back, I concluded that it should not be changed: http://openib.org/pipermail/openib-general/2006-March/018270.html FYI: I sent the ip neighbor patch to the iproute2 maintainer. 
He wanted a patch to support IPoIB MAC addresses for all ip subcommands. You can see some of the suggestions here:

http://openib.org/pipermail/openib-general/2006-March/018487.html

From HNGUYEN at de.ibm.com Mon Nov 6 08:13:44 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 6 Nov 2006 17:13:44 +0100
Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode
In-Reply-To: <20061106145839.GA9387@osiris.boeblingen.de.ibm.com>
Message-ID:

heicars2 at de.ltcfwd.linux.ibm.com wrote on 06.11.2006 15:58:39:
> Maybe you want to make sure that ehca_alloc_fw_ctrlblock() always returns a
> void pointer, so you can avoid all the casts in your code?
> static inline void *ehca_alloc_fw_ctrlblock(void)
> {
> 	return (void *)get_zeroed_page(GFP_KERNEL);
> }

Yes, good point. That helps to avoid different warnings between 4k and 64k page modes.

Thx
Nam

From jsquyres at cisco.com Mon Nov 6 08:27:26 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 6 Nov 2006 11:27:26 -0500
Subject: [openib-general] [mvapich-discuss] This is the last time I'm asking...
In-Reply-To: <200611061453.kA6Er91h001869@xi.cse.ohio-state.edu>
References: <200611061453.kA6Er91h001869@xi.cse.ohio-state.edu>
Message-ID: <44892EBC-6E14-4AED-BEDF-4000BA51D569@cisco.com>

As I explained in my mail, no one had replied to any of the posts containing my very directed and specific questions (not even you -- and you still haven't), so I figured that no one cared. That's not an unreasonable assumption given that I posted the same questions 3 times and got silence in return.

I am unaware of any special "right" required to make a motion. Are there some protocols (perhaps a la Robert's Rules of Order) that are typically used for making a motion? I haven't seen any...?

The agenda for the SC Developer's Summit is already over-full. This conversation is fine to begin in e-mail; a good start would be answering my original questions.

Thanks!

On Nov 6, 2006, at 9:53 AM, Dhabaleswar Panda wrote: > Jeff: > > May I know on with what `right' you are making this motion to remove > the code. > > To have the code there was decided by the OpenIB community and the > organizers. It needs to be decided by the community, not by an > individual person. > > Let me suggest that we we discuss this at the Developers Summit at SC > '06. If the Open Fabrics community no longer wants the code to be > there and will prefer to download it from the OSU SVN site, we can > proceed accordingly. > > Thanks, > > DK > > >> Having received no replies for 2 weeks as to why it is useful to have >> MVAPICH in the OpenFabrics SVN, I can only conclude that no one >> cares. If someone does care, please respond to my original questions >> included below ASAP (originally posted 23 Oct, 27 Oct, 1 Nov). >> >> I therefore make the motion to remove MVAPICH from the OpenFabrics >> SVN (all the source is still available via the OSU SVN and other >> distribution points). Specifically, I motion to do the following >> around COB tomorrow (7 Nov 2006): >> >> svn rm https://openib.org/svn/gen2/trunk/src/userspace/mpi >> >> Any objections? >> >> >> >> On Nov 1, 2006, at 10:53 AM, Jeff Squyres wrote: >> >>> Forwarding this to the mvapich-discuss list because it has gotten >>> zero replies on the openib-general list. If someone from OSU could >>> reply, it would be most helpful. Thanks. >>> >>> >>> Begin forwarded message: >>> >>>> From: Jeff Squyres >>>> Date: October 27, 2006 11:05:17 AM EDT >>>> To: openib >>>> Subject: Re: [mvapich] Announcing the release of MVAPICH2 0.9.6 >>>> with on-demand connection management, multi-core optimized shared >>>> memory communication and memory hook support >>>> >>>> Any response from the OSU crew? >>>> >>>> Can someone provide a reason why MVAPICH is still in OpenIB's >>>> Subversion repository? Please see my original mail, below, for >>>> more detailed questions. >>>> >>>> Thanks. 
>>>> >>>> >>>> On Oct 23, 2006, at 7:36 AM, Jeff Squyres wrote: >>>> >>>>> On Oct 22, 2006, at 11:53 PM, Dhabaleswar Panda wrote: >>>>> >>>>>> A stripped down version of this release is also available at the >>>>>> OpenIB SVN. >>>>> >>>>> I see this statement in every MVAPICH release notice and it >>>>> continues to puzzle me. >>>>> >>>>> I understand that there was a use for an alternate distribution >>>>> source before MVAPICH became open source. But now that the >>>>> MVAPICH code bases are freely available from OSU via multiple >>>>> mechanisms (anonymous SVN, tarball download, etc.), why is a >>>>> "stripped down version" maintained in the OpenIB SVN? >>>>> >>>>> 1. What, exactly, is the difference between the MVAPICH available >>>>> from OSU and the "stripped down version" in the OpenIB SVN? >>>>> >>>>> 2. Why would someone choose to download the "stripped down >>>>> version" from the OpenIB SVN? Have any real users/customers done >>>>> so? >>>>> >>>>> 3. What is the point of maintaining yet more flavors of MVAPICH >>>>> -- aren't there enough already (multiple versions from OSU, more >>>>> versions available from each IB vendor)? >>>>> >>>>> DK -- can you please explain? Thanks. >>>>> >>>>> -- >>>>> Jeff Squyres >>>>> Server Virtualization Business Unit >>>>> Cisco Systems >>>>> >>>>> >>>> >>>> >>>> -- >>>> Jeff Squyres >>>> Server Virtualization Business Unit >>>> Cisco Systems >>>> >>>> >>> >>> >>> -- >>> Jeff Squyres >>> Server Virtualization Business Unit >>> Cisco Systems >>> >>> >> >> >> -- >> Jeff Squyres >> Server Virtualization Business Unit >> Cisco Systems >> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss at cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From mst at mellanox.co.il Mon Nov 6 08:32:58 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Mon, 6 Nov 2006 18:32:58 +0200 Subject: [openib-general] ipoib mtu problem with UDP Message-ID: <20061106163258.GC31647@mellanox.co.il> I tried using ifconfig to limit the ipoib mtu. Once I do this on *either* both server and client, or only on the client side, UDP seems to stop working: #ifconfig ib0 mtu 512 #netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Socket Message Elapsed Messages CPU Service Size Size Time Okay Errors Throughput Util Demand bytes bytes secs # # MBytes/sec % SS us/KB 118784 65507 10.00 27582 0 172.2 26.33 inf 118784 10.00 0 0.0 23.40 inf Things work fine if the mtu on the client side is 2044: # ifconfig ib0 mtu 2044 # netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo Socket Message Elapsed Messages CPU Service Size Size Time Okay Errors Throughput Util Demand bytes bytes secs # # MBytes/sec % SS us/KB 118784 65507 10.00 78488 0 490.1 25.31 2.310 118784 10.00 68534 428.0 24.55 2.241 Tested with kernel 2.6.19-rc4 and netperf 2.4.2. -- MST From tziporet at dev.mellanox.co.il Mon Nov 6 08:44:40 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 06 Nov 2006 18:44:40 +0200 Subject: [openib-general] problem with libibverbs and ib_rdma_bw test In-Reply-To: <5AF41FB491229F4687F4DAA4456D0BD5588E89@W2K3MAILSV.gsi.de> References: <5AF41FB491229F4687F4DAA4456D0BD5588E89@W2K3MAILSV.gsi.de> Message-ID: <454F6678.6030907@dev.mellanox.co.il> Linev Sergei wrote: > Hi > > I found, that only SuSE Enterprise Linux 10 is supported. > Can I try normal SuSE Linux 10.1 or it should be only Enterprise edition? > > We have run here testing on SuSE Linux 10, but not full QA. 
Another option is to install SuSE Linux 10 and then replace the kernel with the latest kernel.org release (e.g. 2.6.18) > And another question - why in Mellanox release notes for OFED 1.1 stated, that uDAPL is not supported - > is it not working or just has some bugs/unattended features? > uDAPL is supported by OFED, and was QAed by several companies. In Mellanox release notes there is a list of ULPs that Mellanox tested and can give support for, and uDAPL is not one of them. Tziporet From rdreier at cisco.com Mon Nov 6 08:50:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 08:50:44 -0800 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: <200611060045.59074.arnd@arndb.de> (Arnd Bergmann's message of "Mon, 6 Nov 2006 00:45:58 +0100") References: <200611052140.38445.hnguyen@de.ibm.com> <200611060045.59074.arnd@arndb.de> Message-ID: > I'd simply move the memset into the alloc function and get rid of the > constructor here. Slightly better still would be to use kmem_cache_zalloc() (save a tiny bit of text by getting rid of the call to memset). - R. From rdreier at cisco.com Mon Nov 6 08:51:53 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 08:51:53 -0800 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: (Hoang-Nam Nguyen's message of "Mon, 6 Nov 2006 11:39:09 +0100") References: Message-ID: > As Arnd stated I need to fix this ctor issue. Do you prefer me to resend > all patches in proper format (non-mangled inline) or just this one bug fix? I have the rest of the patches, so you just need to resend a fixed version of this one. BTW see my previous response about kmem_cache_zalloc() -- I think that's the best way to fix this. In the future though if you can make a patch-sending script or something that lets you avoid the attachments that would be great.
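[Editorial sketch] The suggestion in the two messages above is to have the allocator return zeroed memory instead of clearing it with a separate memset() or a slab constructor. kmem_cache_zalloc() is kernel-only, so what follows is only a userspace analogy using calloc(); struct ctrl_block and both helper names are hypothetical and do not reflect the ehca control block layout.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a firmware control block; not the ehca layout. */
struct ctrl_block {
    unsigned long handle;
    unsigned char payload[4096];
};

/* Two-step variant: allocate, then clear by hand. */
static struct ctrl_block *ctrl_block_alloc_memset(void)
{
    struct ctrl_block *cb = malloc(sizeof(*cb));

    if (cb)
        memset(cb, 0, sizeof(*cb));
    return cb;
}

/* One-step variant: the allocator hands back zeroed memory,
 * so the caller carries no explicit memset() call. */
static struct ctrl_block *ctrl_block_zalloc(void)
{
    return calloc(1, sizeof(struct ctrl_block));
}
```

Both helpers return an all-zero block or NULL; the second simply moves the zeroing into the allocator, which is the shape of the change being proposed for the driver.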
From rdreier at cisco.com Mon Nov 6 08:52:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 08:52:45 -0800 Subject: [openib-general] Can't build drivers/infiniband/hw/ipath/ipath_keys.c on arch/powerpc In-Reply-To: <1162800409.28571.298.camel@localhost.localdomain> ( Benjamin Herrenschmidt's message of "Mon, 06 Nov 2006 19:06:48 +1100") References: <1162798399.8175.24.camel@localhost.localdomain> <1162800409.28571.298.camel@localhost.localdomain> Message-ID: > I'm surprised that something as recent as infiniband requires a > long-deprecated function bus_to_virt(). > > What is it trying to do there that needs that call ? Don't ask -- just enjoy the fact that you don't know about this... From rdreier at cisco.com Mon Nov 6 09:07:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 09:07:33 -0800 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: <20061105152244.GC14245@mellanox.co.il> (Michael S. Tsirkin's message of "Sun, 5 Nov 2006 17:22:44 +0200") References: <1162506776.29948.572.camel@brick.pathscale.com> <20061105152244.GC14245@mellanox.co.il> Message-ID: > Hmm, since ib_dma_unmap_single calls a function through a pointer, > this seems to introduce overhead on data path operations in ipoib. > For apps like ipoib always working with low memory, I think it is important to avoid this > overhead of extra indirect function calls at least on systems without IO MMU - > where e.g. dma_unmap_single is empty. > This probably means you need some of architecture-dependent code, > but should be possible - look at how dma API is implemented for an example. > And this applies to all ULPs on systems without high memory. How is this possible? The IOMMU might be detected at runtime, and you can always have a system with multiple HCAs of different types, so I don't see how the conditional can be avoided. 
It is unfortunate but in this case I think we have to accept the cost of making the code general. It is sad that ipath is likely the only driver that will ever use this. Maybe something that the speed-freaks would like would be to add a hidden config option that turns all the ib_dma_xxx stuff into NOP macros unless ipath is being built. Of course that doesn't help all that much because all the distros etc will enable ipath. Anyway, I suspect the penalty is near-zero anyway, since the pointer being tested will likely be in cache and the branch predictor will learn which way the branch goes. (Except on a heterogeneous system I suppose) - R. From todd.rimmer at qlogic.com Mon Nov 6 09:10:54 2006 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Mon, 6 Nov 2006 11:10:54 -0600 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061190BD8ED@EPEXCH2.qlogic.org> > -----Original Message----- > From: Todd Rimmer > Sent: Thursday, November 02, 2006 7:12 PM > To: 'Michael S. Tsirkin'; Arlin Davis > Cc: Or Gerlitz; openib-general; Arlin Davis > Subject: RE: [openib-general] scaling issues, was: uDAPL cma: add support > for address and route retries, call disconnect when recving dreq > > > > From: Michael S. Tsirkin > > Sent: Thursday, November 02, 2006 5:55 PM > > To: Arlin Davis > > Cc: Or Gerlitz; openib-general; Arlin Davis > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add > support > > for address and route retries, call disconnect when recving dreq > > > > Quoting r. Arlin Davis : > > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add > > support for address and route retries, call disconnect when recving dreq > > > > > > Sean Hefty wrote: > > > > > > >One option is having the SA (or ib_umad?) 
return a busy status in > > response to a > > > >MAD, but we'd still have to be able to send this response as quickly > as > > requests > > > >are being received. We could then limit the number of requests that > > would be > > > >queued in the kernel for a user. > > > > > > > > > > > > > > Another great option would be to have path record caching. > Unfortunately > > > OFED 1.1 did not include ib_local_sa in the release. > > > > > > > This won't help you much. > > With 256 nodes all to all already gives you 65000 requests > > which is the same order of magnitude as the reported 130000. > > We have SA caching working quite well with very large clusters. Here are > some techniques which make it much more efficient: > > 1. A given node only cares about path records relevant to it. So only ask > for path records where it is the source. > 2. Use SA notices for GID in/out of service to trigger cache updates, and > only then for the specific GID which has changed > - as background, refresh all cache entrys slowly and infrequently > just in case the notice was lost, however IBTA does allow retries and Acks > of notices so this will be infrequent > 3. limit number of outstanding SA queries from a given node, this avoids 1 > node blasting the SM > There a little more to it, but that should be the main points relevant to > this discussion. > > Todd Rimmer resending, bounced due to email address change. From todd.rimmer at qlogic.com Mon Nov 6 09:11:21 2006 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Mon, 6 Nov 2006 11:11:21 -0600 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061190BD8EE@EPEXCH2.qlogic.org> > From: Todd Rimmer > Sent: Thursday, November 02, 2006 7:15 PM > To: 'Michael S. 
Tsirkin'; Hal Rosenstock > Cc: Or Gerlitz; openib-general; Arlin R Davis > Subject: RE: [openib-general] scaling issues, was: uDAPL cma: add support > for address and route retries, call disconnect when recving dreq > > > From: Michael S. Tsirkin > > Sent: Thursday, November 02, 2006 6:15 PM > > To: Hal Rosenstock > > Cc: Or Gerlitz; openib-general; Arlin R Davis > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add > support > > for address and route retries, call disconnect when recving dreq > > > > Quoting r. Hal Rosenstock : > > > Subject: Re: scaling issues, was: uDAPL cma: add support for address > and > > route retries, call disconnect when recving dreq > > > > > > On Thu, 2006-11-02 at 17:54, Michael S. Tsirkin wrote: > > > > Quoting r. Arlin Davis : > > > > > Subject: Re: [openib-general] scaling issues, was: uDAPL cma: add > > support for address and route retries, call disconnect when recving dreq > > > > > > > > > > Sean Hefty wrote: > > > > > > > > > > >One option is having the SA (or ib_umad?) return a busy status in > > response to a > > > > > >MAD, but we'd still have to be able to send this response as > > quickly as requests > > > > > >are being received. We could then limit the number of requests > > that would be > > > > > >queued in the kernel for a user. > > > > > > > > > > > > > > > > > > > > > > Another great option would be to have path record caching. > > Unfortunately > > > > > OFED 1.1 did not include ib_local_sa in the release. > > > > > > > > > > > > > This won't help you much. > > > > With 256 nodes all to all already gives you 65000 requests > > > > which is the same order of magnitude as the reported 130000. > > > > > > The requests might occur at a different time so they could be spread > out > > > rather than synchronized. > > > > I don't see how caching does this. > > > If all the queries are made at app startup, there will be one huge batch > of queries to the SA, especially for a many process MPI job. 
> > In contrast if SA caching is building its own replica of the relevant > subset of the SA, the pace can be more controlled. It can even be > purposely randomized by the SA cache code itself (eg. don't just do it > every 10 minutes, do it every 10 minutes +/- a random number, etc). This > way if all nodes powered on at similar time you won't have a pattern of > everyone asking SM at the same time. > > Todd Rimmer resending, bounced due to email address change. From S.Linev at gsi.de Mon Nov 6 09:17:26 2006 From: S.Linev at gsi.de (Linev Sergei) Date: Mon, 6 Nov 2006 18:17:26 +0100 Subject: [openib-general] problem with libibverbs and ib_rdma_bw test Message-ID: <5AF41FB491229F4687F4DAA4456D0BD5588E8A@W2K3MAILSV.gsi.de> Hi, > > Did you follow these directions from the libibverbs README? > > -- > https://openfabrics.org/svn/gen2/trunk/src/userspace/libibverbs/README > > > To use IB verbs from userspace, a process must also have permission to > tell the kernel to lock sufficient memory for all of your registered > memory regions as well as the memory used internally by IB resources > such as queue pairs (QPs) and completion queues (CQs). To check your > resource limits, use the command > > ulimit -l > > (or "limit memorylocked" for csh-like shells). > > If you see a small number such as 32 (the units are KB) then you will > need to increase this limit. This is usually done for ordinary users > via the file /etc/security/limits.conf. More configuration may be > necessary if you are logging in via OpenSSH and your sshd is > configured to use privilege separation. > > > On Mon, 6 Nov 2006, Linev Sergei wrote: > > > Hi > > > > I was trying to install OFED 1.1 on SuSE 9.3 Linux > (2.6.11.4-20a-smp). > > We are using Opterons with Mellanox MHES18-XT PCIe host adapters. > > Previousely we were using IB Gold 1.8.0 and mostly working > with uDAPL. > > > > Now I trying uDAPL with OFED and find out, that it is not > working for me. 
> > Actually, I see the same problem as it was reported here: > > > > http://openib.org/pipermail/openib-general/2006-October/028077.html > > > > I was trying to find out a place, where it reports a > problem and was able to trace down to > > dapls_ib_mr_register() call, where ibv_reg_mr() returns > zero handle (openib_cma/dapl_ib_mem.c, line 197) > > > > According to recomendation in following mail, > > http://openib.org/pipermail/openib-general/2006-October/028107.html > > > > I was trying to trace if ibverbs interface is working. > > And I find out that basic latency/bandwidth tests are > working, but when I try to run > > and rdma-based tests, I immidiately see a problem. For > instance, when I call on node01 > > ib_rdma_bw, and on node02 "ib_rdma_bw node01", I see on > both nodes same error message: > > > > node01> ib_rdma_bw > > 29808: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 > | iters=1000 | duplex=0 | cma=0 > > 29808:pp_init_ctx: Couldn't allocate MR > > > > Probably, it is well known problem and I can solve it just > by upgrading to the newest Linux? > > > > Any help is appreciated. > > > > Sergey Linev > > > > > > ########################################## > > Experiment Data Processing (EE) > > Gesellschaft für Schwerionenforschung (GSI) > > Planckstr. 1 > > D-64291 Darmstadt, Germany > > ########################################## > > > > It was a point! When I cahnge limits in limits.conf file, I can run most of my uDAPL code except disconnection of nodes (which not principal for me). I get error message: dapl/common/dapl_ep_free.c:114: dapl_ep_free: Assertion `ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED || ep_ptr->param.ep_state == DAT_EP_STATE_UNCONNECTED' failed. Thanks for the help. Only remark - how much memory I should specify to be on the safe side? For the moment I setup 4 MB, while 256 KB was not enough. But this is just for 4-nodes system with all-to-all connection. 
Do you have any suggestion, how I can calculate required memory space for 16- or 64-nodes cluster? Thanks again for the help! Sergey From jlentini at netapp.com Mon Nov 6 09:41:21 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 6 Nov 2006 12:41:21 -0500 (EST) Subject: [openib-general] problem with libibverbs and ib_rdma_bw test In-Reply-To: <5AF41FB491229F4687F4DAA4456D0BD5588E8A@W2K3MAILSV.gsi.de> References: <5AF41FB491229F4687F4DAA4456D0BD5588E8A@W2K3MAILSV.gsi.de> Message-ID: On Mon, 6 Nov 2006, Linev Sergei wrote: > When I cahnge limits in limits.conf file, I can run most of my uDAPL > code except disconnection of nodes (which not principal for me). I > get error message: > > dapl/common/dapl_ep_free.c:114: dapl_ep_free: Assertion `ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED || ep_ptr->param.ep_state == DAT_EP_STATE_UNCONNECTED' failed. Are you calling dat_ep_disconnect() before calling dat_ep_free()? > Thanks for the help. Only remark - how much memory I should specify > to be on the safe side? For the moment I setup 4 MB, while 256 KB > was not enough. But this is just for 4-nodes system with all-to-all > connection. Do you have any suggestion, how I can calculate required > memory space for 16- or 64-nodes cluster? I trust the RDMA applications I use to behave properly, so I set memorylocked to unlimited. From HNGUYEN at de.ibm.com Mon Nov 6 09:45:04 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Mon, 6 Nov 2006 18:45:04 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: Message-ID: Hi Roland! > > As Arnd stated I need to fix this ctor issue. Do you prefer me to resend > > all patches in proper format (non-mangled inline) or just this one bug fix? > I have the rest of the patches, so you just need to resend a fixed > version of this one. BTW see my previous response about > kmem_cache_zalloc() -- I think that's the best way to fix this. Yes. 
This makes sense to me. Will send this one patch soon. > In the future though if you can make a patch-sending script or > something that lets you avoid the attachments that would be great. Sure, will do. I just found out that I'm using an old version of kmail and its editor (all fancy editing is turned off, just plain text) just mangles leading and trailing tabs and spaces... Thanks Nam From mshefty at ichips.intel.com Mon Nov 6 10:07:00 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 06 Nov 2006 10:07:00 -0800 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <200611061225.19077.jackm@dev.mellanox.co.il> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <200611061225.19077.jackm@dev.mellanox.co.il> Message-ID: <454F79C4.6060205@ichips.intel.com> > When I unload the Infiniband driver I am getting a > kernel Oops (consistently at the same location, with the same > stack output). > > I am doing the driver unload immediately after reboot > (the boot process loads the infiniband driver). > I am not running opensm. > > Looks to me like a reference counting problem. This could be an issue resulting from integrating ib_multicast into the ib_sa module. I have not seen this issue, but I will see if I can reproduce the problem. - Sean From mst at mellanox.co.il Mon Nov 6 10:17:54 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Nov 2006 20:17:54 +0200 Subject: [openib-general] [PATCH 1/7 v2] for 2.6.20 ib/ib_sa: add tracking of multicast join / leave requests In-Reply-To: <454F78B5.6010704@ichips.intel.com> References: <454F78B5.6010704@ichips.intel.com> Message-ID: <20061106181754.GF31647@mellanox.co.il> Quoting r. Sean Hefty : > The difference in the APIs is that the > multicast API requires exactly 1 call to leave the group after the call to join. Hmm, I see.
-- MST From ralph.campbell at qlogic.com Mon Nov 6 10:21:42 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 06 Nov 2006 10:21:42 -0800 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: <20061105152244.GC14245@mellanox.co.il> References: <1162506776.29948.572.camel@brick.pathscale.com> <20061105152244.GC14245@mellanox.co.il> Message-ID: <1162837302.29948.631.camel@brick.pathscale.com> On Sun, 2006-11-05 at 17:22 +0200, Michael S. Tsirkin wrote: > Quoting r. Ralph Campbell : > > diff -r f37bd0e41fec drivers/infiniband/ulp/ipoib/ipoib_ib.c > > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 21:44:41 2006 +0700 > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 12:37:09 2006 -0800 > > @@ -109,9 +109,8 @@ static int ipoib_ib_post_receive(struct > > ret = ib_post_recv(priv->qp, &param, &bad_wr); > > if (unlikely(ret)) { > > ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); > > - dma_unmap_single(priv->ca->dma_device, > > - priv->rx_ring[id].mapping, > > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > + ib_dma_unmap_single(priv->ca, priv->rx_ring[id].mapping, > > + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > dev_kfree_skb_any(priv->rx_ring[id].skb); > > priv->rx_ring[id].skb = NULL; > > } > > Hmm, since ib_dma_unmap_single calls a function through a pointer, > this seems to introduce overhead on data path operations in ipoib. > For apps like ipoib always working with low memory, I think it is important to avoid this > overhead of extra indirect function calls at least on systems without IO MMU - > where e.g. dma_unmap_single is empty. > This probably means you need some of architecture-dependent code, > but should be possible - look at how dma API is implemented for an example. > And this applies to all ULPs on systems without high memory.
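[Editorial sketch] The concern quoted above is the cost of calling through a pointer on every unmap. One way a wrapper of this kind can keep the common case cheap is to test an optional per-device ops table for NULL and take the direct path when no override is installed. The sketch below illustrates only that dispatch pattern in plain C; every name in it (device_sketch, dma_ops_sketch, sketch_dma_unmap_single, and so on) is hypothetical and none of it is the actual ib_dma_*() code.

```c
#include <stddef.h>

/* Hypothetical override table, installed only by drivers that cannot
 * use the ordinary DMA mapping path. */
struct dma_ops_sketch {
    void (*unmap_single)(void *dev, unsigned long addr, size_t size);
};

struct device_sketch {
    const struct dma_ops_sketch *dma_ops; /* NULL for ordinary devices */
    int direct_calls;                     /* counts direct-path unmaps */
    int override_calls;                   /* counts unmaps via the ops table */
};

/* Stand-in for the normal, possibly empty, unmap routine. */
static void direct_unmap(struct device_sketch *dev, unsigned long addr,
                         size_t size)
{
    (void)addr;
    (void)size;
    dev->direct_calls++;
}

/* Example override that a special driver might install. */
static void counting_unmap(void *dev, unsigned long addr, size_t size)
{
    (void)addr;
    (void)size;
    ((struct device_sketch *)dev)->override_calls++;
}

/* Wrapper: one test and branch; the indirect call happens only when
 * an override table is present. */
static inline void sketch_dma_unmap_single(struct device_sketch *dev,
                                           unsigned long addr, size_t size)
{
    if (dev->dma_ops)
        dev->dma_ops->unmap_single(dev, addr, size);
    else
        direct_unmap(dev, addr, size);
}
```

With dma_ops left NULL the wrapper costs a pointer test before the direct call; only a device that installs counting_unmap (or a similar override) pays for the indirect call.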
I did try asking the kernel folks if it would be acceptable to introduce an architecture dependent way for device drivers to interpose on the dma_*() functions but was rejected. The open fabrics middle layer (IPoIB, RDS, SDP, SRP, etc.) is supposed to be architecture neutral. The only way I could think of to meet both requirements is this sort of function indirection. Note that for mthca and ehca, there isn't a function call indirection. The ib_dma_*() functions are inline and only test the ibdev->dma_ops for NULL before calling the dma_*() inline functions. From ralph.campbell at qlogic.com Mon Nov 6 10:13:07 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 06 Nov 2006 10:13:07 -0800 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: <454D95D1.9050207@mellanox.co.il> References: <1162506776.29948.572.camel@brick.pathscale.com> <454D95D1.9050207@mellanox.co.il> Message-ID: <1162836787.29948.623.camel@brick.pathscale.com> There is a very slight overhead since there is a test and branch. Given modern CPU architecture, this is in the noise. On Sun, 2006-11-05 at 09:42 +0200, Eitan Zahavi wrote: > Hi Ralph, > > Is there any performance penalty for using the IB version of the DMA > mapping functions? > > Thanks > > Eitan > > Ralph Campbell wrote: > > IB/ipoib - Use the new verbs DMA mapping functions > > > > This patch converts IPoIB to use the new DMA mapping functions > > for kernel verbs consumers.
> > > > From: Ralph Campbell > > > > diff -r f37bd0e41fec drivers/infiniband/ulp/ipoib/ipoib_ib.c > > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 21:44:41 2006 +0700 > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c Thu Oct 26 12:37:09 2006 -0800 > > @@ -109,9 +109,8 @@ static int ipoib_ib_post_receive(struct > > ret = ib_post_recv(priv->qp, &param, &bad_wr); > > if (unlikely(ret)) { > > ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret); > > - dma_unmap_single(priv->ca->dma_device, > > - priv->rx_ring[id].mapping, > > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > + ib_dma_unmap_single(priv->ca, priv->rx_ring[id].mapping, > > + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > dev_kfree_skb_any(priv->rx_ring[id].skb); > > priv->rx_ring[id].skb = NULL; > > } > > @@ -136,10 +135,9 @@ static int ipoib_alloc_rx_skb(struct net > > */ > > skb_reserve(skb, 4); > > > > - addr = dma_map_single(priv->ca->dma_device, > > - skb->data, IPOIB_BUF_SIZE, > > - DMA_FROM_DEVICE); > > - if (unlikely(dma_mapping_error(addr))) { > > + addr = ib_dma_map_single(priv->ca, skb->data, IPOIB_BUF_SIZE, > > + DMA_FROM_DEVICE); > > + if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { > > dev_kfree_skb_any(skb); > > return -EIO; > > } > > @@ -193,8 +191,8 @@ static void ipoib_ib_handle_rx_wc(struct > > ipoib_warn(priv, "failed recv event " > > "(status=%d, wrid=%d vend_err %x)\n", > > wc->status, wr_id, wc->vendor_err); > > - dma_unmap_single(priv->ca->dma_device, addr, > > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > + ib_dma_unmap_single(priv->ca, addr, > > + IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > dev_kfree_skb_any(skb); > > priv->rx_ring[wr_id].skb = NULL; > > return; > > @@ -212,8 +210,7 @@ static void ipoib_ib_handle_rx_wc(struct > > ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n", > > wc->byte_len, wc->slid); > > > > - dma_unmap_single(priv->ca->dma_device, addr, > > - IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > + ib_dma_unmap_single(priv->ca, addr, IPOIB_BUF_SIZE, DMA_FROM_DEVICE); > > >
> skb_put(skb, wc->byte_len); > > skb_pull(skb, IB_GRH_BYTES); > > @@ -261,10 +258,8 @@ static void ipoib_ib_handle_tx_wc(struct > > > > tx_req = &priv->tx_ring[wr_id]; > > > > - dma_unmap_single(priv->ca->dma_device, > > - pci_unmap_addr(tx_req, mapping), > > - tx_req->skb->len, > > - DMA_TO_DEVICE); > > + ib_dma_unmap_single(priv->ca, pci_unmap_addr(tx_req, mapping), > > + tx_req->skb->len, DMA_TO_DEVICE); > > > > ++priv->stats.tx_packets; > > priv->stats.tx_bytes += tx_req->skb->len; > > @@ -353,9 +348,9 @@ void ipoib_send(struct net_device *dev, > > */ > > tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)]; > > tx_req->skb = skb; > > - addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len, > > - DMA_TO_DEVICE); > > - if (unlikely(dma_mapping_error(addr))) { > > + addr = ib_dma_map_single(priv->ca, skb->data, skb->len, > > + DMA_TO_DEVICE); > > + if (unlikely(ib_dma_mapping_error(priv->ca, addr))) { > > ++priv->stats.tx_errors; > > dev_kfree_skb_any(skb); > > return; > > @@ -366,8 +361,7 @@ void ipoib_send(struct net_device *dev, > > address->ah, qpn, addr, skb->len))) { > > ipoib_warn(priv, "post_send failed\n"); > > ++priv->stats.tx_errors; > > - dma_unmap_single(priv->ca->dma_device, addr, skb->len, > > - DMA_TO_DEVICE); > > + ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE); > > dev_kfree_skb_any(skb); > > } else { > > dev->trans_start = jiffies; > > @@ -537,24 +531,28 @@ int ipoib_ib_dev_stop(struct net_device > > while ((int) priv->tx_tail - (int) priv->tx_head < 0) { > > tx_req = &priv->tx_ring[priv->tx_tail & > > (ipoib_sendq_size - 1)]; > > - dma_unmap_single(priv->ca->dma_device, > > - pci_unmap_addr(tx_req, mapping), > > - tx_req->skb->len, > > - DMA_TO_DEVICE); > > + ib_dma_unmap_single(priv->ca, > > + pci_unmap_addr(tx_req, > > + mapping), > > + tx_req->skb->len, > > + DMA_TO_DEVICE); > > dev_kfree_skb_any(tx_req->skb); > > ++priv->tx_tail; > > } > > > > - for (i = 0; i < ipoib_recvq_size; ++i) > > - if 
(priv->rx_ring[i].skb) { > > - dma_unmap_single(priv->ca->dma_device, > > - pci_unmap_addr(&priv->rx_ring[i], > > - mapping), > > - IPOIB_BUF_SIZE, > > - DMA_FROM_DEVICE); > > - dev_kfree_skb_any(priv->rx_ring[i].skb); > > - priv->rx_ring[i].skb = NULL; > > - } > > + for (i = 0; i < ipoib_recvq_size; ++i) { > > + struct ipoib_rx_buf *rx_req; > > + > > + rx_req = &priv->rx_ring[i]; > > + if (!rx_req->skb) > > + continue; > > + ib_dma_unmap_single(priv->ca, > > + rx_req->mapping, > > + IPOIB_BUF_SIZE, > > + DMA_FROM_DEVICE); > > + dev_kfree_skb_any(rx_req->skb); > > + rx_req->skb = NULL; > > + } > > > > goto timeout; > > } > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Mon Nov 6 10:23:51 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 10:23:51 -0800 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: <20061106180443.GD31647@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 6 Nov 2006 20:04:43 +0200") References: <20061106180443.GD31647@mellanox.co.il> Message-ID: > For example, ib_send_wr/ib_recv_wr could have an *optional* "void *data" field. > And we could have a rule that ULP must *either* pass in the void *data, and > do dma mappings through the usual dma API, or go through the ib_dma mappings (or both). This is a good direction to look at, because it makes "inline data" more useful. 
However I think it just shifts the cost elsewhere, because now low-level drivers need to have a conditional test for every gather/scatter entry to see which case is being used. So I'm not sure it's really a net win. - R. From mshefty at ichips.intel.com Mon Nov 6 10:02:29 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 06 Nov 2006 10:02:29 -0800 Subject: [openib-general] [PATCH 1/7 v2] for 2.6.20 ib/ib_sa: add tracking of multicast join / leave requests In-Reply-To: <20061106103544.GC29344@mellanox.co.il> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <000101c6f7bc$c0a46770$a6d4180a@amr.corp.intel.com> <20061106103544.GC29344@mellanox.co.il> Message-ID: <454F78B5.6010704@ichips.intel.com> > So instead of ib_sa_mcmember_rec_set which returned the query by pointer we now > have ib_sa_join_multicast which returns the pointer. This part looks OK I guess, but > I still do not understand why does the patch tinker with logic (e.g. > setting/clearing IPOIB_MCAST_FLAG_BUSY) in the IPoIB code. The use of the flag is still there. The difference in the APIs is that the multicast API requires exactly 1 call to leave the group after the call to join. The ib_multicast interface is simple. A user calls ib_join_multicast, followed by ib_free_multicast. What issues do you see with this interface? > *All* the new API was supposed to do is add reference counting on top of > join/leave queries. So why the need to rework the logic? Is it possible that > what is missing is the analog of ib_sa_cancel_query - a non-blocking call which > would guarantee that join callback is invoked soon? If yes, I think we should > just add that to the new API. Focus on the ib_multicast code first. Then look at the ipoib changes. The ib_multicast code does add reference counting on top of the existing ib_sa APIs. Join increments the reference count. Free decrements it. The "ib_sa_cancel_query" call that you're wanting is ib_free_multicast. 
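[Editorial sketch] The reply above describes the new interface as reference counting layered on the existing SA join/leave queries: join takes a reference, free drops it. A minimal userspace sketch of that bookkeeping follows. It assumes, as an illustration only, that the first join is what triggers an SA join request and the last free is what triggers the leave; struct mcast_group, mcast_join() and mcast_free() are hypothetical names, not the ib_multicast API.

```c
/* Hypothetical per-group state; the real ib_multicast structures differ. */
struct mcast_group {
    int refcount;   /* local users currently holding the group */
    int sa_joins;   /* join requests that would be sent to the SA */
    int sa_leaves;  /* leave requests that would be sent to the SA */
};

/* Join: take a reference; only the first user triggers an SA join. */
static void mcast_join(struct mcast_group *g)
{
    if (g->refcount++ == 0)
        g->sa_joins++;
}

/* Free: drop a reference; only the last user triggers an SA leave.
 * A free without a matching join is ignored in this sketch. */
static void mcast_free(struct mcast_group *g)
{
    if (g->refcount > 0 && --g->refcount == 0)
        g->sa_leaves++;
}
```

Under these assumptions each successful join must be paired with exactly one free, which matches the "exactly 1 call to leave the group after the call to join" rule stated in the thread.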
> Sean, it seems like you are trying to push some unrelated re-factoring in the IPoIB > code, which might be fine, but should be done separately from the update to > the new API - could be before, or after the multicast change. There should not be any unrelated changes to the ipoib code. The changes are a result of using the new ib_multicast API and its restrictions. - Sean From mst at mellanox.co.il Mon Nov 6 10:04:43 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Nov 2006 20:04:43 +0200 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: References: Message-ID: <20061106180443.GD31647@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions > > > Hmm, since ib_dma_unmap_single calls a function through a pointer, > > this seems to introduce overhead on data path operations in ipoib. > > For apps like ipoib always working with low memory, I think it is important > > to avoid this overhead of extra indirect function calls at least on systems > > without IO MMU - where e.g. dma_unmap_single is empty. > > This probably means you need some of architecture-dependent code, > > but should be possible - look at how dma API is implemented for an example. > > And this applies to all ULPs on systems without high memory. > > How is this possible? > The IOMMU might be detected at runtime, E.g. on i386 dma_unmap_single seems to always be empty, I don't think it can be added at runtime. But I agree x86_64 is more important. > and you can always have a system with multiple HCAs of different types, so I > don't see how the conditional can be avoided. It is unfortunate but in this > case I think we have to accept the cost of making the code general. Well, in general case you are right, of course, and the problem is not solvable. But consider IPoIB, or any ULP that deals with low memory only, or mostly. 
It already has the data virtual address - so, why isn't it possible to pass that down to verbs somehow, and, in this case, avoid the extra overhead? For example, ib_send_wr/ib_recv_wr could have an *optional* "void *data" field. And we could have a rule that ULP must *either* pass in the void *data, and do dma mappings through the usual dma API, or go through the ib_dma mappings (or both). This would 1. avoid overhead for that ULP, as it would pass in real dma addresses and ipath can simply ignore them and use the data pointer instead. 2. allow optimisations such as inline data for HCAs that support both dma and copy modes > > It is sad that ipath is likely the only driver that will ever use > this. Maybe something that the speed-freaks would like would be to > add a hidden config option that turns all the ib_dma_xxx stuff into > NOP macros unless ipath is being built. Of course that doesn't help > all that much because all the distros etc will enable ipath. I think the most generic HCA is capable of both DMA and direct copy by driver. So, how about implementing something like the proposal above so that ipath is *not* the only driver to use this? > Anyway, I suspect the penalty is near-zero anyway, since the pointer > being tested will likely be in cache and the branch predictor will > learn which way the branch goes. (Except on a heterogeneous system I > suppose) Hmm. Maybe. Some numbers demonstrating this for e.g. ipoib might be useful. -- MST From mst at mellanox.co.il Mon Nov 6 11:02:16 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 6 Nov 2006 21:02:16 +0200 Subject: [openib-general] [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions In-Reply-To: References: Message-ID: <20061106190216.GB368@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH 3/7] IB/ipoib - Use the new verbs DMA mapping functions > > > For example, ib_send_wr/ib_recv_wr could have an *optional* "void *data" > > field. 
And we could have a rule that ULP must *either* pass in the void > > *data, and do dma mappings through the usual dma API, or go through the > > ib_dma mappings (or both). > > This is a good direction to look at, because it makes "inline data" > more useful. However I think it just shifts the cost elsewhere, > because now low-level drivers need to have a conditional test for > every gather/scatter entry to see which case is being used. I agree there's same cost to the conditional, but at least all drivers get some benefit in return as opposed to getting penalized for ipath limitations with nothing in return. It might be enough to make it per-wr, not per-s/g entry - ULPs that use s/g likely are not using low memory. > So I'm not sure it's really a net win. Well, this is not different from testing the "inline" flag. So for very short messages I have tests (in user level) to show this is a net win with respect to latency. -- MST From fwang2 at ornl.gov Mon Nov 6 10:13:14 2006 From: fwang2 at ornl.gov (Wang, Feiyi) Date: Mon, 06 Nov 2006 13:13:14 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <1162587457.15232.71447.camel@hal.voltaire.com> Message-ID: <537C6C0940C6C143AA46A88946B8541705109FE1@ORNLEXCHANGE.ornl.gov> Hal - Please see the output for active port 1 (although there are two ports on this HCA, the second one is disabled now). 
#smpquery portinfo 8 1 # Port info: Lid 8 port 1 Mkey:............................0x0000000000000000 GidPrefix:.......................0xfe80000000000000 Lid:.............................0x0008 SMLid:...........................0x0001 CapMask:.........................0x2510a68 IsTrapSupported IsAutomaticMigrationSupported IsSLMappingSupported IsLedInfoSupported IsSystemImageGUIDsupported IsCommunicatonManagementSupported IsVendorClassSupported IsCapabilityMaskNoticeSupported IsClientRegistrationSupported DiagCode:........................0x0000 MkeyLeasePeriod:.................0 LocalPort:.......................1 LinkWidthEnabled:................1X or 4X LinkWidthSupported:..............1X or 4X LinkWidthActive:.................4X LinkSpeedSupported:..............2.5 or 5.0 Gbps LinkState:.......................Active PhysLinkState:...................LinkUp LinkDownDefState:................Polling ProtectBits:.....................0 LMC:.............................0 LinkSpeedActive:.................2.5 Gbps LinkSpeedEnabled:................2.5 or 5.0 Gbps NeighborMTU:.....................2048 SMSL:............................0 VLCap:...........................VL0-7 InitType:........................0x00 VLHighLimit:.....................255 VLArbHighCap:....................8 VLArbLowCap:.....................8 InitReply:.......................0x00 MtuCap:..........................2048 VLStallCount:....................7 HoqLife:.........................31 OperVLs:.........................VL0-7 PartEnforceInb:..................0 PartEnforceOutb:.................0 FilterRawInb:....................0 FilterRawOutb:...................0 MkeyViolations:..................0 PkeyViolations:..................0 QkeyViolations:..................0 GuidCap:.........................32 ClientReregister:................0 SubnetTimeout:...................18 RespTimeVal:.....................16 LocalPhysErr:....................8 OverrunErr:......................8 
MaxCreditHint:...................0 RoundTrip:.......................0 Feiyi -----Original Message----- From: Hal Rosenstock [mailto:halr at voltaire.com] Sent: Friday, November 03, 2006 3:58 PM To: Wang, Feiyi Cc: openib-general at openib.org Subject: RE: [openib-general] question on QoS support On Fri, 2006-11-03 at 15:56, Wang, Feiyi wrote: > 255 > > I think I tested with default 0 before, that is send at most one packet > before give low priority table the chance according to IBA. It doesn't > seem to make a difference though. I was hoping you would say 0 as that means 1 packet before looking at low priority. 255 means unbounded packets on high priority. Can you send me the results of smpquery portinfo on that port to ensure that it is being set properly ? -- Hal > Feiyi > > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, November 03, 2006 3:51 PM > To: Wang, Feiyi > Cc: openib-general at openib.org > Subject: RE: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > > The test is done on two hosts, say A and B. A has 4x SDR (run > ib_rdam_bw > > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > > clients). The sl2vl table read as: > > > > smpquery sl2vl 7 > > # SL2VL table: Lid 7 > > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15| > > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| > 7| > > > > smpquery vlarb 7 > > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > > # Low priority VL Arbitration Table: > > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > > # High priority VL Arbitration Table: > > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > > Low priority table entries are all zero to skip. > > High priority table give VL 0 and VL 2 different weight. 
> > > > The SL is specified on command line, one thread with SL 0, the other > > thread with SL 2. > > > > Thanks for looking into this, and let me know if more info is needed. > > What's the limit of high priority ? > > -- Hal > > > Feiyi > > > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Friday, November 03, 2006 3:27 PM > > To: Wang, Feiyi > > Cc: openib-general at openib.org > > Subject: Re: [openib-general] question on QoS support > > > > On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > > > In our test at the ORNL - it appears you can "turn off" the traffic > by > > > giving every VL weight 0. > > > > A weight of 0 indicates to skip that entry. > > > > > As soon as you assign non-zero VL weight, > > > the traffic starts to flow, however, VL with more weight doesn't > have > > > expected preference treatment. In other words, traffic shaping > didn't > > > take place. smpquery vlarb verified the mapping table was there. > > > > correctly ? > > > > Is it high or low priority or both ? > > > > What about SL2VLMapping table ? Is it setup correctly ? > > > > What's your topology for this ? > > > > Can you send your SL2VLMapping and VLarbitration configuration ? > > > > > I believe the scenario described below 'should' be able to generate > > > congestion point ... but it would be helpful if someone can > elaborate > > > a way to "look into" how/if scheduling/arbitration take place. > > > > The only ways I know would be to look at either the packets on the > wire > > or what you are doing with multiple streams which seems valid to me. > > > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > > understand how to configure this ? 
> > > > -- Hal > > > > > Best, > > > > > > Feiyi > > > > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock > > wrote: > > > > Hi Oliver, > > > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > > Hi, Hal - > > > > > > > > > > > How is this being observed/measured ? > > > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > > bandwidth. Two concurrent process report 714.45 MB/s each, dead > > even. > > > > > Now if I bump up one process with a different SL, then I expect > to > > see > > > > > shaping to take place. Please let me if the scenario makes > sense. > > > > > > > > It makes sense. However, if the higher priority traffic does not > > fill > > > > the scheduling, the low priority can take up the slack so I'm not > > sure > > > > if this is what you are seeing or something else. > > > > > > > > It might be interesting to try the same thing at SDR speeds. > > > > > > > > -- Hal > > > > > > > > > > Yes, 8 VLs should be supported in your subnet. You can verify > > this with > > > > > > smpquery portinfo on the HCA port and examine OperVLs assuming > > the port > > > > > > is ACTIVE. > > > > > > > > > > yes, I verified the data VL support, it is 8. I will poke for > more > > > > > info with suggested commands by Sasha. > > > > > > > > > > > > A related question is, if I modify qos setting in SM, do I > > need to > > > > > > > restart SA on each hosts for it to see the changes? (I am > > hoping not, > > > > > > > as I tried in the test, it doesn't seem to make a > difference) > > > > > > > > > > > > Not sure what you mean. SA is tightly coupled with the OpenSM. > > Do you > > > > > > mean SA client ? The client hosts don't need restarting but > did > > you > > > > > > restart OpenSM with your QoS configuration ? > > > > > > > > > > I mean client SA. yes, I understand OpenSM needs to be > restarted. 
> > > > > > > > > > > BTW, which OpenSM are you running ? > > > > > > > > > > OFED 1.1 based. > > > > > > > > > > thanks > > > > > > > > > > - Oliver > > > > > > > > > > > From benh at kernel.crashing.org Mon Nov 6 12:24:01 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 07 Nov 2006 07:24:01 +1100 Subject: [openib-general] Can't build drivers/infiniband/hw/ipath/ipath_keys.c on arch/powerpc In-Reply-To: References: <1162798399.8175.24.camel@localhost.localdomain> <1162800409.28571.298.camel@localhost.localdomain> Message-ID: <1162844641.28571.331.camel@localhost.localdomain> On Mon, 2006-11-06 at 08:52 -0800, Roland Dreier wrote: > > I'm surprised that something as recent as infiniband requires a > > long-deprecated function bus_to_virt(). > > > > What is it trying to do there that needs that call ? > > Don't ask -- just enjoy the fact that you don't know about this... Deal :) Ben. From hnguyen at linux.vnet.ibm.com Mon Nov 6 13:26:44 2006 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Ngyuen) Date: Mon, 6 Nov 2006 22:26:44 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode Message-ID: <200611062226.44939.hnguyen@linux.vnet.ibm.com> Hello Roland! Below is the patch with the "ctor" bug fix and Heiko's improvement to assure ehca_alloc_fw_ctrlblock() always returns (void*). Hope it is in proper format. Thanks! Nam This is a patch of ehca that assures 4k alignment for firmware control block in 64k page mode, because kzalloc()'s result address might not be 4k aligned if kernel's 64k page is enabled. Thus, we introduced wrappers called ehca_alloc/free_fw_ctrlblock(), which use a slab cache for objects with 4k length and 4k alignment in order to alloc/free firmware control blocks in 64k page mode. In 4k page mode those wrappers just are defines of get_zeroed_page() and free_page(). 
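To make the alignment requirement concrete, here is a small userspace analogue, offered only as a sketch. It is not the driver code: `alloc_ctrlblock`, `free_ctrlblock` and `is_ctrlblock_aligned` are hypothetical names, and C11 aligned_alloc() stands in for the 4k-sized, 4k-aligned slab cache used in the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CTRLBLOCK_SIZE  4096u  /* length of one firmware control block */
#define CTRLBLOCK_ALIGN 4096u  /* alignment the firmware interface expects */

/* True if p sits on a 4k boundary. */
static int is_ctrlblock_aligned(const void *p)
{
    return ((uintptr_t)p & (CTRLBLOCK_ALIGN - 1)) == 0;
}

/* Return a zeroed, 4k-aligned block of 4k bytes, or NULL on failure. */
static void *alloc_ctrlblock(void)
{
    void *p = aligned_alloc(CTRLBLOCK_ALIGN, CTRLBLOCK_SIZE);

    if (p) {
        assert(is_ctrlblock_aligned(p));
        memset(p, 0, CTRLBLOCK_SIZE);
    }
    return p;
}

/* Release a block obtained from alloc_ctrlblock(); NULL is allowed. */
static void free_ctrlblock(void *p)
{
    free(p);
}
```

A general-purpose allocator gives no such guarantee for a 4k request, which is the situation the wrappers are meant to avoid.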
Signed-off-by: Hoang-Nam Nguyen --- ehca_hca.c | 17 +++++++++-------- ehca_irq.c | 17 ++++++++--------- ehca_iverbs.h | 8 ++++++++ ehca_main.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++--------- ehca_mrmw.c | 8 ++++---- ehca_qp.c | 12 +++++++----- 6 files changed, 83 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 5eae6ac..f77e626 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -40,6 +40,7 @@ */ #include "ehca_tools.h" +#include "ehca_iverbs.h" #include "hcp_if.h" int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) @@ -49,7 +50,7 @@ int ehca_query_device(struct ib_device * ib_device); struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -96,7 +97,7 @@ int ehca_query_device(struct ib_device * = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); query_device1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -109,7 +110,7 @@ int ehca_query_port(struct ib_device *ib ib_device); struct hipz_query_port *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -162,7 +163,7 @@ int ehca_query_port(struct ib_device *ib props->active_speed = 0x1; query_port1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -178,7 +179,7 @@ int ehca_query_pkey(struct ib_device *ib return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -193,7 +194,7 @@ int ehca_query_pkey(struct ib_device 
*ib memcpy(pkey, &rblock->pkey_entries + index, sizeof(u16)); query_pkey1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -211,7 +212,7 @@ int ehca_query_gid(struct ib_device *ibd return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -227,7 +228,7 @@ int ehca_query_gid(struct ib_device *ibd memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); query_gid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 048cc44..01b66d7 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -45,6 +45,7 @@ #include "ehca_iverbs.h" #include "ehca_tools.h" #include "hcp_if.h" #include "hipz_fns.h" +#include "ipz_pt_fn.h" #define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) #define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) @@ -137,38 +138,36 @@ int ehca_error_data(struct ehca_shca *sh u64 *rblock; unsigned long block_count; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (u64*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Cannot allocate rblock memory."); ret = -ENOMEM; goto error_data1; } + /* rblock must be 4K aligned and should be 4K large */ ret = hipz_h_error_data(shca->ipz_hca_handle, resource, rblock, &block_count); - if (ret == H_R_STATE) { + if (ret == H_R_STATE) ehca_err(&shca->ib_device, "No error data is available: %lx.", resource); - } else if (ret == H_SUCCESS) { int length; length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); - if (length > PAGE_SIZE) - length = PAGE_SIZE; + if (length > EHCA_PAGESIZE) + length = EHCA_PAGESIZE; print_error_data(shca, data, rblock, length); - } - else { + } else ehca_err(&shca->ib_device, "Error data could not be fetched: %lx", resource); - } - kfree(rblock); 
+ ehca_free_fw_ctrlblock(rblock); error_data1: return ret; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 319c39d..7011aba 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -179,4 +179,12 @@ int ehca_mmap_register(u64 physical,void int ehca_munmap(unsigned long addr, size_t len); +#ifdef CONFIG_PPC_64K_PAGES +void *ehca_alloc_fw_ctrlblock(void); +void ehca_free_fw_ctrlblock(void *ptr); +#else +#define ehca_alloc_fw_ctrlblock() ((void*)get_zeroed_page(GFP_KERNEL)) +#define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) +#endif + #endif diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 024d511..a40871f 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -40,6 +40,9 @@ * POSSIBILITY OF SUCH DAMAGE. */ +#ifdef CONFIG_PPC_64K_PAGES +#include +#endif #include "ehca_classes.h" #include "ehca_iverbs.h" #include "ehca_mrmw.h" @@ -49,7 +52,7 @@ #include "hcp_if.h" MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0017"); +MODULE_VERSION("SVNEHCA_0018"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -94,11 +97,31 @@ spinlock_t ehca_cq_idr_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); + static struct list_head shca_list; /* list of all registered ehcas */ static spinlock_t shca_list_lock; static struct timer_list poll_eqs_timer; +#ifdef CONFIG_PPC_64K_PAGES +static struct kmem_cache *ctblk_cache = NULL; + +void *ehca_alloc_fw_ctrlblock(void) +{ + void *ret = kmem_cache_zalloc(ctblk_cache, SLAB_KERNEL); + if (!ret) + ehca_gen_err("Out of memory for ctblk"); + return ret; +} + +void ehca_free_fw_ctrlblock(void *ptr) +{ + if (ptr) + kmem_cache_free(ctblk_cache, ptr); + +} +#endif + static int ehca_create_slab_caches(void) { int ret; @@ -133,6 
+156,17 @@ static int ehca_create_slab_caches(void) goto create_slab_caches5; } +#ifdef CONFIG_PPC_64K_PAGES + ctblk_cache = kmem_cache_create("ehca_cache_ctblk", + EHCA_PAGESIZE, H_CB_ALIGNMENT, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ctblk_cache) { + ehca_gen_err("Cannot create ctblk SLAB cache."); + ehca_cleanup_mrmw_cache(); + goto create_slab_caches5; + } +#endif return 0; create_slab_caches5: @@ -157,6 +191,10 @@ static void ehca_destroy_slab_caches(voi ehca_cleanup_qp_cache(); ehca_cleanup_cq_cache(); ehca_cleanup_pd_cache(); +#ifdef CONFIG_PPC_64K_PAGES + if (ctblk_cache) + kmem_cache_destroy(ctblk_cache); +#endif } #define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) @@ -168,7 +206,7 @@ int ehca_sense_attributes(struct ehca_sh u64 h_ret; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_gen_err("Cannot allocate rblock memory."); return -ENOMEM; @@ -211,7 +249,7 @@ int ehca_sense_attributes(struct ehca_sh shca->sport[1].rate = IB_RATE_30_GBPS; num_ports1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -220,7 +258,7 @@ static int init_node_guid(struct ehca_sh int ret = 0; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -235,7 +273,7 @@ static int init_node_guid(struct ehca_sh memcpy(&shca->ib_device.node_guid, &rblock->node_guid, sizeof(u64)); init_node_guid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -431,7 +469,7 @@ static ssize_t ehca_show_##name(struct \ shca = dev->driver_data; \ \ - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); \ + rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); \ if (!rblock) { \ dev_err(dev, "Can't allocate rblock memory."); \ return 0; \ @@ -439,12 +477,12 @@ static ssize_t 
ehca_show_##name(struct \ if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { \ dev_err(dev, "Can't query device properties"); \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ return 0; \ } \ \ data = rblock->name; \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ \ if ((strcmp(#name, "num_ports") == 0) && (ehca_nr_ports == 1)) \ return snprintf(buf, 256, "1\n"); \ @@ -752,7 +790,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0017)\n"); + "(Rel.: SVNEHCA_0018)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 5ca6544..1b77ac7 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -1013,7 +1013,7 @@ int ehca_reg_mr_rpages(struct ehca_shca u32 i; u64 *kpage; - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = (u64*)ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1092,7 +1092,7 @@ int ehca_reg_mr_rpages(struct ehca_shca ehca_reg_mr_rpages_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p " @@ -1124,7 +1124,7 @@ inline int ehca_rereg_mr_rereg1(struct e ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = (u64*)ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1181,7 +1181,7 @@ inline int ehca_rereg_mr_rereg1(struct e } ehca_rereg_mr_rereg1_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x " diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c 
b/drivers/infiniband/hw/ehca/ehca_qp.c index 4394123..3768d8d 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -811,8 +811,9 @@ static int internal_modify_qp(struct ib_ unsigned long spl_flags = 0; /* do query_qp to obtain current attr values */ - mqpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); - if (mqpcb == NULL) { + mqpcb = (struct hcp_modify_qp_control_block*) + ehca_alloc_fw_ctrlblock(); + if (!mqpcb) { ehca_err(ibqp->device, "Could not get zeroed page for mqpcb " "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); return -ENOMEM; @@ -1225,7 +1226,7 @@ modify_qp_exit2: } modify_qp_exit1: - kfree(mqpcb); + ehca_free_fw_ctrlblock(mqpcb); return ret; } @@ -1277,7 +1278,8 @@ int ehca_query_qp(struct ib_qp *qp, return -EINVAL; } - qpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL ); + qpcb = (struct hcp_modify_qp_control_block*) + ehca_alloc_fw_ctrlblock(); if (!qpcb) { ehca_err(qp->device,"Out of memory for qpcb " "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); @@ -1401,7 +1403,7 @@ int ehca_query_qp(struct ib_qp *qp, ehca_dmp(qpcb, 4*70, "qp_num=%x", qp->qp_num); query_qp_exit1: - kfree(qpcb); + ehca_free_fw_ctrlblock(qpcb); return ret; } From arnd at arndb.de Mon Nov 6 13:35:28 2006 From: arnd at arndb.de (Arnd Bergmann) Date: Mon, 6 Nov 2006 22:35:28 +0100 Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode In-Reply-To: <200611062226.44939.hnguyen@linux.vnet.ibm.com> References: <200611062226.44939.hnguyen@linux.vnet.ibm.com> Message-ID: <200611062235.28667.arnd@arndb.de> On Monday 06 November 2006 22:26, Hoang-Nam Ngyuen wrote: > -       rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); > +       rblock = (struct hipz_query_hca*)ehca_alloc_fw_ctrlblock(); >   > -       rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); > +       rblock = (struct hipz_query_port*)ehca_alloc_fw_ctrlblock(); The point Heiko made in his comment is that with ehca_alloc_fw_ctrlblock returning a void*, 
you can (and _should_) remove the casts to other pointer types. Arnd <>< From adit.262 at gmail.com Mon Nov 6 14:28:07 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Mon, 6 Nov 2006 17:28:07 -0500 Subject: [openib-general] Error starting IB HCA while Linux Boot Message-ID: Hi, I have a problem getting the IB hardware to start during boot. I am running a Red Hat Linux machine with Pentium D processors. Here is the output of the dmesg command for the ib hardware: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005) ib_mthca: Initializing 0000:02:00.0 ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM ib_mthca 0000:02:00.0: SYS_EN returned status 0x07, aborting. ib_mthca: probe of 0000:02:00.0 failed with error -22 Here are the details of the IB hardware from the lspci command: 02:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0) Any idea why the IB hardware fails to start, and is there a solution for this? Thanks, Adit Ranadive Georgia Institute of Technology, Atlanta, GA From halr at voltaire.com Mon Nov 6 14:40:28 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 06 Nov 2006 17:40:28 -0500 Subject: [openib-general] question on QoS support In-Reply-To: <537C6C0940C6C143AA46A88946B8541705109FE1@ORNLEXCHANGE.ornl.gov> References: <537C6C0940C6C143AA46A88946B8541705109FE1@ORNLEXCHANGE.ornl.gov> Message-ID: <1162852827.25771.1681.camel@hal.voltaire.com> On Mon, 2006-11-06 at 13:13, Wang, Feiyi wrote: > Hal - > > Please see the output for active port 1 (although there are two ports on > this HCA, the second one is disabled now).
> > #smpquery portinfo 8 1 > # Port info: Lid 8 port 1 > Mkey:............................0x0000000000000000 > GidPrefix:.......................0xfe80000000000000 > Lid:.............................0x0008 > SMLid:...........................0x0001 > CapMask:.........................0x2510a68 > IsTrapSupported > IsAutomaticMigrationSupported > IsSLMappingSupported > IsLedInfoSupported > IsSystemImageGUIDsupported > IsCommunicatonManagementSupported > IsVendorClassSupported > IsCapabilityMaskNoticeSupported > IsClientRegistrationSupported > DiagCode:........................0x0000 > MkeyLeasePeriod:.................0 > LocalPort:.......................1 > LinkWidthEnabled:................1X or 4X > LinkWidthSupported:..............1X or 4X > LinkWidthActive:.................4X > LinkSpeedSupported:..............2.5 or 5.0 Gbps > LinkState:.......................Active > PhysLinkState:...................LinkUp > LinkDownDefState:................Polling > ProtectBits:.....................0 > LMC:.............................0 > LinkSpeedActive:.................2.5 Gbps > LinkSpeedEnabled:................2.5 or 5.0 Gbps > NeighborMTU:.....................2048 > SMSL:............................0 > VLCap:...........................VL0-7 > InitType:........................0x00 > VLHighLimit:.....................255 OK; this is pretty conclusive. 
> VLArbHighCap:....................8 > VLArbLowCap:.....................8 > InitReply:.......................0x00 > MtuCap:..........................2048 > VLStallCount:....................7 > HoqLife:.........................31 > OperVLs:.........................VL0-7 > PartEnforceInb:..................0 > PartEnforceOutb:.................0 > FilterRawInb:....................0 > FilterRawOutb:...................0 > MkeyViolations:..................0 > PkeyViolations:..................0 > QkeyViolations:..................0 > GuidCap:.........................32 > ClientReregister:................0 > SubnetTimeout:...................18 > RespTimeVal:.....................16 > LocalPhysErr:....................8 > OverrunErr:......................8 > MaxCreditHint:...................0 > RoundTrip:.......................0 Do you have an IB analyzer ? -- Hal > Feiyi > > -----Original Message----- > From: Hal Rosenstock [mailto:halr at voltaire.com] > Sent: Friday, November 03, 2006 3:58 PM > To: Wang, Feiyi > Cc: openib-general at openib.org > Subject: RE: [openib-general] question on QoS support > > On Fri, 2006-11-03 at 15:56, Wang, Feiyi wrote: > > 255 > > > > I think I tested with default 0 before, that is send at most one > packet > > before give low priority table the chance according to IBA. It doesn't > > seem to make a difference though. > > I was hoping you would say 0 as that means 1 packet before looking at > low priority. > > 255 means unbounded packets on high priority. Can you send me the > results of smpquery portinfo on that port to ensure that it is being set > properly ? > > -- Hal > > > Feiyi > > > > > > -----Original Message----- > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > Sent: Friday, November 03, 2006 3:51 PM > > To: Wang, Feiyi > > Cc: openib-general at openib.org > > Subject: RE: [openib-general] question on QoS support > > > > On Fri, 2006-11-03 at 15:43, Wang, Feiyi wrote: > > > The test is done on two hosts, say A and B. 
A has 4x SDR (run > > ib_rdam_bw > > > as server), B has 4x DDR (run more than one thread of ib_rdma_bw as > > > clients). The sl2vl table read as: > > > > > > smpquery sl2vl 7 > > > # SL2VL table: Lid 7 > > > # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| > > 9|10|11|12|13|14|15| > > > ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| > > 7| > > > > > > smpquery vlarb 7 > > > # VLArbitration tables: Lid 7 port 0 LowCap 8 HighCap 8 > > > # Low priority VL Arbitration Table: > > > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > > WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > # High priority VL Arbitration Table: > > > VL : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 | > > > WEIGHT: |0x1 |0x0 |0x8 |0x0 |0x0 |0x0 |0x0 |0x0 | > > > > > > Low priority table entries are all zero to skip. > > > High priority table give VL 0 and VL 2 different weight. > > > > > > The SL is specified on command line, one thread with SL 0, the other > > > thread with SL 2. > > > > > > Thanks for looking into this, and let me know if more info is > needed. > > > > What's the limit of high priority ? > > > > -- Hal > > > > > Feiyi > > > > > > > > > > > > -----Original Message----- > > > From: Hal Rosenstock [mailto:halr at voltaire.com] > > > Sent: Friday, November 03, 2006 3:27 PM > > > To: Wang, Feiyi > > > Cc: openib-general at openib.org > > > Subject: Re: [openib-general] question on QoS support > > > > > > On Fri, 2006-11-03 at 15:12, Feiyi Wang wrote: > > > > In our test at the ORNL - it appears you can "turn off" the > traffic > > by > > > > giving every VL weight 0. > > > > > > A weight of 0 indicates to skip that entry. > > > > > > > As soon as you assign non-zero VL weight, > > > > the traffic starts to flow, however, VL with more weight doesn't > > have > > > > expected preference treatment. In other words, traffic shaping > > didn't > > > > take place. smpquery vlarb verified the mapping table was there. > > > > > > correctly ? 
> > > > > > Is it high or low priority or both ? > > > > > > What about SL2VLMapping table ? Is it setup correctly ? > > > > > > What's your topology for this ? > > > > > > Can you send your SL2VLMapping and VLarbitration configuration ? > > > > > > > I believe the scenario described below 'should' be able to > generate > > > > congestion point ... but it would be helpful if someone can > > elaborate > > > > a way to "look into" how/if scheduling/arbitration take place. > > > > > > The only ways I know would be to look at either the packets on the > > wire > > > or what you are doing with multiple streams which seems valid to me. > > > > > > Have you read section 7.6.9.2 (p. 189-190) in IBA 1.2 volume 1 to > > > understand how to configure this ? > > > > > > -- Hal > > > > > > > Best, > > > > > > > > Feiyi > > > > > > > > > > > > On 02 Nov 2006 10:49:04 -0500, Hal Rosenstock > > > wrote: > > > > > Hi Oliver, > > > > > > > > > > On Thu, 2006-11-02 at 10:20, Oliver wrote: > > > > > > Hi, Hal - > > > > > > > > > > > > > How is this being observed/measured ? > > > > > > > > > > > > Host A, B, with 4x DDR both connected to Flextronic switch. > > > > > > A single process of ibv_read_bw gives about 1415MB /s average > > > > > > bandwidth. Two concurrent process report 714.45 MB/s each, > dead > > > even. > > > > > > Now if I bump up one process with a different SL, then I > expect > > to > > > see > > > > > > shaping to take place. Please let me if the scenario makes > > sense. > > > > > > > > > > It makes sense. However, if the higher priority traffic does not > > > fill > > > > > the scheduling, the low priority can take up the slack so I'm > not > > > sure > > > > > if this is what you are seeing or something else. > > > > > > > > > > It might be interesting to try the same thing at SDR speeds. > > > > > > > > > > -- Hal > > > > > > > > > > > > Yes, 8 VLs should be supported in your subnet. 
You can > verify > > > this with > > > > > > > smpquery portinfo on the HCA port and examine OperVLs > assuming > > > the port > > > > > > > is ACTIVE. > > > > > > > > > > > > yes, I verified the data VL support, it is 8. I will poke for > > more > > > > > > info with suggested commands by Sasha. > > > > > > > > > > > > > > A related question is, if I modify qos setting in SM, do I > > > need to > > > > > > > > restart SA on each hosts for it to see the changes? (I am > > > hoping not, > > > > > > > > as I tried in the test, it doesn't seem to make a > > difference) > > > > > > > > > > > > > > Not sure what you mean. SA is tightly coupled with the > OpenSM. > > > Do you > > > > > > > mean SA client ? The client hosts don't need restarting but > > did > > > you > > > > > > > restart OpenSM with your QoS configuration ? > > > > > > > > > > > > I mean client SA. yes, I understand OpenSM needs to be > > restarted. > > > > > > > > > > > > > BTW, which OpenSM are you running ? > > > > > > > > > > > > OFED 1.1 based. > > > > > > > > > > > > thanks > > > > > > > > > > > > - Oliver > > > > > > > > > > > > > > > > From arlin.r.davis at intel.com Mon Nov 6 14:44:29 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 6 Nov 2006 14:44:29 -0800 Subject: [openib-general] [PATCH 1/3] uDAPL cma: add support for new client register event Message-ID: <000001c701f5$1e822750$bb97070a@amr.corp.intel.com> New series of patches with "-x -up" Added support for new ib verbs client register event. No extra processing required at the uDAPL level. Shows up if opensm bounces. 
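A minimal sketch of the dispatch described above, for illustration only. It assumes a stand-in event code (`EV_CLIENT_REREGISTER`) and a hypothetical helper (`log_async_event`) rather than the real libibverbs enum or the uDAPL logging call. The re-register event is logged and otherwise ignored, and the default case includes the numeric code so an unknown event can be identified.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for an async event code; not the libibverbs value. */
enum async_event {
	EV_CLIENT_REREGISTER = 2
};

/*
 * Formats the debug message for one async event into buf.
 * Returns 1 if the event is known (and deliberately not reported
 * upward), 0 if it fell through to the default case.
 */
static int log_async_event(int ev, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	switch (ev) {
	case EV_CLIENT_REREGISTER:
		/* nothing to do at this level: log it and move on */
		snprintf(buf, len, " async_event: CLIENT_REREGISTER");
		return 1;
	default:
		/* print the numeric code so unknown events can be traced */
		snprintf(buf, len, " async_event: %d UNKNOWN", ev);
		return 0;
	}
}
```
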
Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_util.c =================================================================== --- dapl/openib_cma/dapl_ib_util.c (revision 10032) +++ dapl/openib_cma/dapl_ib_util.c (working copy) @@ -744,9 +744,16 @@ void dapli_async_event_cb(struct _ib_hca hca->async_un_ctx); break; } + case IBV_EVENT_CLIENT_REREGISTER: + /* no need to report this event this time */ + dapl_dbg_log (DAPL_DBG_TYPE_WARN, + " async_event: IBV_EVENT_CLIENT_REREGISTER\n"); + break; + default: dapl_dbg_log (DAPL_DBG_TYPE_WARN, - " async_event: UNKNOWN\n"); + " async_event: %d UNKNOWN\n", + event.event_type); break; } From arlin.r.davis at intel.com Mon Nov 6 14:44:31 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 6 Nov 2006 14:44:31 -0800 Subject: [openib-general] [PATCH 2/3] uDAPL cma: fix issues with creating qp without rcv resources Message-ID: <000101c701f5$1fd24e00$bb97070a@amr.corp.intel.com> Fix some issues supporting create qp without recv cq handle or recv qp resources. IB verbs assume a recv_cq handle and uDAPL dapl_ep_create assumes there is always recv_sge resources specified. 
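The fallback described above can be sketched in isolation. This is a minimal illustration against stand-in types, not the libibverbs structures: `struct cq`, `struct ep_attr`, `struct qp_recv_cfg` and `pick_recv_settings()` are hypothetical names. When no receive CQ is supplied, the request CQ is reused so the verbs layer always sees a valid handle, and no receive resources are requested.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a completion queue handle (hypothetical, not struct ibv_cq). */
struct cq {
	int id;
};

/* Stand-in for the receive-related endpoint attributes. */
struct ep_attr {
	unsigned int max_recv_dtos;
	unsigned int max_recv_iov;
};

/* Stand-in for the receive-related QP creation settings. */
struct qp_recv_cfg {
	struct cq *recv_cq;
	unsigned int max_recv_wr;
	unsigned int max_recv_sge;
};

/*
 * With no receive CQ, fall back to the request CQ and ask for zero
 * receive work requests and SGEs. Otherwise pass the consumer's
 * settings through unchanged.
 */
static struct qp_recv_cfg pick_recv_settings(struct cq *req_cq,
					     struct cq *rcv_cq,
					     const struct ep_attr *attr)
{
	struct qp_recv_cfg cfg;

	assert(req_cq != NULL && attr != NULL);

	if (rcv_cq == NULL) {
		cfg.recv_cq = req_cq;
		cfg.max_recv_wr = 0;
		cfg.max_recv_sge = 0;
	} else {
		cfg.recv_cq = rcv_cq;
		cfg.max_recv_wr = attr->max_recv_dtos;
		cfg.max_recv_sge = attr->max_recv_iov;
	}
	return cfg;
}
```
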
Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/common/dapl_ep_create.c =================================================================== --- dapl/common/dapl_ep_create.c (revision 10032) +++ dapl/common/dapl_ep_create.c (working copy) @@ -166,7 +166,7 @@ dapl_ep_create ( (recv_evd_handle != DAT_HANDLE_NULL && ep_attr->max_recv_dtos == 0) || (request_evd_handle == DAT_HANDLE_NULL && ep_attr->max_request_dtos != 0) || (request_evd_handle != DAT_HANDLE_NULL && ep_attr->max_request_dtos == 0) || - ep_attr->max_recv_iov == 0 || + (recv_evd_handle != DAT_HANDLE_NULL && ep_attr->max_recv_iov == 0) || ep_attr->max_request_iov == 0 || (DAT_SUCCESS != dapl_ep_check_recv_completion_flags ( ep_attr->recv_completion_flags)) )) Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 10032) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -143,13 +143,21 @@ DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA /* Setup attributes and create qp */ dapl_os_memzero((void*)&qp_create, sizeof(qp_create)); qp_create.cap.max_send_wr = attr->max_request_dtos; - qp_create.cap.max_recv_wr = attr->max_recv_dtos; qp_create.cap.max_send_sge = attr->max_request_iov; - qp_create.cap.max_recv_sge = attr->max_recv_iov; qp_create.cap.max_inline_data = ia_ptr->hca_ptr->ib_trans.max_inline_send; qp_create.send_cq = req_cq; - qp_create.recv_cq = rcv_cq; + + /* ibv assumes rcv_cq is never NULL, set to req_cq */ + if (rcv_cq == NULL) { + qp_create.recv_cq = req_cq; + qp_create.cap.max_recv_wr = 0; + qp_create.cap.max_recv_sge = 0; + } else { + qp_create.recv_cq = rcv_cq; + qp_create.cap.max_recv_wr = attr->max_recv_dtos; + qp_create.cap.max_recv_sge = attr->max_recv_iov; + } qp_create.qp_type = IBV_QPT_RC; qp_create.qp_context = (void*)ep_ptr; From arlin.r.davis at intel.com Mon Nov 6 14:44:33 2006 From: arlin.r.davis at intel.com (Arlin Davis) Date: Mon, 6 Nov 2006 14:44:33 -0800 Subject: 
[openib-general] [PATCH 3/3] uDAPL cma: add support for address and route retries, call disconnect when recving dreq Message-ID: <000201c701f5$20c3b2e0$bb97070a@amr.corp.intel.com> Fix some timeout and long disconnect delay issues discovered during scale-out testing. Added support to retry rdma_cm address and route resolution with configuration options. Provide a disconnect call when receiving the disconnect request to guarantee a disconnect reply and event on the remote side. The rdma_disconnect was not being called from dat_ep_disconnect() as a result of the state changing to DISCONNECTED in the event callback.   Here are the new options (environment variables) with the default setting:   DAPL_CM_ARP_TIMEOUT_MS   4000 DAPL_CM_ARP_RETRY_COUNT  15 DAPL_CM_ROUTE_TIMEOUT_MS  4000 DAPL_CM_ROUTE_RETRY_COUNT 15     Signed-off by: Arlin Davis ardavis at ichips.intel.com Index: dapl/openib_cma/dapl_ib_cm.c =================================================================== --- dapl/openib_cma/dapl_ib_cm.c (revision 10032) +++ dapl/openib_cma/dapl_ib_cm.c (working copy) @@ -58,6 +58,9 @@ #include "dapl_ib_util.h" #include #include +#include +#include +#include #include extern struct rdma_event_channel *g_cm_events; @@ -99,8 +102,8 @@ static void dapli_addr_resolve(struct da &ipaddr->src_addr)->sin_addr.s_addr), ntohl(((struct sockaddr_in *) &ipaddr->dst_addr)->sin_addr.s_addr)); - - ret = rdma_resolve_route(conn->cm_id, 2000); + + ret = rdma_resolve_route(conn->cm_id, conn->route_timeout); if (ret) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " rdma_connect failed: %s\n",strerror(errno)); @@ -120,6 +123,7 @@ static void dapli_route_resolve(struct d struct rdma_addr *ipaddr = &conn->cm_id->route.addr; struct ib_addr *ibaddr = &conn->cm_id->route.addr.addr.ibaddr; #endif + dapl_dbg_log(DAPL_DBG_TYPE_CM, " route_resolve: cm_id %p SRC %x DST %x PORT %d\n", conn->cm_id, @@ -331,21 +335,17 @@ static void dapli_cm_active_cb(struct da case RDMA_CM_EVENT_UNREACHABLE: case 
RDMA_CM_EVENT_CONNECT_ERROR: { - ib_cm_events_t cm_event; - dapl_dbg_log( + dapl_dbg_log( DAPL_DBG_TYPE_WARN, " dapli_cm_active_handler: CONN_ERR " " event=0x%x status=%d %s\n", event->event, event->status, (event->status == -ETIMEDOUT)?"TIMEOUT":"" ); - /* no device type specified so assume IB for now */ - if (event->status == -ETIMEDOUT) /* IB timeout */ - cm_event = IB_CME_TIMEOUT; - else - cm_event = IB_CME_DESTINATION_UNREACHABLE; - - dapl_evd_connection_callback(conn, cm_event, NULL, conn->ep); + /* per DAT SPEC provider always returns UNREACHABLE */ + dapl_evd_connection_callback(conn, + IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->ep); break; } case RDMA_CM_EVENT_REJECTED: @@ -381,6 +381,7 @@ static void dapli_cm_active_cb(struct da break; case RDMA_CM_EVENT_DISCONNECTED: + rdma_disconnect(conn->cm_id); /* force the DREP */ /* validate EP handle */ if (!DAPL_BAD_HANDLE(conn->ep, DAPL_MAGIC_EP)) dapl_evd_connection_callback(conn, @@ -494,6 +495,7 @@ static void dapli_cm_passive_cb(struct d break; case RDMA_CM_EVENT_DISCONNECTED: + rdma_disconnect(conn->cm_id); /* force the DREP */ /* validate SP handle context */ if (!DAPL_BAD_HANDLE(conn->sp, DAPL_MAGIC_PSP) || !DAPL_BAD_HANDLE(conn->sp, DAPL_MAGIC_RSP)) @@ -543,7 +545,8 @@ DAT_RETURN dapls_ib_connect(IN DAT_EP_HA IN void *p_data) { struct dapl_ep *ep_ptr = ep_handle; - + struct dapl_cm_id *conn; + /* Sanity check */ if (NULL == ep_ptr) return DAT_SUCCESS; @@ -552,36 +555,38 @@ DAT_RETURN dapls_ib_connect(IN DAT_EP_HA r_qual,p_data,p_size); /* rdma conn and cm_id pre-bound; reference via qp_handle */ - ep_ptr->cm_handle = ep_ptr->qp_handle; + conn = ep_ptr->cm_handle = ep_ptr->qp_handle; /* Setup QP/CM parameters and private data in cm_id */ - (void)dapl_os_memzero(&ep_ptr->cm_handle->params, - sizeof(ep_ptr->cm_handle->params)); - ep_ptr->cm_handle->params.responder_resources = IB_TARGET_MAX; - ep_ptr->cm_handle->params.initiator_depth = IB_INITIATOR_DEPTH; - ep_ptr->cm_handle->params.flow_control = 1; - 
ep_ptr->cm_handle->params.rnr_retry_count = IB_RNR_RETRY_COUNT; - ep_ptr->cm_handle->params.retry_count = IB_RC_RETRY_COUNT; + (void)dapl_os_memzero(&conn->params, sizeof(conn->params)); + conn->params.responder_resources = IB_TARGET_MAX; + conn->params.initiator_depth = IB_INITIATOR_DEPTH; + conn->params.flow_control = 1; + conn->params.rnr_retry_count = IB_RNR_RETRY_COUNT; + conn->params.retry_count = IB_RC_RETRY_COUNT; if (p_size) { - dapl_os_memcpy(ep_ptr->cm_handle->p_data, p_data, p_size); - ep_ptr->cm_handle->params.private_data = - ep_ptr->cm_handle->p_data; - ep_ptr->cm_handle->params.private_data_len = p_size; + dapl_os_memcpy(conn->p_data, p_data, p_size); + conn->params.private_data = conn->p_data; + conn->params.private_data_len = p_size; } + /* copy in remote address, need a copy for retry attempts */ + dapl_os_memcpy(&conn->r_addr, r_addr, sizeof(*r_addr)); + /* Resolve remote address, src already bound during QP create */ - ((struct sockaddr_in*)r_addr)->sin_port = htons(MAKE_PORT(r_qual)); - if (rdma_resolve_addr(ep_ptr->cm_handle->cm_id, - NULL, (struct sockaddr *)r_addr, 2000)) + ((struct sockaddr_in*)&conn->r_addr)->sin_port = htons(MAKE_PORT(r_qual)); + ((struct sockaddr_in*)&conn->r_addr)->sin_family = AF_INET; + + if (rdma_resolve_addr(conn->cm_id, NULL, + (struct sockaddr *)&conn->r_addr, + conn->arp_timeout)) return dapl_convert_errno(errno,"ib_connect"); dapl_dbg_log(DAPL_DBG_TYPE_CM, - " connect: resolve_addr: cm_id %p SRC %x DST %x port %d\n", - ep_ptr->cm_handle->cm_id, - ntohl(((struct sockaddr_in *) - &ep_ptr->cm_handle->hca->hca_address)->sin_addr.s_addr), - ntohl(((struct sockaddr_in *)r_addr)->sin_addr.s_addr), - MAKE_PORT(r_qual) ); + " connect: resolve_addr: cm_id %p -> %s port %d\n", + conn->cm_id, + inet_ntoa(((struct sockaddr_in *)&conn->r_addr)->sin_addr), + ((struct sockaddr_in*)&conn->r_addr)->sin_port ); return DAT_SUCCESS; } @@ -1163,15 +1168,60 @@ void dapli_cma_event_cb(void) case RDMA_CM_EVENT_ADDR_RESOLVED: 
dapli_addr_resolve(conn); break; + case RDMA_CM_EVENT_ROUTE_RESOLVED: dapli_route_resolve(conn); break; + case RDMA_CM_EVENT_ADDR_ERROR: + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " CM ADDR ERROR: -> %s retry (%d)..\n", + inet_ntoa(((struct sockaddr_in *) + &conn->r_addr)->sin_addr), + conn->arp_retries); + + /* retry address resolution */ + if ((--conn->arp_retries) && + (event->status == -ETIMEDOUT)) { + int ret; + ret = rdma_resolve_addr( + conn->cm_id, NULL, + (struct sockaddr *)&conn->r_addr, + conn->arp_timeout); + if (!ret) + break; + else { + dapl_dbg_log( + DAPL_DBG_TYPE_WARN, + " ERROR: rdma_resolve_addr = " + "%d %s\n", + ret,strerror(errno)); + } + } + /* retries exhausted or resolve_addr failed */ + dapl_evd_connection_callback( + conn, IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->ep); + break; + + case RDMA_CM_EVENT_ROUTE_ERROR: - dapl_evd_connection_callback(conn, - IB_CME_DESTINATION_UNREACHABLE, - NULL, conn->ep); + dapl_dbg_log(DAPL_DBG_TYPE_WARN, + " CM ROUTE ERROR: -> %s retry (%d)..\n", + inet_ntoa(((struct sockaddr_in *) + &conn->r_addr)->sin_addr), + conn->route_retries ); + + /* retry route resolution */ + if ((--conn->route_retries) && + (event->status == -ETIMEDOUT)) + dapli_addr_resolve(conn); + else + dapl_evd_connection_callback( conn, + IB_CME_DESTINATION_UNREACHABLE, + NULL, conn->ep); break; + case RDMA_CM_EVENT_DEVICE_REMOVAL: dapl_evd_connection_callback(conn, IB_CME_LOCAL_FAILURE, Index: dapl/openib_cma/dapl_ib_qp.c =================================================================== --- dapl/openib_cma/dapl_ib_qp.c (revision 10032) +++ dapl/openib_cma/dapl_ib_qp.c (working copy) @@ -160,6 +168,17 @@ DAT_RETURN dapls_ib_qp_alloc(IN DAPL_IA conn->cm_id = cm_id; conn->ep = ep_ptr; conn->hca = ia_ptr->hca_ptr; + + /* setup timers for address and route resolution */ + conn->arp_timeout = dapl_os_get_env_val("DAPL_CM_ARP_TIMEOUT_MS", + IB_ARP_TIMEOUT); + conn->arp_retries = dapl_os_get_env_val("DAPL_CM_ARP_RETRY_COUNT", + 
IB_ARP_RETRY_COUNT); + conn->route_timeout = dapl_os_get_env_val("DAPL_CM_ROUTE_TIMEOUT_MS", + IB_ROUTE_TIMEOUT); + conn->route_retries = dapl_os_get_env_val("DAPL_CM_ROUTE_RETRY_COUNT", + IB_ROUTE_RETRY_COUNT); + ep_ptr->qp_handle = conn; ep_ptr->qp_state = IB_QP_STATE_INIT; Index: dapl/openib_cma/dapl_ib_util.h =================================================================== --- dapl/openib_cma/dapl_ib_util.h (revision 10032) +++ dapl/openib_cma/dapl_ib_util.h (working copy) @@ -67,8 +67,12 @@ typedef ib_hca_handle_t dapl_ibal_ca_t; #define IB_RC_RETRY_COUNT 7 #define IB_RNR_RETRY_COUNT 7 -#define IB_CM_RESPONSE_TIMEOUT 20 /* 4 sec */ -#define IB_CM_RETRIES 15 +#define IB_CM_RESPONSE_TIMEOUT 23 /* 16 sec */ +#define IB_CM_RETRIES 15 /* 240 sec total default */ +#define IB_ARP_TIMEOUT 4000 /* 4 sec */ +#define IB_ARP_RETRY_COUNT 15 /* 60 sec total */ +#define IB_ROUTE_TIMEOUT 4000 /* 4 sec */ +#define IB_ROUTE_RETRY_COUNT 15 /* 60 sec total */ #define IB_REQ_MRA_TIMEOUT 27 /* a little over 9 minutes */ #define IB_MAX_AT_RETRY 3 #define IB_TARGET_MAX 4 /* max_qp_ous_rd_atom */ @@ -177,12 +181,17 @@ struct ib_llist_entry struct dapl_cm_id { DAPL_OS_LOCK lock; int destroy; + int arp_retries; + int arp_timeout; + int route_retries; + int route_timeout; int in_callback; struct rdma_cm_id *cm_id; struct dapl_hca *hca; struct dapl_sp *sp; struct dapl_ep *ep; struct rdma_conn_param params; + DAT_SOCK_ADDR6 r_addr; int p_len; unsigned char p_data[IB_MAX_DREP_PDATA_SIZE]; }; From rdreier at cisco.com Mon Nov 6 14:34:14 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 14:34:14 -0800 Subject: [openib-general] Error starting IB HCA while Linux Boot In-Reply-To: (Adit Ranadive's message of "Mon, 6 Nov 2006 17:28:07 -0500") References: Message-ID: > ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM The HCA is reporting a memory calibration error. Most likely this is a physical hardware problem somewhere in your system. 
Is it just one system failing like this, or do you have multiple
HCAs/systems with the same error message?

 - R.

From hnguyen at linux.vnet.ibm.com Mon Nov 6 15:56:39 2006
From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Ngyuen)
Date: Tue, 7 Nov 2006 00:56:39 +0100
Subject: [openib-general] [PATCH 2.6.19 1/4] ehca: assure 4k alignment for firmware control block in 64k page mode
In-Reply-To: <200611062235.28667.arnd@arndb.de>
References: <200611062226.44939.hnguyen@linux.vnet.ibm.com> <200611062235.28667.arnd@arndb.de>
Message-ID: <200611070056.39994.hnguyen@linux.vnet.ibm.com>

> The point Heiko made in his comment is that with ehca_alloc_fw_ctrlblock
> returning a void*, you can (and _should_) remove the casts to other
> pointer types.

Hm, my bad. It should not have happened that I totally ignored the whole
sentence from Heiko.
Roland, please throw the previous one away. Here is, hopefully, the proper one.
Thanks
Nam

This is a patch for ehca that assures 4k alignment for the firmware control
block in 64k page mode, because the address returned by kzalloc() might not
be 4k aligned if the kernel's 64k page mode is enabled. Thus, we introduced
wrappers called ehca_alloc/free_fw_ctrlblock(), which use a slab cache for
objects with 4k length and 4k alignment in order to alloc/free firmware
control blocks in 64k page mode. In 4k page mode those wrappers are just
defines for get_zeroed_page() and free_page().
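The alignment requirement described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the driver code: it assumes a 4096-byte control block (`CTRLBLOCK_SIZE` is a name made up here) and uses C11 `aligned_alloc()` in place of the slab cache. `alloc_fw_ctrlblock()`, `free_fw_ctrlblock()` and `is_4k_aligned()` are hypothetical stand-ins for the ehca wrappers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed size and alignment of a firmware control block (4k). */
#define CTRLBLOCK_SIZE 4096u

/* Returns 1 if ptr sits on a 4k boundary, 0 otherwise. */
static int is_4k_aligned(const void *ptr)
{
	return ((uintptr_t)ptr & (CTRLBLOCK_SIZE - 1)) == 0;
}

/*
 * Userspace analogue of the allocation wrapper: returns a zeroed
 * block that is 4k long and 4k aligned, or NULL on failure.
 */
static void *alloc_fw_ctrlblock(void)
{
	void *blk = aligned_alloc(CTRLBLOCK_SIZE, CTRLBLOCK_SIZE);

	if (blk) {
		assert(is_4k_aligned(blk));
		memset(blk, 0, CTRLBLOCK_SIZE);
	}
	return blk;
}

/* Counterpart of the free wrapper; a NULL pointer is accepted. */
static void free_fw_ctrlblock(void *blk)
{
	free(blk);
}
```

A plain `malloc(4096)` gives no such guarantee, which is the userspace counterpart of the kzalloc() problem the patch addresses.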
Signed-off-by: Hoang-Nam Nguyen --- ehca_hca.c | 17 +++++++++-------- ehca_irq.c | 17 ++++++++--------- ehca_iverbs.h | 8 ++++++++ ehca_main.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++--------- ehca_mrmw.c | 8 ++++---- ehca_qp.c | 10 +++++----- 6 files changed, 81 insertions(+), 35 deletions(-) diff -Nurp roland_git_orig/drivers/infiniband/hw/ehca/ehca_hca.c roland_git_work/drivers/infiniband/hw/ehca/ehca_hca.c --- roland_git_orig/drivers/infiniband/hw/ehca/ehca_hca.c 2006-11-02 10:47:04.000000000 +0100 +++ roland_git_work/drivers/infiniband/hw/ehca/ehca_hca.c 2006-11-07 00:03:03.000000000 +0100 @@ -40,6 +40,7 @@ */ #include "ehca_tools.h" +#include "ehca_iverbs.h" #include "hcp_if.h" int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) @@ -49,7 +50,7 @@ int ehca_query_device(struct ib_device * ib_device); struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -96,7 +97,7 @@ int ehca_query_device(struct ib_device * = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); query_device1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -109,7 +110,7 @@ int ehca_query_port(struct ib_device *ib ib_device); struct hipz_query_port *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -162,7 +163,7 @@ int ehca_query_port(struct ib_device *ib props->active_speed = 0x1; query_port1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -178,7 +179,7 @@ int ehca_query_pkey(struct ib_device *ib return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -193,7 +194,7 @@ int 
ehca_query_pkey(struct ib_device *ib memcpy(pkey, &rblock->pkey_entries + index, sizeof(u16)); query_pkey1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -211,7 +212,7 @@ int ehca_query_gid(struct ib_device *ibd return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -227,7 +228,7 @@ int ehca_query_gid(struct ib_device *ibd memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); query_gid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } diff -Nurp roland_git_orig/drivers/infiniband/hw/ehca/ehca_irq.c roland_git_work/drivers/infiniband/hw/ehca/ehca_irq.c --- roland_git_orig/drivers/infiniband/hw/ehca/ehca_irq.c 2006-11-02 10:47:04.000000000 +0100 +++ roland_git_work/drivers/infiniband/hw/ehca/ehca_irq.c 2006-11-07 00:03:03.000000000 +0100 @@ -45,6 +45,7 @@ #include "ehca_tools.h" #include "hcp_if.h" #include "hipz_fns.h" +#include "ipz_pt_fn.h" #define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) #define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) @@ -137,38 +138,36 @@ int ehca_error_data(struct ehca_shca *sh u64 *rblock; unsigned long block_count; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Cannot allocate rblock memory."); ret = -ENOMEM; goto error_data1; } + /* rblock must be 4K aligned and should be 4K large */ ret = hipz_h_error_data(shca->ipz_hca_handle, resource, rblock, &block_count); - if (ret == H_R_STATE) { + if (ret == H_R_STATE) ehca_err(&shca->ib_device, "No error data is available: %lx.", resource); - } else if (ret == H_SUCCESS) { int length; length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); - if (length > PAGE_SIZE) - length = PAGE_SIZE; + if (length > EHCA_PAGESIZE) + length = EHCA_PAGESIZE; print_error_data(shca, data, rblock, length); - } - else { + } else 
ehca_err(&shca->ib_device, "Error data could not be fetched: %lx", resource); - } - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); error_data1: return ret; diff -Nurp roland_git_orig/drivers/infiniband/hw/ehca/ehca_iverbs.h roland_git_work/drivers/infiniband/hw/ehca/ehca_iverbs.h --- roland_git_orig/drivers/infiniband/hw/ehca/ehca_iverbs.h 2006-11-02 10:47:04.000000000 +0100 +++ roland_git_work/drivers/infiniband/hw/ehca/ehca_iverbs.h 2006-11-07 00:03:03.000000000 +0100 @@ -179,4 +179,12 @@ int ehca_mmap_register(u64 physical,void int ehca_munmap(unsigned long addr, size_t len); +#ifdef CONFIG_PPC_64K_PAGES +void *ehca_alloc_fw_ctrlblock(void); +void ehca_free_fw_ctrlblock(void *ptr); +#else +#define ehca_alloc_fw_ctrlblock() ((void*)get_zeroed_page(GFP_KERNEL)) +#define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) +#endif + #endif diff -Nurp roland_git_orig/drivers/infiniband/hw/ehca/ehca_main.c roland_git_work/drivers/infiniband/hw/ehca/ehca_main.c --- roland_git_orig/drivers/infiniband/hw/ehca/ehca_main.c 2006-11-02 10:47:04.000000000 +0100 +++ roland_git_work/drivers/infiniband/hw/ehca/ehca_main.c 2006-11-07 00:03:03.000000000 +0100 @@ -40,6 +40,9 @@ * POSSIBILITY OF SUCH DAMAGE. 
*/ +#ifdef CONFIG_PPC_64K_PAGES +#include +#endif #include "ehca_classes.h" #include "ehca_iverbs.h" #include "ehca_mrmw.h" @@ -49,7 +52,7 @@ MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0017"); +MODULE_VERSION("SVNEHCA_0018"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -94,11 +97,31 @@ spinlock_t ehca_cq_idr_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); + static struct list_head shca_list; /* list of all registered ehcas */ static spinlock_t shca_list_lock; static struct timer_list poll_eqs_timer; +#ifdef CONFIG_PPC_64K_PAGES +static struct kmem_cache *ctblk_cache = NULL; + +void *ehca_alloc_fw_ctrlblock(void) +{ + void *ret = kmem_cache_zalloc(ctblk_cache, SLAB_KERNEL); + if (!ret) + ehca_gen_err("Out of memory for ctblk"); + return ret; +} + +void ehca_free_fw_ctrlblock(void *ptr) +{ + if (ptr) + kmem_cache_free(ctblk_cache, ptr); + +} +#endif + static int ehca_create_slab_caches(void) { int ret; @@ -133,6 +156,17 @@ static int ehca_create_slab_caches(void) goto create_slab_caches5; } +#ifdef CONFIG_PPC_64K_PAGES + ctblk_cache = kmem_cache_create("ehca_cache_ctblk", + EHCA_PAGESIZE, H_CB_ALIGNMENT, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ctblk_cache) { + ehca_gen_err("Cannot create ctblk SLAB cache."); + ehca_cleanup_mrmw_cache(); + goto create_slab_caches5; + } +#endif return 0; create_slab_caches5: @@ -157,6 +191,10 @@ static void ehca_destroy_slab_caches(voi ehca_cleanup_qp_cache(); ehca_cleanup_cq_cache(); ehca_cleanup_pd_cache(); +#ifdef CONFIG_PPC_64K_PAGES + if (ctblk_cache) + kmem_cache_destroy(ctblk_cache); +#endif } #define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) @@ -168,7 +206,7 @@ int ehca_sense_attributes(struct ehca_sh u64 h_ret; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_gen_err("Cannot allocate rblock memory."); return 
-ENOMEM; @@ -211,7 +249,7 @@ int ehca_sense_attributes(struct ehca_sh shca->sport[1].rate = IB_RATE_30_GBPS; num_ports1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -220,7 +258,7 @@ static int init_node_guid(struct ehca_sh int ret = 0; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -235,7 +273,7 @@ static int init_node_guid(struct ehca_sh memcpy(&shca->ib_device.node_guid, &rblock->node_guid, sizeof(u64)); init_node_guid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -431,7 +469,7 @@ static ssize_t ehca_show_##name(struct \ shca = dev->driver_data; \ \ - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); \ + rblock = ehca_alloc_fw_ctrlblock(); \ if (!rblock) { \ dev_err(dev, "Can't allocate rblock memory."); \ return 0; \ @@ -439,12 +477,12 @@ static ssize_t ehca_show_##name(struct \ if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { \ dev_err(dev, "Can't query device properties"); \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ return 0; \ } \ \ data = rblock->name; \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ \ if ((strcmp(#name, "num_ports") == 0) && (ehca_nr_ports == 1)) \ return snprintf(buf, 256, "1\n"); \ @@ -752,7 +790,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0017)\n"); + "(Rel.: SVNEHCA_0018)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); diff -Nurp roland_git_orig/drivers/infiniband/hw/ehca/ehca_mrmw.c roland_git_work/drivers/infiniband/hw/ehca/ehca_mrmw.c --- roland_git_orig/drivers/infiniband/hw/ehca/ehca_mrmw.c 2006-11-02 10:47:04.000000000 +0100 +++ roland_git_work/drivers/infiniband/hw/ehca/ehca_mrmw.c 2006-11-07 00:03:03.000000000 +0100 @@ -1013,7 +1013,7 @@ int ehca_reg_mr_rpages(struct 
ehca_shca u32 i; u64 *kpage; - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1092,7 +1092,7 @@ int ehca_reg_mr_rpages(struct ehca_shca ehca_reg_mr_rpages_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p " @@ -1124,7 +1124,7 @@ inline int ehca_rereg_mr_rereg1(struct e ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1181,7 +1181,7 @@ inline int ehca_rereg_mr_rereg1(struct e } ehca_rereg_mr_rereg1_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x " diff -Nurp roland_git_orig/drivers/infiniband/hw/ehca/ehca_qp.c roland_git_work/drivers/infiniband/hw/ehca/ehca_qp.c --- roland_git_orig/drivers/infiniband/hw/ehca/ehca_qp.c 2006-11-02 10:47:04.000000000 +0100 +++ roland_git_work/drivers/infiniband/hw/ehca/ehca_qp.c 2006-11-07 00:03:03.000000000 +0100 @@ -811,8 +811,8 @@ static int internal_modify_qp(struct ib_ unsigned long spl_flags = 0; /* do query_qp to obtain current attr values */ - mqpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); - if (mqpcb == NULL) { + mqpcb = ehca_alloc_fw_ctrlblock(); + if (!mqpcb) { ehca_err(ibqp->device, "Could not get zeroed page for mqpcb " "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); return -ENOMEM; @@ -1225,7 +1225,7 @@ modify_qp_exit2: } modify_qp_exit1: - kfree(mqpcb); + ehca_free_fw_ctrlblock(mqpcb); return ret; } @@ -1277,7 +1277,7 @@ int ehca_query_qp(struct ib_qp *qp, return -EINVAL; } - qpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL ); + qpcb = ehca_alloc_fw_ctrlblock(); if (!qpcb) { ehca_err(qp->device,"Out of 
memory for qpcb " "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); @@ -1401,7 +1401,7 @@ int ehca_query_qp(struct ib_qp *qp, ehca_dmp(qpcb, 4*70, "qp_num=%x", qp->qp_num); query_qp_exit1: - kfree(qpcb); + ehca_free_fw_ctrlblock(qpcb); return ret; } From sashak at voltaire.com Mon Nov 6 16:34:10 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 7 Nov 2006 02:34:10 +0200 Subject: [openib-general] [PATCH TRIVIAL] opensm: trivial indentation fixes Message-ID: <20061107003410.GA30470@sashak.voltaire.com> Trivial indentation fixes in osm_inform.h and osm_service.h. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_inform.h | 55 ++++++++++++++++++++++---------------- osm/include/opensm/osm_service.h | 11 +++---- 2 files changed, 37 insertions(+), 29 deletions(-) diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h index cdce214..ab78aa1 100644 --- a/osm/include/opensm/osm_inform.h +++ b/osm/include/opensm/osm_inform.h @@ -78,7 +78,7 @@ BEGIN_C_DECLS * DESCRIPTION * The Inform record encapsulates the information needed by the * SA to manage InformInfo registrations and sending Reports(Notice) -* when SM receives Traps for registered LIDs. +* when SM receives Traps for registered LIDs. * * The inform records is not thread safe, thus callers must provide * serialization. @@ -105,18 +105,27 @@ BEGIN_C_DECLS */ typedef struct _osm_infr_t { - cl_list_item_t list_item; - osm_bind_handle_t h_bind; // a handle of lower level mad srvc - osm_infr_rcv_t* p_infr_rcv; // the receiver of inform_info's - osm_mad_addr_t report_addr; - ib_inform_info_record_t inform_record; + cl_list_item_t list_item; + osm_bind_handle_t h_bind; + osm_infr_rcv_t* p_infr_rcv; + osm_mad_addr_t report_addr; + ib_inform_info_record_t inform_record; } osm_infr_t; /* * FIELDS * list_item * List Item for qlist linkage. Must be first element!! 
* -* inform_record +* h_bind +* A handle of lower level mad srvc +* +* p_infr_rcv +* The receiver of inform_info's +* +* report_addr +* Report address +* +* inform_record * The Inform Info Record * * SEE ALSO @@ -234,14 +243,14 @@ osm_infr_get_by_rid( * p_subn * [in] Pointer to the subnet object * -* p_log -* [in] Pointer to the log object +* p_log +* [in] Pointer to the log object * -* p_inf_rec -* [in] Pointer to an inform_info record with the search RID +* p_inf_rec +* [in] Pointer to an inform_info record with the search RID * -* RETURN -* The matching osm_infr_t +* RETURN +* The matching osm_infr_t * SEE ALSO * Inform Record, osm_infr_construct, osm_infr_destroy *********/ @@ -265,14 +274,14 @@ osm_infr_get_by_rec( * p_subn * [in] Pointer to the subnet object * -* p_log -* [in] Pointer to the log object +* p_log +* [in] Pointer to the log object * -* p_inf_rec -* [in] Pointer to an inform_info record +* p_inf_rec +* [in] Pointer to an inform_info record * -* RETURN -* The matching osm_infr_t +* RETURN +* The matching osm_infr_t * SEE ALSO * Inform Record, osm_infr_construct, osm_infr_destroy *********/ @@ -313,11 +322,11 @@ osm_report_notice( * p_rcv * [in] Pointer to the trap receiver * -* p_ntc -* [in] Pointer to a copy of the incoming trap notice attribute. +* p_ntc +* [in] Pointer to a copy of the incoming trap notice attribute. 
* -* RETURN -* IB_SUCCESS on good completion +* RETURN +* IB_SUCCESS on good completion * * SEE ALSO * Inform Record, osm_trap_rcv diff --git a/osm/include/opensm/osm_service.h b/osm/include/opensm/osm_service.h index 97fed1d..d496fd0 100644 --- a/osm/include/opensm/osm_service.h +++ b/osm/include/opensm/osm_service.h @@ -100,12 +100,11 @@ BEGIN_C_DECLS typedef struct _osm_svcr_t { - cl_list_item_t list_item; - ib_net64_t svc_id; - ib_service_record_t service_record; - uint32_t modified_time; - uint32_t lease_period; - + cl_list_item_t list_item; + ib_net64_t svc_id; + ib_service_record_t service_record; + uint32_t modified_time; + uint32_t lease_period; } osm_svcr_t; /* * FIELDS -- 1.4.3.3.g8387 From sashak at voltaire.com Mon Nov 6 16:35:21 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 7 Nov 2006 02:35:21 +0200 Subject: [openib-general] [PATCH TRIVIAL] opensm: osm_service: remove unused svc_id field Message-ID: <20061107003521.GB30470@sashak.voltaire.com> This removes unused (but confused) svc_id field from osm_svcr structure. 
Signed-off-by: Sasha Khapyorsky
---
 osm/include/opensm/osm_service.h |    1 -
 osm/opensm/osm_service.c         |    1 -
 2 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/osm/include/opensm/osm_service.h b/osm/include/opensm/osm_service.h
index d496fd0..e68e295 100644
--- a/osm/include/opensm/osm_service.h
+++ b/osm/include/opensm/osm_service.h
@@ -101,7 +101,6 @@ BEGIN_C_DECLS
 typedef struct _osm_svcr_t
 {
 	cl_list_item_t		list_item;
-	ib_net64_t		svc_id;
 	ib_service_record_t	service_record;
 	uint32_t		modified_time;
 	uint32_t		lease_period;
diff --git a/osm/opensm/osm_service.c b/osm/opensm/osm_service.c
index d4a1dc1..42a5b51 100644
--- a/osm/opensm/osm_service.c
+++ b/osm/opensm/osm_service.c
@@ -82,7 +82,6 @@ osm_svcr_init(
   CL_ASSERT( p_svcr );
 
   p_svcr->modified_time = cl_get_time_stamp_sec();
-  /* p_svcr->svc_id = p_svc_rec->service_id; */
 
   /* We track the time left for this service in
      an external field to avoid extra cl_ntoh/hton
-- 
1.4.3.3.g8387

From vishal at endace.com Mon Nov 6 16:51:13 2006
From: vishal at endace.com (vishal)
Date: Tue, 07 Nov 2006 13:51:13 +1300
Subject: [openib-general] SRP Target Installation problem
Message-ID: <1162860673.5609.59.camel@julia.et.endace.com>

Hi,

I searched for the SRP target source on the openib site, and found it
under the gen1 branch. Is it available under the gen2 branch ?
I tried to install it on a machine which has OFED-1.1 (gen2) installed, and following are the first few errors from make :- /root/SRPTARGET/mtm_ib.h:38:29: error: ts_kernel_trace.h: No such file or directory /root/SRPTARGET/mtm_ib.h:39:30: error: ts_kernel_thread.h: No such file or directory /root/SRPTARGET/mtm_ib.h:40:30: error: ts_ib_core_types.h: No such file or directory /root/SRPTARGET/mtm_ib.h:41:24: error: ts_ib_core.h: No such file or directory /root/SRPTARGET/mtm_ib.h:42:23: error: ts_ib_mad.h: No such file or directory /root/SRPTARGET/mtm_ib.h:43:29: error: ts_ib_sa_client.h: No such file or directory /root/SRPTARGET/mtm_ib.h:44:28: error: ts_ib_cm_types.h: No such file or directory /root/SRPTARGET/mtm_ib.h:45:22: error: ts_ib_cm.h: No such file or directory /root/SRPTARGET/mtm_ib.h:46:29: error: ts_ib_dm_client.h: No such file or directory I managed to find the above missing files under the gen1 branch. I am not quite sure about installing packages from gen1 and gen2 on the same machine. Should I install IBGOLD instead of OFED ? Thanks! Vishal From mshefty at ichips.intel.com Mon Nov 6 16:45:16 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 06 Nov 2006 16:45:16 -0800 Subject: [openib-general] [RFC] [PATCH] rdma/ib_cm: fix APM support In-Reply-To: <454A490F.6050103@3leafnetworks.com> References: <000001c6fe31$ebc6e150$e0d8180a@amr.corp.intel.com> <454A490F.6050103@3leafnetworks.com> Message-ID: <454FD71C.1010909@ichips.intel.com> Venkatesh Babu wrote: > Let me make the steps clear - > 1. On Passive node register for remote port UP/DOWN event by > registering with ib_sa_serv_notice_hdlr() > 2. On Passive node start the listener by calling ib_cm_listen(). > 3. On Active node create the RC QP and establish the connection by > calling ib_send_cm_req(). In struct ib_cm_req_param specify both primary > path (say, through Port1) and alternate path (say, through Port2). 
> NOTE:-Assume Port1 of Active node is connected to Port1 of Passive node; > and Port2 of Active node is connected to Port2 of Passive node. > NOTE:- After this step QP's path_mig_state will be IB_MIG_ARMED. > 4. Let us say, Port1 on Active node fails > 5. IB_EVENT_PORT_ERR event is generated on Active node; and remote > port error event is generated on Passive node. > 6. In those event handler call ib_qp_modify() to set the > path_mig_state to IB_MIG_MIGRATED. This will let the HCA's firmware know > to switch to the alternate path. > 7. After a while, Port1 is comes back again. > 8. IB_EVENT_PORT_ACTIVE event is generated on Active node; and remote > port active event is generated on Passive node. > 9. On the Active node from IB_EVENT_PORT_ACTIVE event handler call > the ib_send_cm_lap() to send the alternate path (through Port1) to the > Passive node. > 9.1 Passive node receives the LAP message > 9.2 Calls ib_cm_init_rearm_attr() initialize the alternate path info > 9.3 Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM > 9.4 Send APR message back to the Active node. > 10. Active node receives the APR message > 11. Calls ib_cm_init_rearm_attr() initialize the alternate path info > 12. Calls ib_qp_modify() to update path_mig_state to IB_MIG_REARM > 13. Now when a first packet is passed between the Active and Passive > node the ib_core changes the path_mig_state to the IB_MIG_ARMED. > 14. Now it is all set for another failover. Using my cm patches, I have a test program that does the following: 1. Establish a connection between two nodes, including an alternate path. 2. Break the primary path (path 1). This generates IB_MIG_MIGRATED events on both nodes. Failover works. 3. Fix path 1. (Causes port active event on the client.) 4. Client sends a LAP message with path 1 to the server. ib_send_cm_lap. 5. Server loads the new alternate path. ib_cm_init_qp_attr and ib_modify_qp. 6. Server responds with an APR message. ib_send_cm_apr. 7. 
Client loads a new alternate path. ib_cm_init_qp_attr and ib_modify_qp. 8. Disconnect path 2 (original alternate). 9. Server sees IB_MIG_MIGRATED event. Client does not. I'm still debugging the issue as to why the client does not get the second IB_MIG_MIGRATED event. - Sean From sashak at voltaire.com Mon Nov 6 16:55:04 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 7 Nov 2006 02:55:04 +0200 Subject: [openib-general] [PATCH] opensm: permissions of db files directory Message-ID: <20061107005504.GC30470@sashak.voltaire.com> When creating directory for db files (guid2lid) storing create it with reasonable permissions (current 777 decimal = octal 01411) and don't do it world writable. Signed-off-by: Sasha Khapyorsky --- osm/opensm/osm_db_files.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/osm/opensm/osm_db_files.c b/osm/opensm/osm_db_files.c index 7539afd..dd0a8df 100644 --- a/osm/opensm/osm_db_files.c +++ b/osm/opensm/osm_db_files.c @@ -194,7 +194,7 @@ osm_db_init( /* make sure the directory exists */ if (lstat(p_db_imp->db_dir_name, &dstat)) { - if (mkdir(p_db_imp->db_dir_name, 777)) + if (mkdir(p_db_imp->db_dir_name, 0755)) { osm_log( p_log, OSM_LOG_ERROR, "osm_db_init: ERR 6101: " -- 1.4.3.3.g8387 From bugzilla-daemon at openib.org Mon Nov 6 20:26:26 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 6 Nov 2006 20:26:26 -0800 (PST) Subject: [openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop Message-ID: <20061107042626.A791C2283D8@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=263 ------- Comment #12 from rolandd at cisco.com 2006-11-06 20:26 ------- I finally spent some time tracking this down and I believe the problem is actually in the MAD layer. I will post more details and a patch to openib-general. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
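A short note on the osm_db_files.c permissions patch quoted earlier in this digest: in C, 777 is a decimal literal, so mkdir() received octal 01411 rather than the intended rwx bits. The following is a minimal sketch of that arithmetic, assuming a POSIX system; perm_bits(), special_bits() and make_private_dir() are hypothetical helper names used only for illustration and are not part of opensm.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* rwx permission bits (user/group/other) of a mode value. */
static unsigned int perm_bits(mode_t mode)
{
    return (unsigned int)(mode & 0777);
}

/* setuid/setgid/sticky bits of a mode value. */
static unsigned int special_bits(mode_t mode)
{
    return (unsigned int)(mode & 07000);
}

/*
 * Hypothetical helper (not opensm code): create a directory that only
 * its owner may write to. The leading 0 makes the literal octal; the
 * process umask may still clear further bits.
 */
static int make_private_dir(const char *path)
{
    return mkdir(path, 0755);
}
```

Passing decimal 777 through these helpers gives permission bits 0411 plus the sticky bit (01000), while octal 0755 gives rwxr-xr-x and no special bits, which is what the patch asks for.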
From rdreier at cisco.com Mon Nov 6 20:33:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Nov 2006 20:33:25 -0800 Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive Message-ID: I've been working on fixing http://openib.org/bugzilla/show_bug.cgi which is an IPoIB crash. I figured out that the problem was a use-after-free caused by two path record query completions being generated, one with a successful status and one with a status of -EINTR. In turn, this was caused by the sa_query module getting both a successful receive completion and a send completion with status IB_WC_WR_FLUSH_ERR. I think I understand why this happens: in mad.c, ib_cancel_mad() calls ib_modify_mad(), which ends up setting the timeout to 0 and mad_send_wr->status to IB_WC_WR_FLUSH_ERR, and then calling ib_reset_mad_timeout() to actually flush the MAD send. However, this doesn't really flush the MAD immediately -- ib_reset_mad_timeout() defers it to process context via wait_for_response(), which schedules timed_work to run. However, this leaves a window where a receive completion could be polled before timeout_sends() gets a chance to run. This gives ib_mad_complete_recv() a chance to grab the send request and generate receive and send callbacks for it. The function does: mad_send_wc.status = IB_WC_SUCCESS; mad_send_wc.vendor_err = 0; mad_send_wc.send_buf = &mad_send_wr->send_buf; ib_mad_complete_send_wr(mad_send_wr, &mad_send_wc); so it tries to set mad_send_wc.status to IB_WC_SUCCESS, but ib_mad_complete_send_wr() has: if (mad_send_wr->status != IB_WC_SUCCESS ) mad_send_wc->status = mad_send_wr->status; so the status field will be set to IB_WC_WR_FLUSH_ERR. I don't believe we should generate receive callbacks for canceled sends, so I came up with the patch below (much simpler than the explanation that led up to it). I am no longer able to reproduce the IPoIB crash with this applied so I feel pretty good about this. Sean/Hal, does this make sense? 
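The status handling described above can be modelled in user space. This is only a sketch under stated assumptions: the enum, struct and function names below are simplified stand-ins invented for illustration, not the kernel's definitions, and the model ignores locking and the work queue entirely. It shows why a canceled send that is still matched by a response ends up reporting a flush error, and how refusing to match canceled sends avoids the second completion.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel status codes (illustrative only). */
enum model_status { MODEL_SUCCESS, MODEL_FLUSH_ERR };

/* Stand-in for the send work request; only the field discussed above. */
struct model_send_wr {
    enum model_status status;
};

/* Models cancel: the send is only marked; the real flush runs later. */
static void model_cancel(struct model_send_wr *wr)
{
    wr->status = MODEL_FLUSH_ERR;
}

/*
 * Models the override on send completion: a non-success status stored
 * in the request replaces whatever status the caller wanted to report.
 */
static enum model_status model_complete_send(const struct model_send_wr *wr,
                                             enum model_status reported)
{
    if (wr->status != MODEL_SUCCESS)
        reported = wr->status;
    return reported;
}

/*
 * Models the receive path's decision to match a response to a send.
 * With skip_canceled set, a send already marked by cancel is treated
 * as if no matching send had been found.
 */
static int model_response_matches(const struct model_send_wr *wr,
                                  int skip_canceled)
{
    if (wr == NULL)
        return 0;
    if (skip_canceled && wr->status != MODEL_SUCCESS)
        return 0;
    return 1;
}
```

In this model, calling model_cancel() and then matching without skip_canceled yields both a receive callback and a send completion carrying the flush error; with skip_canceled the response is dropped and only the deferred flush completes the send.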
If this looks good, I will merge this for 2.6.19. Thanks... diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 493f4c6..363db08 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1804,7 +1804,7 @@ static void ib_mad_complete_recv(struct if (ib_response_mad(mad_recv_wc->recv_buf.mad)) { spin_lock_irqsave(&mad_agent_priv->lock, flags); mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc); - if (!mad_send_wr) { + if (!mad_send_wr || mad_send_wr->status != IB_WC_SUCCESS) { spin_unlock_irqrestore(&mad_agent_priv->lock, flags); ib_free_recv_mad(mad_recv_wc); deref_mad_agent(mad_agent_priv); From ak at suse.de Mon Nov 6 09:47:26 2006 From: ak at suse.de (Andi Kleen) Date: Mon, 6 Nov 2006 18:47:26 +0100 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: <20061105064801.GV13381@stusta.de> Message-ID: <200611061847.27540.ak@suse.de> On Sunday 05 November 2006 10:30, Jeff Chua wrote: > Here's results with vanilla 2.6.19-rc4 (gcc version 3.4.5) ... > > 1) PCI access mode (Any) ... FAILED > > 2) PCI access mode (MMConfig) ... FAILED Full boot log please? -Andi From earny at net4u.de Mon Nov 6 18:17:34 2006 From: earny at net4u.de (Ernst Herzberg) Date: Tue, 7 Nov 2006 03:17:34 +0100 Subject: [openib-general] 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <20061105064801.GV13381@stusta.de> References: <20061105064801.GV13381@stusta.de> Message-ID: <200611070317.42230.earny@net4u.de> On Sunday 05 November 2006 07:48, Adrian Bunk wrote: > ... > Subject : ThinkPad R50p: boot fail with (lapic && on_battery) > References : http://lkml.org/lkml/2006/10/31/333 > Submitter : Ernst Herzberg > Status : problem is being debugged Update: 2.6.19-rc4-git11 does _not_ fix the problem. But now it doesn't matter if Cardbus/PCMCIA is compiled in or not. 
What also matters is a setting in the BIOS: If i set instead of Power -> SpeedStep -> Mode for AC - "Max Performance" Mode for Battery "Max Battery" to Power -> SpeedStep -> Mode for AC - "Max Battery" Mode for Battery "Max Performance" he _does_ boot from battery but also on AC... Also the problem is now(?) not very reliable. Sometimes the boot is successful even on battery and then the laptop works without a glitch. This makes it not easier to isolate the problem. About the reverts of some patches: I'm not, lets say it carefull, an very experienced git user;-) What i get is: # git-revert -n cf4c6a2f27f5db810b69dcb1da7f194489e8ff88 First trying simple merge strategy to revert. Simple revert fails; trying Automatic revert. Auto-merging arch/i386/kernel/io_apic.c merge: warning: conflicts during merge ERROR: Merge conflict in arch/i386/kernel/io_apic.c .... Is somewhere a howto-revert-patches-in-kernel-git-for-raw-beginners? From len.brown at intel.com Mon Nov 6 21:41:24 2006 From: len.brown at intel.com (Len Brown) Date: Tue, 7 Nov 2006 00:41:24 -0500 Subject: [openib-general] [linux-pm] 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <200611070317.42230.earny@net4u.de> References: <20061105064801.GV13381@stusta.de> <200611070317.42230.earny@net4u.de> Message-ID: <200611070041.28008.len.brown@intel.com> On Monday 06 November 2006 21:17, Ernst Herzberg wrote: > On Sunday 05 November 2006 07:48, Adrian Bunk wrote: > > ... > > Subject : ThinkPad R50p: boot fail with (lapic && on_battery) > > References : http://lkml.org/lkml/2006/10/31/333 > > Submitter : Ernst Herzberg > > Status : problem is being debugged > > Update: > > 2.6.19-rc4-git11 does _not_ fix the problem. > > But now it doesn't matter if Cardbus/PCMCIA is compiled in or not. 
> > What also matters is a setting in the BIOS: > > If i set instead of > > Power -> > SpeedStep -> > Mode for AC - "Max Performance" > Mode for Battery "Max Battery" > > to > > Power -> > SpeedStep -> > Mode for AC - "Max Battery" > Mode for Battery "Max Performance" > > he _does_ boot from battery but also on AC... so with the new settings it boots properly in either AC or battery mode with forced "lapic"? Strange, one would expect this to refer to APM settings, but who knows... Please test if booting with "processor.max_cstate=1" makes any difference Please test if building with CONFIG_CPU_FREQ=n makes any difference. Also, please make sure that booting with "apm=off" makes no difference -- there is a bug where the APM code is not currently disabled in ACPI mode, and who knows what effect that may have... > Also the problem is now(?) not very reliable. Sometimes the boot is successful > even on battery and then the laptop works without a glitch. > This makes it not easier to isolate the problem. > > Is somewhere a howto-revert-patches-in-kernel-git-for-raw-beginners? Documentation/git-bisect.txt is what you want -- though if you can't reliably reproduce the failure it may not easily lead you to the failure. Also, it may be a good use of time to make darn sure that booting 2.6.18 with forced "lapic" when on battery does not fail. ie. is this really a regression -- of did forcing the lapic on this box never work reliably? -Len From vuhuong at mellanox.com Mon Nov 6 22:48:38 2006 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 06 Nov 2006 22:48:38 -0800 Subject: [openib-general] SRP Target Installation problem In-Reply-To: <1162860673.5609.59.camel@julia.et.endace.com> References: <1162860673.5609.59.camel@julia.et.endace.com> Message-ID: <45502C46.9070007@mellanox.com> > > I managed to find the above missing files under the gen1 branch. I am > not quite sure about installing packages from gen1 and gen2 on the same > machine. 
Should I install IBGOLD instead of OFED ? > srp target sw only run on gen1 (ie. IBGD stack) and yes you need to install/run IBGD stack instead of OFED -vu From sjunaid at novell.com Mon Nov 6 23:27:10 2006 From: sjunaid at novell.com (Junaid Shariff) Date: Tue, 07 Nov 2006 00:27:10 -0700 Subject: [openib-general] [PATCH] IB/perftest: Fix ib_read_bw tool to parse -V option Message-ID: <45508207.9551.0036.0@novell.com> Hi, The ib_read_bw tool when used with the -V (version) option returns an error because it is not being parsed as an option. Regards, Junaid Novell, Inc. Signed-off-by: Junaid Shariff Index: userspace/perftest/read_bw.c =================================================================== --- src/userspace/perftest/read_bw.c 2006-10-19 19:51:09.000000000 +0530 +++ src1/userspace/perftest/read_bw.c 2006-11-06 14:07:20.000000000 +0530 @@ -624,7 +624,7 @@ { 0 } }; - c = getopt_long(argc, argv, "p:d:i:m:o:s:n:t:bae", long_options, NULL); + c = getopt_long(argc, argv, "p:d:i:m:o:s:n:t:baeV", long_options, NULL); if (c == -1) break; From mst at mellanox.co.il Tue Nov 7 00:44:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Nov 2006 10:44:53 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <200611061225.19077.jackm@dev.mellanox.co.il> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <200611061225.19077.jackm@dev.mellanox.co.il> Message-ID: <20061107084453.GA29565@mellanox.co.il> Quoting r. Jack Morgenstein : > Subject: Re: [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support > > On Wednesday 25 October 2006 00:25, Sean Hefty wrote: > > The following set of patches expand the rdma_cm support to include > > UD and multicast, and expose the rdma_cm to userspace. I would like to > > target the 2.6.20 kernel, but at least getting them into one or more > > branches would be helpful for other developers to test against these > > changes. 
> > > > I have incorporated your rdma patches for 2.6.20 (1-7 v2) into > our driver, and am experiencing problems with multicast.c. Here's another one from tonight - this is with IB stack from 2.6.19 + patches 1-7 loaded in SLES10 - kernel 2.6.16.21-0.8-smp. Nov 7 00:55:09 sw085 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:09 sw085 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:09 sw085 kernel: ib0: couldn't attach QP to multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:09 sw085 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:55:11 sw085 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:11 sw085 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:11 sw085 kernel: ib0: couldn't attach QP to multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:11 sw085 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:55:15 sw085 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:15 sw085 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:15 sw085 kernel: ib0: couldn't attach QP to multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:15 sw085 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:55:23 sw085 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:23 sw085 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:23 sw085 kernel: ib0: couldn't attach QP to multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:23 sw085 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:55:39 sw085 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:39 sw085 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:39 sw085 kernel: ib0: couldn't attach QP to multicast group 
ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:39 sw085 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:55:55 sw085 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:55 sw085 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:55 sw085 kernel: ib0: couldn't attach QP to multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:55 sw085 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:56:01 sw085 /usr/sbin/cron[26691]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null) Nov 7 00:56:01 sw085 /usr/sbin/cron[26692]: (root) CMD (/usr/check_mswg.csh >/dev/null) Nov 7 00:56:20 sw085 kernel: NET: Unregistered protocol family 27 Nov 7 00:56:21 sw085 ifdown: ib0 device: Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) Nov 7 00:56:21 sw085 ifdown: ib0 configuration: ib1 Nov 7 00:56:21 sw085 kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready Nov 7 00:56:21 sw085 ifdown: ib1 device: Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) Nov 7 00:56:23 sw085 ifdown: ib0 Nov 7 00:56:23 sw085 ifdown: ib0 configuration: ib1 Nov 7 00:56:23 sw085 ifdown: ib1 Nov 7 00:56:29 sw085 kernel: NMI Watchdog detected LOCKUP on CPU 0 Nov 7 00:56:29 sw085 kernel: CPU 0 Nov 7 00:56:29 sw085 kernel: Modules linked in: ib_sa ib_umad ib_mthca ib_mad ib_core mst_pciconf mst_pci autofs4 ipv6 cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table nfs lockd nfs_acl sunrpc af_packet button battery ac apparmor aamatch_pcre loop dm_mod ehci_hcd ohci_hcd shpchp ide_cd cdrom i2c_nforce2 i2c_core usbcore pci_hotplug tg3 ext3 jbd edd fan thermal processor sg mptspi mptscsih mptbase scsi_transport_spi sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Nov 7 00:56:29 sw085 kernel: Pid: 26415, comm: ib_mad2 Tainted: GF U 2.6.16.21-0.8-smp #1 Nov 7 00:56:29 sw085 kernel: RIP: 0010:[] 
{.text.lock.spinlock+32} Nov 7 00:56:29 sw085 kernel: RSP: 0018:ffff8101053c1cd0 EFLAGS: 00000086 Nov 7 00:56:29 sw085 kernel: RAX: 0000000000000296 RBX: ffff81007c52e200 RCX: 0000000000000002 Nov 7 00:56:29 sw085 kernel: RDX: 0000000000000000 RSI: 0000000000000007 RDI: ffff81003b210e20 Nov 7 00:56:29 sw085 kernel: RBP: ffff81003b210e20 R08: ffff81011a44f900 R09: 0000000000000000 Nov 7 00:56:29 sw085 kernel: R10: ffff81013d2a97c0 R11: 0000000000000000 R12: ffff81003b210e18 Nov 7 00:56:29 sw085 kernel: R13: ffff81007c52e200 R14: ffff81007c52e268 R15: ffffffff882825c6 Nov 7 00:56:29 sw085 kernel: FS: 00002b9e3df036d0(0000) GS:ffffffff80445000(0000) knlGS:00000000f7d9c6b0 Nov 7 00:56:29 sw085 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Nov 7 00:56:29 sw085 kernel: CR2: 00002abaab0dd058 CR3: 000000005e7ea000 CR4: 00000000000006e0 Nov 7 00:56:29 sw085 kernel: Process ib_mad2 (pid: 26415, threadinfo ffff8101053c0000, task ffff8100315df080) Nov 7 00:56:29 sw085 kernel: Stack: ffffffff883222d4 0000000100000001 ffff81007961a900 ffff81013c3d5c50 Nov 7 00:56:29 sw085 kernel: 00000000ffffff92 ffff81007c52e200 ffffffff883228cf ffffffff801e9537 Nov 7 00:56:29 sw085 kernel: 0000000000000000 ffff8101053c1da8 Nov 7 00:56:29 sw085 kernel: Call Trace: {:ib_sa:release_group+30} Nov 7 00:56:29 sw085 kernel: {:ib_sa:mcast_work_handler+1124} {find_next_bit+85} Nov 7 00:56:29 sw085 kernel: {find_busiest_group+335} {:ib_sa:join_handler+32} Nov 7 00:56:29 sw085 kernel: {:ib_sa:ib_sa_mcmember_rec_callback+64} Nov 7 00:56:29 sw085 kernel: {thread_return+0} {:ib_sa:send_handler+72} Nov 7 00:56:29 sw085 kernel: {:ib_mad:timeout_sends+340} {run_workqueue+153} Nov 7 00:56:29 sw085 kernel: {worker_thread+0} {keventd_create_kthread+0} Nov 7 00:56:29 sw085 kernel: {worker_thread+265} {__wake_up_common+62} Nov 7 00:56:29 sw085 kernel: {default_wake_function+0} {keventd_create_kthread+0} Nov 7 00:56:29 sw085 kernel: {kthread+236} {child_rip+8} Nov 7 00:56:29 sw085 kernel: 
{keventd_create_kthread+0} {kthread+0} Nov 7 00:56:29 sw085 kernel: {child_rip+0} Nov 7 00:56:29 sw085 kernel: Nov 7 00:56:29 sw085 kernel: Code: f3 90 83 3f 00 7e f9 e9 c2 fe ff ff f3 90 83 3f 00 7e f9 e9 Nov 7 00:56:29 sw085 kernel: console shuts up ... -- MST From mst at mellanox.co.il Tue Nov 7 00:50:11 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Nov 2006 10:50:11 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <454F2D45.6060903@voltaire.com> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <200611061225.19077.jackm@dev.mellanox.co.il> <454F2D45.6060903@voltaire.com> Message-ID: <20061107085011.GB29565@mellanox.co.il> Quoting Or Gerlitz : > I am using this patch series on top of Roland git > tree of few weeks ago (eg more or less as 2.6.19-rc3) and have not got > this crash. Did you enable kernel debugging options? The crash is much harder to reproduce without them. We have these set: [*] Kernel debugging (17) Kernel log buffer size (16 => 64KB, 17 => 128KB) [*] Detect Soft Lockups [ ] Collect scheduler statistics [*] Debug slab memory allocations [*] Memory leak debugging [*] Mutex debugging, deadlock detection [*] Spinlock debugging [*] Sleep-inside-spinlock checking [*] kobject debugging [*] Compile the kernel with debug info [ ] Debug Filesystem -- MST From jackm at dev.mellanox.co.il Tue Nov 7 00:53:49 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 7 Nov 2006 10:53:49 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <200611061616.30434.jackm@dev.mellanox.co.il> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <454F2D45.6060903@voltaire.com> <200611061616.30434.jackm@dev.mellanox.co.il> Message-ID: <200611071053.50816.jackm@dev.mellanox.co.il> Another kernel Oops in the multicast module (core/multicast.c), again in procedure release_group, this time running under SLES10.0 
(2.6.16.21-0.8-smp). This occurred during multicast regression testing: Nov 7 00:55:55 kernel: ib_mthca 0000:04:00.0: No AMGM entries left Nov 7 00:55:55 kernel: ib0: failed to attach to multicast group, ret = -12 Nov 7 00:55:55 kernel: ib0: couldn't attach QP to multicast group ff12:401b:ffff:0000:0000:0000:0000:0001 Nov 7 00:55:55 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -12 Nov 7 00:56:20 kernel: NET: Unregistered protocol family 27 Nov 7 00:56:21 ifdown: ib0 device: Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) Nov 7 00:56:21 ifdown: ib0 configuration: ib1 Nov 7 00:56:21 kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready Nov 7 00:56:21 ifdown: ib1 device: Mellanox Technologies MT25208 InfiniHost III Ex (rev a0) Nov 7 00:56:23 ifdown: ib0 Nov 7 00:56:23 ifdown: ib0 configuration: ib1 Nov 7 00:56:23 ifdown: ib1 Nov 7 00:56:29 kernel: NMI Watchdog detected LOCKUP on CPU 0 Nov 7 00:56:29 kernel: CPU 0 Nov 7 00:56:29 kernel: Modules linked in: ib_sa ib_umad ib_mthca ib_mad ib_core mst_pciconf mst_pci autofs4 ipv6 cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table nfs lockd nfs_acl sunrpc af_packet button battery ac apparmor aamatch_pcre loop dm_mod ehci_hcd ohci_hcd shpchp ide_cd cdrom i2c_nforce2 i2c_core usbcore pci_hotplug tg3 ext3 jbd edd fan thermal processor sg mptspi mptscsih mptbase scsi_transport_spi sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Nov 7 00:56:29 kernel: Pid: 26415, comm: ib_mad2 Tainted: GF U 2.6.16.21-0.8-smp #1 Nov 7 00:56:29 kernel: RIP: 0010:[] {.text.lock.spinlock+32} Nov 7 00:56:29 kernel: RSP: 0018:ffff8101053c1cd0 EFLAGS: 00000086 Nov 7 00:56:29 kernel: RAX: 0000000000000296 RBX: ffff81007c52e200 RCX: 0000000000000002 Nov 7 00:56:29 kernel: RDX: 0000000000000000 RSI: 0000000000000007 RDI: ffff81003b210e20 Nov 7 00:56:29 kernel: RBP: ffff81003b210e20 R08: ffff81011a44f900 R09: 0000000000000000 Nov 7 00:56:29 kernel: R10: ffff81013d2a97c0 
R11: 0000000000000000 R12: ffff81003b210e18 Nov 7 00:56:29 kernel: R13: ffff81007c52e200 R14: ffff81007c52e268 R15: ffffffff882825c6 Nov 7 00:56:29 kernel: FS: 00002b9e3df036d0(0000) GS:ffffffff80445000(0000) knlGS:00000000f7d9c6b0 Nov 7 00:56:29 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Nov 7 00:56:29 kernel: CR2: 00002abaab0dd058 CR3: 000000005e7ea000 CR4: 00000000000006e0 Nov 7 00:56:29 kernel: Process ib_mad2 (pid: 26415, threadinfo ffff8101053c0000, task ffff8100315df080) Nov 7 00:56:29 kernel: Stack: ffffffff883222d4 0000000100000001 ffff81007961a900 ffff81013c3d5c50 Nov 7 00:56:29 kernel: 00000000ffffff92 ffff81007c52e200 ffffffff883228cf ffffffff801e9537 Nov 7 00:56:29 kernel: 0000000000000000 ffff8101053c1da8 Nov 7 00:56:29 kernel: Call Trace: {:ib_sa:release_group+30} Nov 7 00:56:29 kernel: {:ib_sa:mcast_work_handler+1124} {find_next_bit+85} Nov 7 00:56:29 kernel: {find_busiest_group+335} {:ib_sa:join_handler+32} Nov 7 00:56:29 kernel: {:ib_sa:ib_sa_mcmember_rec_callback+64} Nov 7 00:56:29 kernel: {thread_return+0} {:ib_sa:send_handler+72} Nov 7 00:56:29 kernel: {:ib_mad:timeout_sends+340} {run_workqueue+153} Nov 7 00:56:29 kernel: {worker_thread+0} {keventd_create_kthread+0} Nov 7 00:56:29 kernel: {worker_thread+265} {__wake_up_common+62} Nov 7 00:56:29 kernel: {default_wake_function+0} {keventd_create_kthread+0} Nov 7 00:56:29 kernel: {kthread+236} {child_rip+8} Nov 7 00:56:29 kernel: {keventd_create_kthread+0} {kthread+0} Nov 7 00:56:29 kernel: {child_rip+0} Nov 7 00:56:29 kernel: Nov 7 00:56:29 kernel: Code: f3 90 83 3f 00 7e f9 e9 c2 fe ff ff f3 90 83 3f 00 7e f9 e9 Nov 7 00:56:29 kernel: console shuts up ... 
- Jack From ogerlitz at voltaire.com Tue Nov 7 00:53:36 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 7 Nov 2006 10:53:36 +0200 (IST) Subject: [openib-general] questions and a comment on the perftest package Message-ID: Hi Michael, We have some questions re the statistics of the perftest utils, hope you can comment on the below. Also, would you be able to run Lindent on the perftest sources, else the main loops are very much unreadable... Or. --- /dev/null 2006-10-31 15:38:42.428537323 +0200 +++ perftest/read_lat.c 2006-11-07 11:49:06.000000000 +0200 +int run_iter(struct pingpong_context *ctx, struct user_parameters *user_param, + struct pingpong_dest *rem_dest, int size) +{ + int scnt, ccnt; + int iters; + scnt = 0; + ccnt = 0; + struct ibv_wc wc; + int ne; ... + if (user_param->servername) { + while (scnt < user_param->iters) { + struct ibv_send_wr *bad_wr; + *post_buf = (char)++scnt; + tstamp[scnt - 1] = get_cycles(); + if (ibv_post_send(qp, wr, &bad_wr)) { + fprintf(stderr, "Couldn't post send: scnt=%d\n", + scnt); + return 11; + } + if (user_param->use_event) { ... + } + do { + ne = ibv_poll_cq(ctx->cq, 1, &wc); + } while (!user_param->use_event && ne < 1); ... Whouldn't it make more sense to get one time stamp before the i'th posting and one tstamp after the i'th completion is reaped from the cq? +static void print_report(struct report_options *options, + unsigned int iters, cycles_t * tstamp, int size) +{ + double cycles_to_units; + cycles_t median; + unsigned int i; + const char *units; + cycles_t *delta = malloc(iters * sizeof *delta); + + if (!delta) { + perror("malloc"); + return; + } + + for (i = 0; i < iters - 1; ++i) + delta[i] = tstamp[i + 1] - tstamp[i]; + .... 
+ qsort(delta, iters - 1, sizeof *delta, cycles_compare); + + if (options->histogram) { + printf("#, %s\n", units); + for (i = 0; i < iters - 1; ++i) + printf("%d, %g\n", i + 1, delta[i] / cycles_to_units); + } + + median = get_median(iters - 1, delta); + printf("%7d %d %7.2f %7.2f %7.2f\n", + size, iters, delta[0] / cycles_to_units, + delta[iters - 3] / cycles_to_units, median / cycles_to_units); Following the above suggestion would enable to print the avg,min,max,median,std statistics where now there is this delta approximation which i find less informative. Or. From mst at mellanox.co.il Tue Nov 7 01:13:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Nov 2006 11:13:19 +0200 Subject: [openib-general] questions and a comment on the perftest package In-Reply-To: References: Message-ID: <20061107091318.GD29565@mellanox.co.il> Quoting Or Gerlitz : > Following the above suggestion would enable to print the avg,min,max,median,std > statistics where now there is this delta approximation which i find less informative. I'm not very interested in adding the average/std statistics - our measurements have too high a deviation for these to be useful. There's the -H option so you can always feed the results to your favorite statistical package. In my experience, people ignore the std deviation and immediately zone in on the average as the main result - so I get reports like "my latency went up by 2micro, help!" when there was one freak huge value that throws the avg off, and the std is on the order of 10 micro. Median does not have this problem - you get a representative value no matter what. And in practice in my measurements with default values, it is stable to within 0.1 micro. -- MST From mst at mellanox.co.il Tue Nov 7 01:18:36 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 7 Nov 2006 11:18:36 +0200 Subject: [openib-general] questions and a comment on the perftest package In-Reply-To: References: Message-ID: <20061107091836.GE29565@mellanox.co.il> Quoting Or Gerlitz : > Wouldn't it make more sense to get one time stamp before the i'th posting > and one tstamp after the i'th completion is reaped from the cq? That's what we do, anyway - look how this works: stamp[0] post poll stamp[1] post poll stamp[2] .... so stamp[i] is taken before the i'th posting and stamp[i+1] is after the i'th completion. -- MST From ogerlitz at voltaire.com Tue Nov 7 01:52:41 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 07 Nov 2006 11:52:41 +0200 Subject: [openib-general] questions and a comment on the perftest package In-Reply-To: <20061107091836.GE29565@mellanox.co.il> References: <20061107091836.GE29565@mellanox.co.il> Message-ID: <45505769.8080302@voltaire.com> Michael S. Tsirkin wrote: > Quoting Or Gerlitz : >> Wouldn't it make more sense to get one time stamp before the i'th posting >> and one tstamp after the i'th completion is reaped from the cq? > > That's what we do, anyway - look how this works: > > stamp[0] > post > poll > stamp[1] > post > poll > stamp[2] > .... > > so stamp[i] is taken before the i'th posting > and stamp[i+1] is after the i'th completion. Oops, I have just noticed that read_lat.c practically ignores the tx_depth param... so stamp[i+1]-stamp[i] is indeed the wall time of the i'th operation. Anyway, I guess you would be open to a patch that exercises tx_depth in a similar fashion to read_bw.c? Or. From michael.arndt at informatik.tu-chemnitz.de Tue Nov 7 02:45:59 2006 From: michael.arndt at informatik.tu-chemnitz.de (Michael Arndt) Date: Tue, 7 Nov 2006 11:45:59 +0100 Subject: [openib-general] osm_state_mgr_process and osm_sm_state_mgr_process Message-ID: <000801c70259$e9be5ba0$21606d86@one7> Hi, is there an idea or concept behind why the state management is separated into these two functions?
I would really like if the answer isn't just yes or no.:) Thanks Michael From ogerlitz at voltaire.com Tue Nov 7 02:49:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 07 Nov 2006 12:49:19 +0200 Subject: [openib-general] questions and a comment on the perftest package In-Reply-To: <20061107103833.GA15610@mellanox.co.il> References: <45505769.8080302@voltaire.com> <20061107103833.GA15610@mellanox.co.il> Message-ID: <455064AF.8000303@voltaire.com> Michael S. Tsirkin wrote: > Quoting r. Or Gerlitz : >> oops, i have just noted that read_lat.c practically ignores the tx_depth >> param... so stamp[i+1]-stamp[i] is indeed the wall time of the i'th >> operation. Anyway, i guess you would be open to get a patch that does >> exercise tx_depth in a similar fashion to read_bw.c ? > Something like this has been on my todo list for a while. OK, anyway, we will not be able to work on this immediately, but this way or another we need a way to measure the latency per op per tx_depth and transfer size, so unless you would fix it before, we would need to do so... > However, isn't it the case that just giving tx depth = 1 to rdma_bw we get > all the necessary deltas? i was talking about read_lat.c and i want to get the correct delta for each possible value of tx_depth, not just for tx_depth=1 > So the right thing to do, IMO, is to take rdma_bw and teach it to report latency > as well. We thus will have a single test that measures both BW and latency for > reads, and have number of in-flight messages as parameter. With tx_depth = 1 > we'll get ping-pong. I don't see who does the -pong here, its a client doing rdma read from (rdma write to) a server. Your suggestion makes sense, we will look into this. Or. From mst at mellanox.co.il Tue Nov 7 02:38:33 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 7 Nov 2006 12:38:33 +0200 Subject: [openib-general] questions and a comment on the perftest package In-Reply-To: <45505769.8080302@voltaire.com> References: <45505769.8080302@voltaire.com> Message-ID: <20061107103833.GA15610@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: Re: questions and a comment on the perftest package > > Michael S. Tsirkin wrote: > > Quoting Or Gerlitz : > >> Whouldn't it make more sense to get one time stamp before the i'th posting > >> and one tstamp after the i'th completion is reaped from the cq? > > > > That's what we do, anyway - look how this works: > > > > stamp[0] > > post > > poll > > stamp[1] > > post > > poll > > stamp[2] > > .... > > > > so stamp[i] is taken before the i'th posting > > and stamp[i+1] is after the i'th completion. > > oops, i have just noted that read_lat.c practically ignores the tx_depth > param... so stamp[i+1]-stamp[i] is indeed the wall time of the i'th > operation. Anyway, i guess you would be open to get a patch that does > exercise tx_depth in a similar fashion to read_bw.c ? Something like this has been on my todo list for a while. However, isn't it the case that just giving tx depth = 1 to rdma_bw we get all the necessary deltas? So the right thing to do, IMO, is to take rdma_bw and teach it to report latency as well. We thus will have a single test that measures both BW and latency for reads, and have number of in-flight messages as parameter. With tx_depth = 1 we'll get ping-pong. -- MST From ianjiang.ict at gmail.com Tue Nov 7 03:35:30 2006 From: ianjiang.ict at gmail.com (Ian Jiang) Date: Tue, 7 Nov 2006 19:35:30 +0800 Subject: [openib-general] SRP Target Installation problem Message-ID: <7b2fa1820611070335m7cf76e4drc184a96d696208bf@mail.gmail.com> > > I searched for the SRP target source on the openib site, and found > it under the gen1 branch. Is it available under the gen2 branch ? not available > I managed to find the above missing files under the gen1 branch. 
I am > not quite sure about installing packages from gen1 and gen2 on the same > machine. Should I install IBGOLD instead of OFED ? I never installed packages of gen1 and gen2 on the same machine. I suggest you install IBGD-1.8.x if you want to use the SRP Target. -- Ian Jiang From halr at voltaire.com Tue Nov 7 03:44:59 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 06:44:59 -0500 Subject: [openib-general] [PATCH TRIVIAL] opensm: trivial indentation fixes In-Reply-To: <20061107003410.GA30470@sashak.voltaire.com> References: <20061107003410.GA30470@sashak.voltaire.com> Message-ID: <1162899878.25771.32579.camel@hal.voltaire.com> On Mon, 2006-11-06 at 19:34, Sasha Khapyorsky wrote: > Trivial indentation fixes in osm_inform.h and osm_service.h. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From mst at mellanox.co.il Tue Nov 7 03:47:06 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 7 Nov 2006 13:47:06 +0200 Subject: [openib-general] questions and a comment on the perftest package In-Reply-To: <455064AF.8000303@voltaire.com> References: <455064AF.8000303@voltaire.com> Message-ID: <20061107114706.GA17090@mellanox.co.il> > I don't see who does the -pong here, its a client doing rdma read from > (rdma write to) a server. Correct for rdma read. Rdma write is a bit more involved - although merging lat/bw tests there is also a good idea IMO. Basically, I think we'll need to do polling at the server side, and do a pong from server to client. Message sizes can be made configurable - this gives you duplex test as a case, too. 
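A minimal sketch of the latency measurement discussed earlier in this thread, not the perftest source: one timestamp is taken around each post/poll pair, consecutive stamps become per-iteration deltas, and the median is reported because a single freak sample moves the mean far more than it moves the median. The `cycles_t` typedef and the helpers `stamps_to_deltas`, `median_cycles` and `mean_cycles` are assumptions made for illustration; only the name `cycles_compare` appears in the quoted patch, and its body here is a guess.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long long cycles_t;   /* assumed cycle-counter type */

/* qsort comparator for cycle counts, ascending order. */
static int cycles_compare(const void *a, const void *b)
{
    cycles_t x = *(const cycles_t *)a;
    cycles_t y = *(const cycles_t *)b;
    return (x > y) - (x < y);
}

/* n timestamps give n - 1 deltas: delta[i] = stamp[i + 1] - stamp[i],
 * i.e. the wall time of the i'th post/poll pair. */
static void stamps_to_deltas(const cycles_t *stamp, int n, cycles_t *delta)
{
    assert(n >= 2);
    for (int i = 0; i < n - 1; ++i)
        delta[i] = stamp[i + 1] - stamp[i];
}

/* Median of n samples; sorts the array in place. */
static double median_cycles(cycles_t *v, int n)
{
    assert(n > 0);
    qsort(v, (size_t)n, sizeof *v, cycles_compare);
    if (n % 2)
        return (double)v[n / 2];
    return ((double)v[n / 2 - 1] + (double)v[n / 2]) / 2.0;
}

/* Arithmetic mean of n samples, for comparison with the median. */
static double mean_cycles(const cycles_t *v, int n)
{
    double sum = 0.0;
    assert(n > 0);
    for (int i = 0; i < n; ++i)
        sum += (double)v[i];
    return sum / n;
}
```

For example, the stamps {0, 10, 20, 31, 41, 1041} give the deltas {10, 10, 11, 10, 1000}: the mean is pulled up to about 208 cycles by the one outlier, while the median stays at 10.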
BTW, this will make it possible to move the BW measurement to the server and this has some advantages over the current measurements: - can detect packet loss for UC and UD - can actually measure peak bandwidth as we don't care about ack delay -- MST From halr at voltaire.com Tue Nov 7 03:58:23 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 06:58:23 -0500 Subject: [openib-general] [PATCH] opensm: permissions of db files directory In-Reply-To: <20061107005504.GC30470@sashak.voltaire.com> References: <20061107005504.GC30470@sashak.voltaire.com> Message-ID: <1162900664.25771.33097.camel@hal.voltaire.com> On Mon, 2006-11-06 at 19:55, Sasha Khapyorsky wrote: > When creating directory for db files (guid2lid) storing create it with > reasonable permissions (current 777 decimal = octal 01411) and don't do > it world writable. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From monis at voltaire.com Tue Nov 7 04:03:37 2006 From: monis at voltaire.com (Moni Shoua) Date: Tue, 07 Nov 2006 14:03:37 +0200 Subject: [openib-general] ipoib mtu problem with UDP In-Reply-To: <20061106163258.GC31647@mellanox.co.il> References: <20061106163258.GC31647@mellanox.co.il> Message-ID: <45507619.3080108@voltaire.com> Michael S. Tsirkin wrote: >I tried using ifconfig to limit the ipoib mtu. 
>Once I do this on *either* both server and client, or only on the client side, >UDP seems to stop working: > >#ifconfig ib0 mtu 512 >#netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM >UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 >(11.4.3.68) port 0 AF_INET : demo >Socket Message Elapsed Messages CPU Service >Size Size Time Okay Errors Throughput Util Demand >bytes bytes secs # # MBytes/sec % SS us/KB > >118784 65507 10.00 27582 0 172.2 26.33 inf >118784 10.00 0 0.0 23.40 inf > >Things work fine if the mtu on the client side is 2044: ># ifconfig ib0 mtu 2044 ># netperf -c -C -H 11.4.3.68 -f M -t UDP_STREAM >UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68) port 0 AF_INET : demo >Socket Message Elapsed Messages CPU Service >Size Size Time Okay Errors Throughput Util Demand >bytes bytes secs # # MBytes/sec % SS us/KB > >118784 65507 10.00 78488 0 490.1 25.31 2.310 >118784 10.00 68534 428.0 24.55 2.241 > >Tested with kernel 2.6.19-rc4 and netperf 2.4.2. > > > I get the same results with iperf. 
However they succeed with smaller datagrams (netperf uses 65507 by default) dodly5:/home/shared/testing-tools/x86_64/netperf/netperf-2.4.1 # ifconfig ib0 ib0 Link encap:UNSPEC HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.11.235 Bcast:192.168.11.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:512 Metric:1 RX packets:42 errors:0 dropped:0 overruns:0 frame:0 TX packets:14077513 errors:0 dropped:5 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:5776 (5.6 Kb) TX bytes:6717604780 (6406.4 Mb) dodly5:/home/shared/testing-tools/x86_64/netperf/netperf-2.4.1 # ./netperf -H 192.168.11.233 -t UDP_STREAM -- -m 30000 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.11.233 (192.168.11.233) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 262144 30000 10.00 52533 0 1260.59 262144 10.00 22956 550.86 dodly5:/home/shared/testing-tools/x86_64/iperf-2.0.2 # ./iperf -uc 192.168.11.233 -l 65000 ------------------------------------------------------------ Client connecting to 192.168.11.233, UDP port 5001 Sending 65000 byte datagrams UDP buffer size: 256 KByte (default) ------------------------------------------------------------ [ 3] local 192.168.11.235 port 32769 connected with 192.168.11.233 port 5001 [ 3] 0.0-10.9 sec 1.36 MBytes 1.05 Mbits/sec [ 3] Sent 22 datagrams [ 3] WARNING: did not receive ack of last datagram after 10 tries. 
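For scale, a rough sketch of how many IPv4 fragments each datagram size in these runs needs at a given interface MTU. This is plain arithmetic under stated assumptions (20-byte IPv4 header without options, 8-byte UDP header, fragment data in multiples of 8 bytes) and is not a diagnosis of the failure reported here; `udp_fragments` is a hypothetical helper, not part of netperf or iperf.

```c
#include <assert.h>

/* Number of IPv4 packets needed to carry one UDP datagram with the given
 * payload size over a link with the given IP MTU (no IP options assumed). */
static int udp_fragments(int udp_payload, int mtu)
{
    const int ip_hdr = 20;   /* IPv4 header without options */
    const int udp_hdr = 8;   /* UDP header */
    assert(udp_payload >= 0 && mtu >= ip_hdr + 8);

    int total = udp_payload + udp_hdr;        /* bytes the IP layer must carry */
    if (total + ip_hdr <= mtu)
        return 1;                             /* fits in one packet */

    /* Every fragment except the last carries a multiple of 8 data bytes. */
    int per_frag = ((mtu - ip_hdr) / 8) * 8;
    return (total + per_frag - 1) / per_frag; /* round up */
}
```

Under these assumptions the netperf default of 65507 bytes needs 135 fragments at MTU 512 but 33 at MTU 2044, and a 1470-byte datagram needs 4 fragments at MTU 512 and none at MTU 2044. Losing any one fragment loses the whole datagram, so large datagrams are far more exposed at the smaller MTU.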
dodly5:/home/shared/testing-tools/x86_64/iperf-2.0.2 # ./iperf -uc 192.168.11.233 ------------------------------------------------------------ Client connecting to 192.168.11.233, UDP port 5001 Sending 1470 byte datagrams UDP buffer size: 256 KByte (default) ------------------------------------------------------------ [ 3] local 192.168.11.235 port 32769 connected with 192.168.11.233 port 5001 [ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec [ 3] Sent 893 datagrams [ 3] Server Report: [ 3] 0.0-10.0 sec 1.25 MBytes 1.05 Mbits/sec 0.002 ms 0/ 893 (0%) From smomohana at hotmail.com Tue Nov 7 04:19:11 2006 From: smomohana at hotmail.com (lee momo) Date: Tue, 07 Nov 2006 12:19:11 +0000 Subject: [openib-general] hi, I tried to install OFED-1.1 on SLES 10 , but failed.... Message-ID: Hi, I am using SLED 10 Linux, and I tried to install OFED-1.1. First, I ran ----------------------------------------------------------------- #]./install.sh ----------------------------------------------------------------- and saw the following output ----------------------------------------------------------------- Checking dependencies. Please wait.... There are no sources for 2.6.16.21_0.8_smp kernel installed. Please installed 2.6.16.21_0.8_smp sources to build RPM in this system. ----------------------------------------------------------------- So I ran the following ----------------------------------------------------------------- #]rpm -qa | fgrep kernel #]kernel-smp-2.6.16.21_0.8 #]kernel-xen-2.6.16.21_0.8 ----------------------------------------------------------------- What should I do? Do I have to compile the kernel? Please let me know how to install OFED-1.1.
From eitan at mellanox.co.il Tue Nov 7 04:36:26 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Tue, 07 Nov 2006 14:36:26 +0200 Subject: [openib-general] osm_state_mgr_process and osm_sm_state_mgr_process In-Reply-To: <000801c70259$e9be5ba0$21606d86@one7> References: <000801c70259$e9be5ba0$21606d86@one7> Message-ID: <45507DCA.6010207@mellanox.co.il> Michael Arndt wrote: > Hi, > > is there an idea or concept why the state managment is seperate in these two > functions. > I would really like if the answer isn't just yes or no.:) > The osm_state_mgr_process file describes the OpenSM state machine. The latter, osm_sm_state_mgr_process, defines the SM state machine following the IBTA SM states and their transitions. The first one is far more complex and defines the stages required to discover and configure the fabric. The second one is used to track the SM state in terms that are IBTA compliant and standard. They also map directly to the SM states available in the SMInfo attribute. > Thanks Michael > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vlad at dev.mellanox.co.il Tue Nov 7 04:50:50 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 07 Nov 2006 14:50:50 +0200 Subject: [openib-general] hi, I tried to install OFED-1.1 on SLES 10 , but failed.... In-Reply-To: References: Message-ID: <4550812A.108@dev.mellanox.co.il> Hi, You need to install kernel-source-2.6.16.21-0.8 RPM. Regards, Vladimir lee momo wrote: > hi, > I am using SLED 10 LINUX, and I tried to install OFED-1.1.. > > first.
I inputed > > ----------------------------------------------------------------- > #]./install.sh > ----------------------------------------------------------------- > > so, I could see screen like below > > ----------------------------------------------------------------- > Checking dependencies. Please wait.... > > There are no sources for 2.6.16.21_0.8_smp kernel installed. > Please installed 2.6.16.21_0.8_smp sources to build RPM in this system. > ----------------------------------------------------------------- > > So, I inputed these commands... > > ----------------------------------------------------------------- > #]rpm -qa | fgrep kernel > #]kernel-smp-2.6.16.21_0.8 > #]kernel-xen-2.6.16.21_0.8 > ----------------------------------------------------------------- > > How I would do????? > > I have to do kernel compile ? > > please let me know how to do for install OFED-1.1....
From bugzilla-daemon at openib.org Tue Nov 7 05:00:08 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Tue, 7 Nov 2006 05:00:08 -0800 (PST) Subject: [openib-general] [Bug 289] New: executing ucmatose on local IPoIB address of IB port 2 in kernel 2.6.16.21-0.8-smp fails Message-ID: <20061107130008.A11AD2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=289 Summary: executing ucmatose on local IPoIB address of IB port 2 in kernel 2.6.16.21-0.8-smp fails Product: OpenFabrics Linux Version: 1.1 Platform: All OS/Version: SLES 10 Status: NEW Severity: normal Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: dotanb at mellanox.co.il ************************************************************* Host Architecture : x86_64 Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 Kernel Version : 2.6.16.21-0.8-smp GCC Version : gcc (GCC) 4.1.0 (SUSE Linux) Memory size : 4047700 kB Driver Version : OFED-1.1-rc7-testbuild HCA ID(s) : mthca0 HCA model(s) : 25208 FW version(s) : 4.7.600 Board(s) : MT_00A0010001 ************************************************************* I have 2 machines connected port 1<-->port 1 and port2 <--> port 2 and i tried the following scenarios (only on one of the machine): scenario 1: passes SM was executed on port 1 i executed ucmatose server and ucmatose client with IPoIB IP address of port 1 scenario 2: fails SM was executed on port 2 i executed ucmatose server and ucmatose client with IPoIB IP address of port 2 here is the output of the client: ucmatose: starting client ucmatose: connecting ucmatose: event: 3, error: 0 receiving data transfers sending replies data
transfers complete test complete return status 0 scenario 3: passes SM was executed on port 1 and on port 2 (i have 2 SMs, one on each port) (i executed ucmatose server and ucmatose client with IPoIB IP address of port 1) i executed ucmatose server and ucmatose client with IPoIB IP address of port 2 It seems that when using the IPoIB IP address of port 2 in the client side and there is an SM only on port 2 the test fails but if i add an SM on port 1 the test passes. this test fails ONLY on this machine (we have more machines that works on port 2). commands: ucmatose ucmatose 12.4.3.86 this issue was solved by eli and it is documented in: https://openib.org/tiki/tiki-index.php?page=OFED+Support in the known issues. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From dotanb at dev.mellanox.co.il Tue Nov 7 04:56:45 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Tue, 07 Nov 2006 14:56:45 +0200 Subject: [openib-general] [ucma] executing the ucmatose with local IPoIB IP address of port 2 fails In-Reply-To: <45352510.7020701@ichips.intel.com> References: <4534AE9B.4030907@dev.mellanox.co.il> <45352228.5050208@ichips.intel.com> <45352510.7020701@ichips.intel.com> Message-ID: <4550828D.6060409@dev.mellanox.co.il> Sean Hefty wrote: > Sean Hefty wrote: >> This is a ROUTE_ERROR (path record query fails). Are the IP >> addresses on different subnets? Are you having ucmatose bind to the >> port 2 ip address. > > Another thing to check is what port ucmatose binds to after calling > rdma_resolve_addr(). > > - Sean this issue is recorded in bug 289. 
thanks Dotan From tziporet at dev.mellanox.co.il Tue Nov 7 05:04:06 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 07 Nov 2006 15:04:06 +0200 Subject: [openib-general] SRP Target Installation problem In-Reply-To: <7b2fa1820611070335m7cf76e4drc184a96d696208bf@mail.gmail.com> References: <7b2fa1820611070335m7cf76e4drc184a96d696208bf@mail.gmail.com> Message-ID: <45508446.4040001@dev.mellanox.co.il> Ian Jiang wrote: >> I managed to find the above missing files under the gen1 branch. I am >> not quite sure about installing packages from gen1 and gen2 on the same >> machine. Should I install IBGOLD instead of OFED ? >> > > I never installed packages of gen1 and gen2 on the same machine. I > suggest you install IBGD-1.8.x if you want to use the SRP Target. > > Please uninstall OFED (the uninstall.sh script is available under ) and then install IBGD. This is important since the IBGD install is not aware of OFED, and having both on the same machine is not good. Tziporet From halr at voltaire.com Tue Nov 7 05:50:30 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 08:50:30 -0500 Subject: [openib-general] [PATCH] osm: comparing InformInfo records In-Reply-To: <454EDD97.5060000@dev.mellanox.co.il> References: <454EDD97.5060000@dev.mellanox.co.il> Message-ID: <1162907368.25771.37532.camel@hal.voltaire.com> Hi Yevgeny, On Mon, 2006-11-06 at 02:00, Yevgeny Kliteynik wrote: > Hi Hal > > [From Vu Pham] > 1. sending InformInfo set subscribe for trap 64,65,144 - this works; > however, osm.log outputs wrong value for "subscribe" field What code issues these subscriptions ? How was this patch tested ? > 2. sending InformInfo set *unsubscribe* for > trap 64,65,144 - I'm using/formating the same mad as (1) except the > "subscribe" field is zero; however, opensm response with status 0x200 > [/From Vu Pham] > > 1. The received InformInfo struct was modified before dumping it. > This was fixed as part of the second issue. > 2.
The function that compares InformInfo structures was just comparing > the whole memory allocated for it, including reserved fields. > Fixed to compare more selectively. > > Yevgeny > > Signed-off-by: Yevgeny Kliteynik > > Index: opensm/osm_sa_informinfo.c > =================================================================== > --- opensm/osm_sa_informinfo.c (revision 10064) > +++ opensm/osm_sa_informinfo.c (working copy) > @@ -345,7 +345,6 @@ osm_infr_rcv_process_set_method( > ib_inform_info_t *p_recvd_inform_info; > osm_infr_t inform_info_rec; /* actual inform record to be stored for reports */ > osm_infr_t *p_infr; > - uint8_t subscribe; > ib_net32_t qpn; > uint8_t resp_time_val; > ib_api_status_t res; > @@ -403,19 +402,11 @@ osm_infr_rcv_process_set_method( > * > * QPN: > * internally we keep the QPN field of the InformInfo updated > - * so we can simply compare the entire record - when finding such. > - * IBA spec only requires the QPN field to be filled when an unsubscribe > - * Set(InformInfo) is done. See table 119 p 740 QPN field > - * > - * SUBSCRIBE: > - * For similar reasons we change the subscribe to 0 on the > - * inserted/searched data > + * so we can simply compare it in the record - when finding such. 
> */ > > - subscribe = p_recvd_inform_info->subscribe; > - if (subscribe) > + if (p_recvd_inform_info->subscribe) > { > - inform_info_rec.inform_record.inform_info.subscribe = 0; > ib_inform_info_set_qpn( > &inform_info_rec.inform_record.inform_info, > inform_info_rec.report_addr.addr_type.gsi.remote_qp ); > @@ -443,7 +434,7 @@ osm_infr_rcv_process_set_method( > p_infr = osm_infr_get_by_rec( p_rcv->p_subn, p_rcv->p_log, &inform_info_rec ); > > /* check to see if the request was for subscribe = 1 */ > - if (subscribe) > + if (p_recvd_inform_info->subscribe) > { > /* validate the request for a new or update InformInfo */ > if (__validate_infr( p_rcv, &inform_info_rec ) != TRUE) > @@ -480,6 +471,8 @@ osm_infr_rcv_process_set_method( > goto Exit; > } > > + /* set the subscribe bit to 0 before adding the record */ > + p_infr->inform_record.inform_info.subscribe = 0; It seems odd to me to set subscribe to 0 for a subscription (rather than when it is an unsubscription). Aren't only subscriptions kept in the database ? Is this an artifact of the matching code ? If so, why not change that ? > /* Add this new osm_infr_t object to subnet object */ > osm_infr_insert_to_db( p_rcv->p_subn, p_rcv->p_log, p_infr ); > > @@ -488,6 +481,8 @@ osm_infr_rcv_process_set_method( > { > /* Update the old instance of the osm_infr_t object */ > p_infr->inform_record = inform_info_rec.inform_record; > + /* set the subscribe bit to 0 after updating the record */ > + p_infr->inform_record.inform_info.subscribe = 0; Same as previous comment.
> } > } > else > Index: opensm/osm_inform.c > =================================================================== > --- opensm/osm_inform.c (revision 10064) > +++ opensm/osm_inform.c (working copy) > @@ -206,30 +206,133 @@ __match_inf_rec( > osm_infr_t* p_infr_rec = (osm_infr_t *)context; > osm_infr_t* p_infr = (osm_infr_t*)p_list_item; > osm_log_t *p_log = p_infr_rec->p_infr_rcv->p_log; > - cl_status_t status; > - int32_t count1, count2; > + cl_status_t status = CL_NOT_FOUND; > + ib_gid_t all_zero_gid; > + > > OSM_LOG_ENTER( p_log, __match_inf_rec); > > - count1 = memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, > - sizeof(p_infr_rec->report_addr)); > - if (count1) > - osm_log( p_log, OSM_LOG_DEBUG, > - "__match_inf_rec: " > - "Differ by Address\n" ); > - count2 = memcmp( > - &p_infr->inform_record.inform_info, > - &p_infr_rec->inform_record.inform_info, > - sizeof(p_infr->inform_record.inform_info) ); > - if (count2) > - osm_log( p_log, OSM_LOG_DEBUG, > - "__match_inf_rec: " > - "Differ by InformInfo\n" ); > - if ((count1 == 0) && (count2 == 0)) > - status = CL_SUCCESS; > + if ( !memcmp(&p_infr->report_addr, > + &p_infr_rec->report_addr, > + sizeof(p_infr_rec->report_addr)) ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by Address\n" ); > + goto Exit; > + } > + > + memset(&all_zero_gid, 0, sizeof(ib_gid_t)); > + > + /* if inform_info.gid is not zero, ignoring lid range */ > + if ( !memcmp(&p_infr_rec->inform_record.inform_info.gid, > + &all_zero_gid, > + sizeof(p_infr_rec->inform_record.inform_info.gid)) ) > + { > + if ( !memcmp(&p_infr->inform_record.inform_info.gid, > + &p_infr_rec->inform_record.inform_info.gid, > + sizeof(p_infr->inform_record.inform_info.gid)) ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.gid\n" ); > + goto Exit; > + } > + } > else > - status = CL_NOT_FOUND; > + { > + if ( (p_infr->inform_record.inform_info.lid_range_begin != > + 
p_infr_rec->inform_record.inform_info.lid_range_begin) || > + (p_infr->inform_record.inform_info.lid_range_end != > + p_infr_rec->inform_record.inform_info.lid_range_end) ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.LIDRange\n" ); > + goto Exit; > + } > + } > + > + if ( p_infr->inform_record.inform_info.is_generic != > + p_infr_rec->inform_record.inform_info.is_generic ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.IsGeneric\n" ); > + goto Exit; > + } > > + if ( p_infr->inform_record.inform_info.trap_type != > + p_infr_rec->inform_record.inform_info.trap_type ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.TrapType\n" ); > + goto Exit; > + } > + > + if ( p_infr->inform_record.inform_info.is_generic != > + p_infr_rec->inform_record.inform_info.is_generic ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.IsGeneric\n" ); > + } This appears to be a duplicate of what was added shortly earlier in this patch. 
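A minimal sketch of the idea behind this patch, assuming a simplified record layout: comparing two records with a single `memcmp` over the whole structure also compares reserved bytes, so two records that agree on every meaningful field can still fail to match, whereas a field-by-field comparison ignores them. The `struct inform_rec` type and the helpers `same_bytes` and `same_fields` are hypothetical and are not OpenSM code. One related point: `memcmp` returns 0 when the regions are equal, so the `!memcmp(...)` tests in the patch as quoted appear to take the "Differ by ..." branch when the bytes match, which looks inverted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an InformInfo-like record. */
struct inform_rec {
    uint16_t lid_begin;
    uint16_t lid_end;
    uint8_t  is_generic;
    uint8_t  reserved;    /* carries no meaning; a sender may put anything here */
    uint16_t trap_num;
};

/* Whole-structure comparison: the reserved byte takes part in the result. */
static int same_bytes(const struct inform_rec *a, const struct inform_rec *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a, b, sizeof *a) == 0;   /* 0 from memcmp means "equal" */
}

/* Selective comparison: only the meaningful fields are compared. */
static int same_fields(const struct inform_rec *a, const struct inform_rec *b)
{
    assert(a != NULL && b != NULL);
    return a->lid_begin == b->lid_begin &&
           a->lid_end == b->lid_end &&
           a->is_generic == b->is_generic &&
           a->trap_num == b->trap_num;
}
```

Two records that differ only in `reserved` are reported as different by `same_bytes` but as matching by `same_fields`, which is the behaviour an unsubscribe lookup would want.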
-- Hal > + else if (p_infr->inform_record.inform_info.is_generic) > + { > + if ( p_infr->inform_record.inform_info.g_or_v.generic.trap_num != > + p_infr_rec->inform_record.inform_info.g_or_v.generic.trap_num ) > + { > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Generic.TrapNumber\n" ); > + goto Exit; > + } > + else if ( p_infr->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val != > + p_infr_rec->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Generic.QPNRespTimeVal\n" ); > + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_msb != > + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_msb ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Generic.NodeTypeMSB\n" ); > + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_lsb != > + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_lsb ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Generic.NodeTypeLSB\n" ); > + else > + status = CL_SUCCESS; > + } > + else > + { > + if ( p_infr->inform_record.inform_info.g_or_v.vend.dev_id != > + p_infr_rec->inform_record.inform_info.g_or_v.vend.dev_id ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Vendor.DeviceID\n" ); > + else if ( p_infr->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val != > + p_infr_rec->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Vendor.QPNRespTimeVal\n" ); > + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_msb != > + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_msb ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Vendor.VendorIdMSB\n" ); > + else if ( 
p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_lsb != > + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_lsb ) > + osm_log( p_log, OSM_LOG_DEBUG, > + "__match_inf_rec: " > + "Differ by InformInfo.Vendor.VendorIdLSB\n" ); > + else > + status = CL_SUCCESS; > + } > + > + Exit: > OSM_LOG_EXIT( p_log ); > return status; > } > From jackm at dev.mellanox.co.il Tue Nov 7 06:03:16 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 7 Nov 2006 16:03:16 +0200 Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive In-Reply-To: References: Message-ID: <200611071603.17841.jackm@dev.mellanox.co.il> On Tuesday 07 November 2006 06:33, Roland Dreier wrote: > I don't believe we should generate receive callbacks for canceled > sends, so I came up with the patch below (much simpler than the > explanation that led up to it). I am no longer able to reproduce the > IPoIB crash with this applied so I feel pretty good about this. > ... > > diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c > index 493f4c6..363db08 100644 > --- a/drivers/infiniband/core/mad.c > +++ b/drivers/infiniband/core/mad.c > @@ -1804,7 +1804,7 @@ static void ib_mad_complete_recv(struct > if (ib_response_mad(mad_recv_wc->recv_buf.mad)) { > spin_lock_irqsave(&mad_agent_priv->lock, flags); > mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc); > - if (!mad_send_wr) { > + if (!mad_send_wr || mad_send_wr->status != IB_WC_SUCCESS) { > spin_unlock_irqrestore(&mad_agent_priv->lock, flags); > ib_free_recv_mad(mad_recv_wc); > deref_mad_agent(mad_agent_priv); > I think you're correct architecturally regarding generating receive callbacks for cancelled sends. You need to check, though, that the above change does not result in memory leaks or broken logic. For example, in ipoib_main.c:ipoib_flush_paths(), If there is an outstanding query, ib_sa_cancel_query gets called. 
The code then goes on to wait_for_completion() anyway (assuming that even a cancelled query will result in a callback). - Jack From halr at voltaire.com Tue Nov 7 06:02:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 09:02:51 -0500 Subject: [openib-general] osm_state_mgr_process and osm_sm_state_mgr_process In-Reply-To: <45507DCA.6010207@mellanox.co.il> References: <000801c70259$e9be5ba0$21606d86@one7> <45507DCA.6010207@mellanox.co.il> Message-ID: <1162908170.25771.38050.camel@hal.voltaire.com> On Tue, 2006-11-07 at 07:36, Eitan Zahavi wrote: > Michael Arndt wrote: > > Hi, > > > > is there an idea or concept why the state managment is seperate in these two > > functions. > > I would really like if the answer isn't just yes or no.:) > > This predates my involvement with OpenSM but I have the following comments: > The osm_state_mgr_process file is describing the OpenSM state machine. This state machine handles the lower level aspects of the SM state machine as detailed in IBA 1.2 vol 1 section 14.4. This contains behavior "beyond the spec" in that much of SM behavior is an exercise left to the reader. > The osm_sm_state_mgr_process latter defines the SM state machine > following the IBTA SM states and their transitions. This machine handles the high level aspects of the SM state machine in 14.4.1 and can be driven by the SMInfo attribute. > The first one is far more complex and defines the stages required to > discover and configure the fabric. > The second one is used to track the SM state in terms that are IBTA > compliant and standard. They are also map directly to SM states > available in the SMInfo attribute. As to why they are separate state machines (and the implied question of whether they could/should be combined), I'm not sure. 
-- Hal > > Thanks Michael From halr at voltaire.com Tue Nov 7 06:09:20 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 09:09:20 -0500 Subject: [openib-general] [PATCH TRIVIAL] opensm: osm_service: remove unused svc_id field In-Reply-To: <20061107003521.GB30470@sashak.voltaire.com> References: <20061107003521.GB30470@sashak.voltaire.com> Message-ID: <1162908532.25771.38265.camel@hal.voltaire.com> On Mon, 2006-11-06 at 19:35, Sasha Khapyorsky wrote: > This removes unused (but confused) svc_id field from osm_svcr structure. > > Signed-off-by: Sasha Khapyorsky Thanks. Applied. -- Hal From ogerlitz at voltaire.com Tue Nov 7 06:27:08 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 07 Nov 2006 16:27:08 +0200 Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support In-Reply-To: <20061107085011.GB29565@mellanox.co.il> References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <200611061225.19077.jackm@dev.mellanox.co.il> <454F2D45.6060903@voltaire.com> <20061107085011.GB29565@mellanox.co.il> Message-ID: <455097BC.6020004@voltaire.com> Michael S. Tsirkin wrote: > Quoting Or Gerlitz : >> I am using this patch series on top of Roland git >> tree of few weeks ago (eg more or less as 2.6.19-rc3) and have not got >> this crash. > > Did you enable kernel debugging options? > The crash is much harder to reproduce without them. Don't think so, thanks for suggesting this direction.
Anyway, Sean - if you manage to reproduce this on your setup I will not bother to reproduce it here; let me know if you need more input on that.

Or.

From rdreier at cisco.com Tue Nov 7 07:25:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 07 Nov 2006 07:25:30 -0800
Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive
In-Reply-To: <200611071603.17841.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 7 Nov 2006 16:03:16 +0200")
References: <200611071603.17841.jackm@dev.mellanox.co.il>
Message-ID:

    Jack> I think you're correct architecturally regarding generating
    Jack> receive callbacks for cancelled sends. You need to check,
    Jack> though, that the above change does not result in memory
    Jack> leaks or broken logic.

    Jack> For example, in ipoib_main.c:ipoib_flush_paths(), If there
    Jack> is an outstanding query, ib_sa_cancel_query gets called.
    Jack> The code then goes on to wait_for_completion() anyway
    Jack> (assuming that even a cancelled query will result in a
    Jack> callback).

Yes, this is exactly what my patch tries to do: make sure that every SA query generates exactly one callback. The race I described is a case where the MAD layer might cause two callbacks for the same query, but I believe that with my change every query will generate one callback. Do you see a problem with it?

 - R.

From jackm at dev.mellanox.co.il Tue Nov 7 07:32:36 2006
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 7 Nov 2006 17:32:36 +0200
Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive
In-Reply-To:
References: <200611071603.17841.jackm@dev.mellanox.co.il>
Message-ID: <200611071732.37409.jackm@dev.mellanox.co.il>

On Tuesday 07 November 2006 17:25, Roland Dreier wrote:
>     Jack> I think you're correct architecturally regarding generating
>     Jack> receive callbacks for cancelled sends. You need to check,
>     Jack> though, that the above change does not result in memory
>     Jack> leaks or broken logic.
>
>     Jack> For example, in ipoib_main.c:ipoib_flush_paths(), If there
>     Jack> is an outstanding query, ib_sa_cancel_query gets called.
>     Jack> The code then goes on to wait_for_completion() anyway
>     Jack> (assuming that even a cancelled query will result in a
>     Jack> callback).
>
> Yes, this is exactly what my patch tries to do: make sure that every
> SA query generates exactly one callback. The race I described is a
> case where the MAD layer might cause two callbacks for the same
> query, but I believe that with my change every query will generate one
> callback. Do you see a problem with it?
>
>  - R.
>

On a second examination (prompted by Michael Tsirkin), I see that your patch is fine. It only prevents (in the receive completion handler) generating a spurious callback for cancelled but as yet unprocessed queries.

- Jack

From sashak at voltaire.com Tue Nov 7 08:34:47 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 7 Nov 2006 18:34:47 +0200
Subject: [openib-general] osm_state_mgr_process and osm_sm_state_mgr_process
In-Reply-To: <1162908170.25771.38050.camel@hal.voltaire.com>
References: <000801c70259$e9be5ba0$21606d86@one7> <45507DCA.6010207@mellanox.co.il> <1162908170.25771.38050.camel@hal.voltaire.com>
Message-ID: <20061107163447.GA32655@sashak.voltaire.com>

On 09:02 Tue 07 Nov , Hal Rosenstock wrote:
> On Tue, 2006-11-07 at 07:36, Eitan Zahavi wrote:
> > Michael Arndt wrote:
> > > Hi,
> > >
> > > is there an idea or concept why the state management is separate in these two
> > > functions.
> > > I would really like if the answer isn't just yes or no.:)
> > >
>
> This predates my involvement with OpenSM but I have the following
> comments:
>
> > The osm_state_mgr_process file is describing the OpenSM state machine.
>
> This state machine handles the lower level aspects of the SM state
> machine as detailed in IBA 1.2 vol 1 section 14.4. This contains
> behavior "beyond the spec" in that much of SM behavior is an exercise
> left to the reader.

I think we could also call this the "sweep" state machine, because it tracks the stages of an OpenSM sweep (fabric discovery, configuration, setup).

> > The osm_sm_state_mgr_process latter defines the SM state machine
> > following the IBTA SM states and their transitions.
>
> This machine handles the high level aspects of the SM state machine in
> 14.4.1 and can be driven by the SMInfo attribute.
>
> > The first one is far more complex and defines the stages required to
> > discover and configure the fabric.
> > The second one is used to track the SM state in terms that are IBTA
> > compliant and standard. They also map directly to SM states
> > available in the SMInfo attribute.
>
> As to why they are separate state machines (and the implied question of
> whether they could/should be combined), I'm not sure.

IMO they should be combined, or more precisely, we don't need the "sweep" state processor at all. It is known to be complex and buggy: the log still shows messages like "invalid signal X in state Y". The subnet sweep process has pretty linear logic (just look at the repeated runs of states like OSM_STATE_PHASEX, OSM_STATE_PHASEX_WAIT, OSM_STATE_PHASEX_DONE in the existing code), so I think the implementation should follow this logic. At a high level this could be something like a "main loop" flow:

        while(1) {
                do_idle_works(timeout); /* breaks upon events or due to timeout */
                ...
                do_sweep();
                ...
        }

This should be much simpler and more robust than the existing implementation. As another positive side effect, it helps to remove the "big state lock" (which can block all dispatchers) and to improve scalability.
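
A rough, self-contained sketch of that "main loop" shape in C might look like the following. It is an illustration under stated assumptions only: struct sweep_ctx, the two stage functions, and the countdown that stands in for a real event wait are all invented here and do not come from the OpenSM sources. Only do_idle_works() and do_sweep() echo the names used in the pseudo-code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context; a real one would carry the subnet object. */
struct sweep_ctx {
    int wakeups_left;     /* demo only: how many times the idle wait wakes up */
    int discovered;       /* number of completed discovery stages */
    int configured;       /* number of completed configuration stages */
    bool fail_discovery;  /* demo only: force the first stage to fail */
};

/* A sweep stage returns true on success, false to abort this sweep. */
typedef bool (*sweep_stage_fn)(struct sweep_ctx *ctx);

static bool stage_discover(struct sweep_ctx *ctx)
{
    if (ctx->fail_discovery)
        return false;
    ctx->discovered++;
    return true;
}

static bool stage_configure(struct sweep_ctx *ctx)
{
    ctx->configured++;
    return true;
}

/* Stages run strictly in array order; there is no signal/state table. */
static const sweep_stage_fn sweep_stages[] = {
    stage_discover,
    stage_configure,
};

/* Run one sweep; returns the number of stages that completed. */
static size_t do_sweep(struct sweep_ctx *ctx)
{
    size_t n = sizeof(sweep_stages) / sizeof(sweep_stages[0]);
    size_t done = 0;

    while (done < n && sweep_stages[done](ctx))
        done++;
    return done;
}

/*
 * Stand-in for the idle wait. A real version would block until an
 * event arrives or the timeout expires; this one only counts down.
 * Returns false when the loop should exit.
 */
static bool do_idle_works(struct sweep_ctx *ctx, unsigned timeout_ms)
{
    (void)timeout_ms;
    return ctx->wakeups_left-- > 0;
}

/* Main loop; returns the number of sweeps that were started. */
static unsigned sm_main_loop(struct sweep_ctx *ctx, unsigned timeout_ms)
{
    unsigned sweeps = 0;

    while (do_idle_works(ctx, timeout_ms)) {
        do_sweep(ctx);
        sweeps++;
    }
    return sweeps;
}
```

Because the stages are just an ordered array, a sweep that aborts part-way simply falls back to the idle wait and is retried on the next wakeup.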
Sasha

From adit.262 at gmail.com Tue Nov 7 09:06:01 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Tue, 7 Nov 2006 12:06:01 -0500
Subject: [openib-general] Error starting IB HCA while Linux Boot
In-Reply-To:
References:
Message-ID:

This is the first one I tried. During a couple of the boot-ups it did give an error saying you may need to replace your RAM modules. Is it possible that due to bad RAM chips the device could fail to start up?

Thanks,
Adit

On 11/6/06, Roland Dreier wrote:
>
> > ib_mthca 0000:02:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
>
> The HCA is reporting a memory calibration error. Most likely this is
> a physical hardware problem somewhere in your system.
>
> Is it just one system failing like this, or do you have multiple
> HCAs/systems with the same error message?
>
> - R.
>

--
Adit Ranadive
Freshman, Georgia Institute of Technology,
Atlanta, GA

From mshefty at ichips.intel.com Tue Nov 7 09:08:24 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 07 Nov 2006 09:08:24 -0800
Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive
In-Reply-To:
References:
Message-ID: <4550BD88.9020400@ichips.intel.com>

Roland Dreier wrote:
> I don't believe we should generate receive callbacks for canceled
> sends, so I came up with the patch below (much simpler than the
> explanation that led up to it). I am no longer able to reproduce the
> IPoIB crash with this applied so I feel pretty good about this.

I agree that this should be the case. If you look in ib_find_send_mad(), it checks that the wr->status for the send is still IB_WC_SUCCESS, but only in one of the two return paths. I think that we either want to fix the problem in ib_find_send_mad() or remove the check for status there.

struct ib_mad_send_wr_private*
ib_find_send_mad(struct ib_mad_agent_private *mad_agent_priv,
                 struct ib_mad_recv_wc *wc)
{
        struct ib_mad_send_wr_private *wr;
        struct ib_mad *mad;

        mad = (struct ib_mad *)wc->recv_buf.mad;

        list_for_each_entry(wr, &mad_agent_priv->wait_list, agent_list) {
                if ((wr->tid == mad->mad_hdr.tid) &&
                    rcv_has_same_class(wr, wc) &&
                    /*
                     * Don't check GID for direct routed MADs.
                     * These might have permissive LIDs.
                     */
                    (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) ||
                     rcv_has_same_gid(mad_agent_priv, wr, wc)))
                        return wr;
*** Missing check for status == SUCCESS
        }

        /*
         * It's possible to receive the response before we've
         * been notified that the send has completed
         */
        list_for_each_entry(wr, &mad_agent_priv->send_list, agent_list) {
                if (is_data_mad(mad_agent_priv, wr->send_buf.mad) &&
                    wr->tid == mad->mad_hdr.tid &&
                    wr->timeout &&
                    rcv_has_same_class(wr, wc) &&
                    /*
                     * Don't check GID for direct routed MADs.
                     * These might have permissive LIDs.
                     */
                    (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) ||
                     rcv_has_same_gid(mad_agent_priv, wr, wc)))
                        /* Verify request has not been canceled */
                        return (wr->status == IB_WC_SUCCESS) ? wr : NULL;
*** Has check against canceled MADs
        }
        return NULL;
}

- Sean

From mshefty at ichips.intel.com Tue Nov 7 09:10:25 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 07 Nov 2006 09:10:25 -0800
Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support
In-Reply-To: <200611071053.50816.jackm@dev.mellanox.co.il>
References: <000001c6f7bb$5c89fd50$a6d4180a@amr.corp.intel.com> <454F2D45.6060903@voltaire.com> <200611061616.30434.jackm@dev.mellanox.co.il> <200611071053.50816.jackm@dev.mellanox.co.il>
Message-ID: <4550BE01.7050108@ichips.intel.com>

Jack Morgenstein wrote:
> Another kernel Oops in the multicast module (core/multicast.c),
> again in procedure release_group, this time running under
> SLES10.0 (2.6.16.21-0.8-smp).
This occurred during multicast regression > testing: Can you apply Roland's mad patch and see if this still occurs? I likely won't get to this for a couple of days. - Sean From halr at voltaire.com Tue Nov 7 09:56:13 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 12:56:13 -0500 Subject: [openib-general] [PATCH] OpenSM/osm_node_info_rcv.c: Consolidate IB router handling with CA handling Message-ID: <1162922154.25771.47264.camel@hal.voltaire.com> OpenSM/osm_node_info_rcv.c: Consolidate IB router handling with CA handling From: Rolf Manderscheid Signed-off-by: Hal Rosenstock Index: opensm/osm_node_info_rcv.c =================================================================== --- opensm/osm_node_info_rcv.c (revision 10056) +++ opensm/osm_node_info_rcv.c (working copy) @@ -408,12 +408,12 @@ __osm_ni_rcv_get_node_desc( The plock must be held before calling this function. **********************************************************************/ static void -__osm_ni_rcv_process_new_ca( +__osm_ni_rcv_process_new_ca_or_router( IN const osm_ni_rcv_t* const p_rcv, IN osm_node_t* const p_node, IN const osm_madw_t* const p_madw ) { - OSM_LOG_ENTER( p_rcv->p_log, __osm_ni_rcv_process_new_ca ); + OSM_LOG_ENTER( p_rcv->p_log, __osm_ni_rcv_process_new_ca_or_router ); __osm_ni_rcv_process_new_node( p_rcv, p_node, p_madw ); @@ -434,7 +434,7 @@ __osm_ni_rcv_process_new_ca( The plock must be held before calling this function. 
**********************************************************************/ static void -__osm_ni_rcv_process_existing_ca( +__osm_ni_rcv_process_existing_ca_or_router( IN const osm_ni_rcv_t* const p_rcv, IN osm_node_t* const p_node, IN const osm_madw_t* const p_madw ) @@ -452,7 +452,7 @@ __osm_ni_rcv_process_existing_ca( osm_bind_handle_t h_bind; cl_status_t cl_status; - OSM_LOG_ENTER( p_rcv->p_log, __osm_ni_rcv_process_existing_ca ); + OSM_LOG_ENTER( p_rcv->p_log, __osm_ni_rcv_process_existing_ca_or_router ); p_smp = osm_madw_get_smp_ptr( p_madw ); p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp ); @@ -470,7 +470,7 @@ __osm_ni_rcv_process_existing_ca( if( p_port == (osm_port_t*)cl_qmap_end( p_guid_tbl ) ) { osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, - "__osm_ni_rcv_process_existing_ca: " + "__osm_ni_rcv_process_existing_ca_or_router: " "Creating new port object with GUID = 0x%" PRIx64 "\n", cl_ntoh64( p_ni->port_guid ) ); @@ -480,7 +480,7 @@ __osm_ni_rcv_process_existing_ca( if( p_port == NULL ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_ca: ERR 0D04: " + "__osm_ni_rcv_process_existing_ca_or_router: ERR 0D04: " "Unable to create new port object\n" ); goto Exit; } @@ -497,7 +497,7 @@ __osm_ni_rcv_process_existing_ca( Somehow, this port GUID already exists in the table. 
*/ osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_ca: ERR 0D12: " + "__osm_ni_rcv_process_existing_ca_or_router: ERR 0D12: " "Port 0x%" PRIx64 " already in the database!\n", cl_ntoh64( p_ni->port_guid ) ); @@ -518,7 +518,7 @@ __osm_ni_rcv_process_existing_ca( if( cl_status != CL_SUCCESS ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_ca: ERR 0D08: " + "__osm_ni_rcv_process_existing_ca_or_router: ERR 0D08: " "Error %s adding to list\n", CL_STATUS_MSG( cl_status ) ); osm_port_delete( &p_port ); @@ -527,7 +527,7 @@ __osm_ni_rcv_process_existing_ca( else { osm_log( p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_ni_rcv_process_existing_ca: " + "__osm_ni_rcv_process_existing_ca_or_router: " "Adding port GUID:0x%016" PRIx64 " to new_ports_list\n", cl_ntoh64(osm_node_get_node_guid( p_port->p_node )) ); } @@ -544,7 +544,7 @@ __osm_ni_rcv_process_existing_ca( if ( !osm_physp_is_valid( p_physp ) ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_ca: ERR 0D19: " + "__osm_ni_rcv_process_existing_ca_or_router: ERR 0D19: " "Invalid physical port. Aborting discovery\n"); goto Exit; } @@ -576,189 +576,7 @@ __osm_ni_rcv_process_existing_ca( if( status != IB_SUCCESS ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_ca: ERR 0D13: " - "Failure initiating PortInfo request (%s)\n", - ib_get_err_str(status)); - } - - Exit: - OSM_LOG_EXIT( p_rcv->p_log ); -} - -/********************************************************************** - The plock must be held before calling this function. 
-**********************************************************************/ -static void -__osm_ni_rcv_process_new_router( - IN const osm_ni_rcv_t* const p_rcv, - IN osm_node_t* const p_node, - IN const osm_madw_t* const p_madw ) -{ - OSM_LOG_ENTER( p_rcv->p_log, __osm_ni_rcv_process_new_router ); - - __osm_ni_rcv_process_new_node( p_rcv, p_node, p_madw ); - - /* - A node guid of 0 is the corner case that indicates - we discovered our own node. Initialize the subnet - object with the SM's own port guid. - */ - if( osm_madw_get_ni_context_ptr( p_madw )->node_guid == 0 ) - { - p_rcv->p_subn->sm_port_guid = p_node->node_info.port_guid; - } - - OSM_LOG_EXIT( p_rcv->p_log ); -} - -/********************************************************************** - The plock must be held before calling this function. -**********************************************************************/ -static void -__osm_ni_rcv_process_existing_router( - IN const osm_ni_rcv_t* const p_rcv, - IN osm_node_t* const p_node, - IN const osm_madw_t* const p_madw ) -{ - ib_node_info_t *p_ni; - ib_smp_t *p_smp; - osm_port_t *p_port; - osm_port_t *p_port_check; - cl_qmap_t *p_guid_tbl; - osm_madw_context_t context; - uint8_t port_num; - osm_physp_t *p_physp; - ib_api_status_t status; - osm_dr_path_t *p_dr_path; - osm_bind_handle_t h_bind; - cl_status_t cl_status; - - OSM_LOG_ENTER( p_rcv->p_log, __osm_ni_rcv_process_existing_router ); - - p_smp = osm_madw_get_smp_ptr( p_madw ); - p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp ); - port_num = ib_node_info_get_local_port_num( p_ni ); - p_guid_tbl = &p_rcv->p_subn->port_guid_tbl; - h_bind = osm_madw_get_bind_handle( p_madw ); - - /* - Determine if we have encountered this node through a - previously undiscovered port. If so, build the new - port object. 
- */ - p_port = (osm_port_t*)cl_qmap_get( p_guid_tbl, p_ni->port_guid ); - - if( p_port == (osm_port_t*)cl_qmap_end( p_guid_tbl ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_VERBOSE, - "__osm_ni_rcv_process_existing_router: " - "Creating new port object with GUID = 0x%" PRIx64 "\n", - cl_ntoh64( p_ni->port_guid ) ); - - osm_node_init_physp( p_node, p_madw ); - - p_port = osm_port_new( p_ni, p_node ); - if( p_port == NULL ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_router: ERR 0D24: " - "Unable to create new port object\n" ); - goto Exit; - } - - /* - Add the new port object to the database. - */ - p_port_check = (osm_port_t*)cl_qmap_insert( p_guid_tbl, - p_ni->port_guid, &p_port->map_item ); - if( p_port_check != p_port ) - { - /* - We should never be here! - Somehow, this port GUID already exists in the table. - */ - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_router: ERR 0D22: " - "Port 0x%" PRIx64 " already in the database!\n", - cl_ntoh64( p_ni->port_guid ) ); - - osm_port_delete( &p_port ); - - goto Exit; - } - - /* If we are a master, then this means the port is new on the subnet. - Add it to the new_ports_list - need to send trap 64 on these ports. - The condition that we are master is true, since if we are in discovering - state (meaning we woke up from standby or we are just initializing), - then these ports may be new to us, but are not new on the subnet. - If we are master, then the subnet as we know it is the updated one, - and any new ports we encounter should cause trap 64. 
C14-72.1.1 */ - if ( p_rcv->p_subn->sm_state == IB_SMINFO_STATE_MASTER ) - { - cl_status = cl_list_insert_tail( &p_rcv->p_subn->new_ports_list, p_port ); - if( cl_status != CL_SUCCESS ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_router: ERR 0D28: " - "Error %s adding to list\n", - CL_STATUS_MSG( cl_status ) ); - osm_port_delete( &p_port ); - goto Exit; - } - else - { - osm_log( p_rcv->p_log, OSM_LOG_DEBUG, - "__osm_ni_rcv_process_existing_router: " - "Adding port GUID:0x%016" PRIx64 " to new_ports_list\n", - cl_ntoh64(osm_node_get_node_guid( p_port->p_node )) ); - } - } - - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - } - else - { - p_physp = osm_node_get_physp_ptr( p_node, port_num ); - - CL_ASSERT( p_physp ); - - if ( !osm_physp_is_valid( p_physp ) ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_router: ERR 0D29: " - "Invalid physical port. Aborting discovery\n"); - goto Exit; - } - - /* - Update the DR Path to the port, - in case the old one is no longer available. 
- */ - p_dr_path = osm_physp_get_dr_path_ptr( p_physp ); - - osm_dr_path_init( p_dr_path, h_bind, p_smp->hop_count, - p_smp->initial_path ); - } - - context.pi_context.node_guid = p_ni->node_guid; - context.pi_context.port_guid = p_ni->port_guid; - context.pi_context.set_method = FALSE; - context.pi_context.update_master_sm_base_lid = FALSE; - context.pi_context.ignore_errors = FALSE; - context.pi_context.light_sweep = FALSE; - - status = osm_req_get( p_rcv->p_gen_req, - osm_physp_get_dr_path_ptr( p_physp ), - IB_MAD_ATTR_PORT_INFO, - cl_hton32( port_num ), - CL_DISP_MSGID_NONE, - &context ); - - if( status != IB_SUCCESS ) - { - osm_log( p_rcv->p_log, OSM_LOG_ERROR, - "__osm_ni_rcv_process_existing_router: ERR 0D23: " + "__osm_ni_rcv_process_existing_ca_or_router: ERR 0D13: " "Failure initiating PortInfo request (%s)\n", ib_get_err_str(status)); } @@ -1037,14 +855,12 @@ __osm_ni_rcv_process_new( switch( p_ni->node_type ) { case IB_NODE_TYPE_CA: - __osm_ni_rcv_process_new_ca( p_rcv, p_node, p_madw ); + case IB_NODE_TYPE_ROUTER: + __osm_ni_rcv_process_new_ca_or_router( p_rcv, p_node, p_madw ); break; case IB_NODE_TYPE_SWITCH: __osm_ni_rcv_process_new_switch( p_rcv, p_node, p_madw ); break; - case IB_NODE_TYPE_ROUTER: - __osm_ni_rcv_process_new_router( p_rcv, p_node, p_madw ); - break; default: osm_log( p_rcv->p_log, OSM_LOG_ERROR, "__osm_ni_rcv_process_new: ERR 0D16: " @@ -1099,12 +915,9 @@ __osm_ni_rcv_process_existing( switch( p_ni->node_type ) { - case IB_NODE_TYPE_ROUTER: - __osm_ni_rcv_process_existing_router( p_rcv, p_node, p_madw ); - break; - case IB_NODE_TYPE_CA: - __osm_ni_rcv_process_existing_ca( p_rcv, p_node, p_madw ); + case IB_NODE_TYPE_ROUTER: + __osm_ni_rcv_process_existing_ca_or_router( p_rcv, p_node, p_madw ); break; case IB_NODE_TYPE_SWITCH: From sashak at voltaire.com Tue Nov 7 10:38:37 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 7 Nov 2006 20:38:37 +0200 Subject: [openib-general] [PATCH] OpenSM/osm_node_info_rcv.c: 
Consolidate IB router handling with CA handling
In-Reply-To: <1162922154.25771.47264.camel@hal.voltaire.com>
References: <1162922154.25771.47264.camel@hal.voltaire.com>
Message-ID: <20061107183837.GC1483@sashak.voltaire.com>

On 12:56 Tue 07 Nov , Hal Rosenstock wrote:
> OpenSM/osm_node_info_rcv.c: Consolidate IB router handling with CA
> handling
>
> From: Rolf Manderscheid
> Signed-off-by: Hal Rosenstock

Good stuff. The patch looks good too.

Sasha

From rdreier at cisco.com Tue Nov 7 10:45:06 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 07 Nov 2006 10:45:06 -0800
Subject: [openib-general] Error starting IB HCA while Linux Boot
In-Reply-To: (Adit Ranadive's message of "Tue, 7 Nov 2006 12:06:01 -0500")
References:
Message-ID:

 > Is it possible that due to bad RAM chips the device could fail to startup?

Anything is possible if your system has bad memory... just fix it rather than wasting time trying to figure out exactly how it's failing.

From rdreier at cisco.com Tue Nov 7 10:52:15 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 07 Nov 2006 10:52:15 -0800
Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive
In-Reply-To: <4550BD88.9020400@ichips.intel.com> (Sean Hefty's message of "Tue, 07 Nov 2006 09:08:24 -0800")
References: <4550BD88.9020400@ichips.intel.com>
Message-ID:

OK, I'm testing this patch now. Does this seem like a better fix (if it works of course)?

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 493f4c6..a72bcea 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1750,7 +1750,7 @@ ib_find_send_mad(struct ib_mad_agent_pri
                      */
                     (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) ||
                      rcv_has_same_gid(mad_agent_priv, wr, wc)))
-                        return wr;
+                        return (wr->status == IB_WC_SUCCESS) ? wr : NULL;
         }
 
         /*

From rdreier at cisco.com Tue Nov 7 11:13:45 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 07 Nov 2006 11:13:45 -0800
Subject: [openib-general] (belated) ANNOUNCE: libibverbs 1.0.4
Message-ID:

Last week, I tagged the 1.0.4 release of libibverbs. It is on http://openib.org/downloads/, and Debian Unstable and Fedora Extras has updated binary packages.

Major changes since 1.0.3:

- Fix static linking so it actually works.
- Add Valgrind annotations. - Other miscellaneous fixes. See the ChangeLog in the package for full details. Thanks, Roland From rdreier at cisco.com Tue Nov 7 11:25:12 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Nov 2006 11:25:12 -0800 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: <200611052141.29030.hnguyen@de.ibm.com> (Hoang-Nam Nguyen's message of "Sun, 5 Nov 2006 21:41:28 +0100") References: <200611052141.29030.hnguyen@de.ibm.com> Message-ID: > - *mapaddr = (u64)(ioremap(physaddr, EHCA_PAGESIZE)); > + *mapaddr = (u64)ioremap((physaddr & PAGE_MASK), PAGE_SIZE) + > + (physaddr & (~PAGE_MASK)); I'm confused -- shouldn't ioremap() do the right thing even if physaddr isn't page-aligned? Why is this needed? - R. From mshefty at ichips.intel.com Tue Nov 7 11:33:47 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Tue, 07 Nov 2006 11:33:47 -0800 Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive In-Reply-To: References: <4550BD88.9020400@ichips.intel.com> Message-ID: <4550DF9B.3030102@ichips.intel.com> Roland Dreier wrote: > OK, I'm testing this patch now. Does this seem like a better fix (if > it works of course)? Looks fine by me. - Sean From rdreier at cisco.com Tue Nov 7 11:35:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Nov 2006 11:35:04 -0800 Subject: [openib-general] [PATCH/RFC] IB/mad: Fix race between cancel and receive In-Reply-To: <4550DF9B.3030102@ichips.intel.com> (Sean Hefty's message of "Tue, 07 Nov 2006 11:33:47 -0800") References: <4550BD88.9020400@ichips.intel.com> <4550DF9B.3030102@ichips.intel.com> Message-ID: > Looks fine by me. Cool, looks like it's surviving my test. I'll let it run for a few more hours and queue it up for 2.6.19 if it's still good. - R. 
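Archive note on the IB/mad thread above: the one-line change to ib_find_send_mad() returns the matching send only when its status is still IB_WC_SUCCESS, so that a response arriving for an already-canceled send is not paired with it. The following is a minimal user-space sketch of that idea, not the kernel code. The names struct send_wr, enum wc_status, and find_send() are hypothetical stand-ins, and it assumes that canceling a send marks it with a non-success status.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's work-completion status codes. */
enum wc_status { WC_SUCCESS = 0, WC_FLUSH_ERR = 1 };

/* Hypothetical, simplified send work request: a transaction ID plus a
 * status that is assumed to become non-success once the send is canceled. */
struct send_wr {
    uint64_t tid;
    enum wc_status status;
};

/* Return the pending send that matches tid, or NULL if nothing matches
 * or the matching send has already been canceled. */
struct send_wr *find_send(struct send_wr *list, size_t n, uint64_t tid)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].tid == tid)
            return (list[i].status == WC_SUCCESS) ? &list[i] : NULL;
    }
    return NULL;
}
```

With this shape, the receive path treats a canceled send the same way as "no matching send found", which is the behavior the patch in the thread appears to aim for.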
From ardavis at ichips.intel.com Tue Nov 7 12:36:46 2006 From: ardavis at ichips.intel.com (Arlin Davis) Date: Tue, 07 Nov 2006 12:36:46 -0800 Subject: [openib-general] [PATCH 3/3] uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <000201c701f5$20c3b2e0$bb97070a@amr.corp.intel.com> References: <000201c701f5$20c3b2e0$bb97070a@amr.corp.intel.com> Message-ID: <4550EE5E.1010103@ichips.intel.com> All 3 patches committed in OFA(r10074) and SourceForge (r1414) From halr at voltaire.com Tue Nov 7 13:20:47 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 07 Nov 2006 16:20:47 -0500 Subject: [openib-general] [PATCH] OpenSM: When not forcing link speed, use LinkSpeedSupported as LinkSpeedEnabled Message-ID: <1162934443.2783.894.camel@hal.voltaire.com> OpenSM: When not forcing link speed, use LinkSpeedSupported as LinkSpeedEnabled. Signed-off-by: Hal Rosenstock Index: opensm/osm_lid_mgr.c =================================================================== --- opensm/osm_lid_mgr.c (revision 10056) +++ opensm/osm_lid_mgr.c (working copy) @@ -1155,7 +1155,7 @@ __osm_lid_mgr_set_physp_pi( if ( p_mgr->p_subn->opt.force_link_speed ) ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); else - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, sizeof(p_pi->link_speed) )) send_set = TRUE; Index: opensm/osm_link_mgr.c =================================================================== --- opensm/osm_link_mgr.c (revision 10056) +++ opensm/osm_link_mgr.c (working copy) @@ -313,7 +313,7 @@ __osm_link_mgr_set_physp_pi( if ( p_mgr->p_subn->opt.force_link_speed ) ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); else - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); +
ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, sizeof(p_pi->link_speed) )) send_set = TRUE; From bunk at stusta.de Tue Nov 7 05:30:25 2006 From: bunk at stusta.de (Adrian Bunk) Date: Tue, 7 Nov 2006 14:30:25 +0100 Subject: [openib-general] 2.6.19-rc4: known unfixed regressions (v4) In-Reply-To: References: Message-ID: <20061107133025.GF8099@stusta.de> This email lists some known regressions in 2.6.19-rc compared to 2.6.18 that are not yet fixed in Linus' tree. If you find your name in the Cc header, you are either the submitter of one of the bugs, the maintainer of an affected subsystem or driver, a patch of yours caused a breakage, or I'm considering you in some other way possibly involved with one or more of these issues. Due to the huge number of recipients, please trim the Cc when answering. Subject : ThinkPad R50p: boot fail with (lapic && on_battery) References : http://lkml.org/lkml/2006/10/31/333 Submitter : Ernst Herzberg Handled-By : Len Brown Status : problem is being debugged Subject : ThinkPad T60: no screen after resume References : http://mail.matrix.de/pipermail/linux-thinkpad/2006-November/037011.html Submitter : Martin Lorenz Status : unknown Subject : ThinkPad T60: lose ACPI events after suspend/resume References : http://lkml.org/lkml/2006/10/10/39 http://bugzilla.kernel.org/show_bug.cgi?id=7408 Submitter : Martin Lorenz Status : problem might be fixed by commit f9dadfa71bc594df09044da61d1c72701121d802 Subject : i386: more DWARFs and strange messages References : http://lkml.org/lkml/2006/10/29/127 Submitter : Martin Lorenz Status : should be fixed by commit 4b96b1a10cb00c867103b21f0f2a6c91b705db11 Subject : BUG: scheduling while atomic: events/0/0x00000001/4, etc..
References : http://lkml.org/lkml/2006/11/2/209 Submitter : Paolo Ornati Status : unknown Subject : sata-via doesn't detect anymore disks attached to VIA vt6421 References : http://bugzilla.kernel.org/show_bug.cgi?id=7255 Submitter : Thierry Vignaud Status : unknown Subject : unable to rip cd References : http://lkml.org/lkml/2006/10/13/100 Submitter : Alex Romosan Status : unknown Subject : x86_64: oprofile doesn't work References : http://lkml.org/lkml/2006/10/27/3 Submitter : Prakash Punnoor Status : unknown Subject : x86_64: NR_IRQ increase causes 11.5% slowdown in lmbench's fork benchmark References : http://lkml.org/lkml/2006/11/2/192 Submitter : Tim Chen Caused-By : Eric W. Biederman commit 550f2299ac8ffaba943cf211380d3a8d3fa75301 Status : unknown Subject : PCI: MMCONFIG breakage References : http://lkml.org/lkml/2006/10/27/251 Submitter : Jeff Chua Handled-By : Andi Kleen Status : Andi is investigating, both BIOS and Direct work Subject : SMP kernel can not generate ISA irq properly References : http://lkml.org/lkml/2006/10/22/15 Submitter : Komuro Handled-By : Thomas Gleixner Status : Thomas is investigating Subject : ipath driver MCEs system on load when HT chip present References : http://bugzilla.kernel.org/show_bug.cgi?id=7455 Submitter : Bryan O'Sullivan Caused-By : Eric W. Biederman Handled-By : Bryan O'Sullivan Eric W. 
Biederman Status : Bryan and Eric are working on fixing the ipath driver Subject : boot hang in the microcode driver References : http://lkml.org/lkml/2006/11/6/117 Submitter : Arjan van de Ven Caused-By : Shaohua Li commit a30a6a2cb0fdc2c9701d6ddfb21affeb8146c038 Handled-By : Arjan van de Ven Patch : http://lkml.org/lkml/2006/11/6/117 Status : workaround-patch available From jeff.chua.linux at gmail.com Tue Nov 7 08:56:30 2006 From: jeff.chua.linux at gmail.com (Jeff Chua) Date: Wed, 8 Nov 2006 00:56:30 +0800 (SGT) Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) Message-ID: On 11/7/06, Aaron Durbin wrote: > Could please you post a dump of /proc/iomem for both the kernel that > works for you and the kernel that fails to allocate the PCI resources? 1) this works ... 00000000-0009ffff : System RAM 000a0000-000bffff : Video RAM area 000c0000-000c7fff : Video ROM 000cc800-000cffff : Adapter ROM 000f0000-000fffff : System ROM 00100000-df686bff : System RAM 00100000-00357d27 : Kernel code 00357d28-0042bab3 : Kernel data df686c00-df688bff : ACPI Non-volatile Storage df688c00-df68abff : ACPI Tables df68ac00-dfffffff : reserved e0000000-efffffff : 0000:00:02.0 f0000000-f3ffffff : reserved fe700000-fe7fffff : PCI Bus #03 fe800000-fe8fffff : PCI Bus #02 fe8f0000-fe8fffff : 0000:02:00.0 fe8f0000-fe8fffff : tg3 fe900000-fe9fffff : PCI Bus #01 feabf900-feabf9ff : 0000:00:1e.2 feabfa00-feabfbff : 0000:00:1e.2 feac0000-feafffff : 0000:00:02.0 feb00000-feb7ffff : 0000:00:02.0 feb80000-febfffff : 0000:00:02.1 fed00000-fed003ff : HPET 0 fed20000-fed9ffff : reserved fee00000-feefffff : reserved ffa80800-ffa80bff : 0000:00:1d.7 ffa80800-ffa80bff : ehci_hcd ffb00000-ffffffff : reserved 2) this fails ... 
00000000-0009ffff : System RAM 000a0000-000bffff : Video RAM area 000c0000-000c7fff : Video ROM 000cc800-000cffff : Adapter ROM 000f0000-000fffff : System ROM 00100000-df686bff : System RAM 00100000-00358927 : Kernel code 00358928-0042cab3 : Kernel data df686c00-df688bff : ACPI Non-volatile Storage df688c00-df68abff : ACPI Tables df68ac00-dfffffff : reserved e0000000-efffffff : 0000:00:02.0 f0000000-ffffffff : PCI MMCONFIG 0 fed00000-fed003ff : HPET 0 Thanks, Jeff From matthew at wil.cx Tue Nov 7 09:11:43 2006 From: matthew at wil.cx (Matthew Wilcox) Date: Tue, 7 Nov 2006 10:11:43 -0700 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: Message-ID: <20061107171143.GU27140@parisc-linux.org> On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote: > 2) this fails ... > > e0000000-efffffff : 0000:00:02.0 > f0000000-ffffffff : PCI MMCONFIG 0 > fed00000-fed003ff : HPET 0 Heh, no kidding ... num_buses = pci_mmcfg_config[i].end_bus_number - pci_mmcfg_config[i].start_bus_number + 1; res->start = pci_mmcfg_config[i].base_address; res->end = res->start + (num_buses << 20) - 1; res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; insert_resource(&iomem_resource, res); So if we have 256 busses assigned, then we request 256MB and, well, there's no room for anyone else. This code was added by Andi in commit de09bddb9d6f96785be470c832b881e6d72d589f Hopefully he'll have a good idea how to restrict it. Given your "working" resource map, it seems like it should be limited to 16MB (and thus 16 busses). But how to figure that out?
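Archive note on the MMCONFIG sizing discussed above: the quoted kernel snippet reserves 1MB of address space per PCI bus (num_buses << 20), so the size of the window follows directly from the bus range reported by the firmware. The sketch below reproduces only that arithmetic in user space. The helper names mmcfg_size(), mmcfg_end(), and buses_that_fit() are hypothetical and do not exist in the kernel; the addresses used in the note after the code come from the /proc/iomem dumps in the thread.

```c
#include <stdint.h>

/* Size in bytes of an MMCONFIG window covering buses start_bus..end_bus
 * inclusive, at 1MB (1 << 20) of configuration space per bus. */
uint64_t mmcfg_size(unsigned start_bus, unsigned end_bus)
{
    uint64_t num_buses = (uint64_t)end_bus - start_bus + 1;
    return num_buses << 20;
}

/* Last address of the window, computed the same way as res->end
 * in the quoted code. */
uint64_t mmcfg_end(uint64_t base, unsigned start_bus, unsigned end_bus)
{
    return base + mmcfg_size(start_bus, end_bus) - 1;
}

/* Number of whole buses (1MB each) that fit in the inclusive
 * address range first..last. */
unsigned buses_that_fit(uint64_t first, uint64_t last)
{
    return (unsigned)((last - first + 1) >> 20);
}
```

Under these assumptions, mmcfg_end(0xf0000000, 0, 255) gives 0xffffffff, which matches the "PCI MMCONFIG 0" entry in the failing map, while buses_that_fit(0xf0000000, 0xf3ffffff) gives 64 for the range marked reserved in the working map.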
From adurbin at google.com Tue Nov 7 09:11:16 2006 From: adurbin at google.com (Aaron Durbin) Date: Tue, 7 Nov 2006 09:11:16 -0800 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: Message-ID: <8f95bb250611070911p612a332p20e406429759fbf4@mail.gmail.com> On 11/7/06, Jeff Chua wrote: > > On 11/7/06, Aaron Durbin wrote: > > > Could please you post a dump of /proc/iomem for both the kernel that > > works for you and the kernel that fails to allocate the PCI resources? > > > > 1) this works ... > > 00000000-0009ffff : System RAM > 000a0000-000bffff : Video RAM area > 000c0000-000c7fff : Video ROM > 000cc800-000cffff : Adapter ROM > 000f0000-000fffff : System ROM > 00100000-df686bff : System RAM > 00100000-00357d27 : Kernel code > 00357d28-0042bab3 : Kernel data > df686c00-df688bff : ACPI Non-volatile Storage > df688c00-df68abff : ACPI Tables > df68ac00-dfffffff : reserved > e0000000-efffffff : 0000:00:02.0 > f0000000-f3ffffff : reserved > fe700000-fe7fffff : PCI Bus #03 > fe800000-fe8fffff : PCI Bus #02 > fe8f0000-fe8fffff : 0000:02:00.0 > fe8f0000-fe8fffff : tg3 > fe900000-fe9fffff : PCI Bus #01 > feabf900-feabf9ff : 0000:00:1e.2 > feabfa00-feabfbff : 0000:00:1e.2 > feac0000-feafffff : 0000:00:02.0 > feb00000-feb7ffff : 0000:00:02.0 > feb80000-febfffff : 0000:00:02.1 > fed00000-fed003ff : HPET 0 > fed20000-fed9ffff : reserved > fee00000-feefffff : reserved > ffa80800-ffa80bff : 0000:00:1d.7 > ffa80800-ffa80bff : ehci_hcd > ffb00000-ffffffff : reserved > > > > 2) this fails ... 
> > 00000000-0009ffff : System RAM > 000a0000-000bffff : Video RAM area > 000c0000-000c7fff : Video ROM > 000cc800-000cffff : Adapter ROM > 000f0000-000fffff : System ROM > 00100000-df686bff : System RAM > 00100000-00358927 : Kernel code > 00358928-0042cab3 : Kernel data > df686c00-df688bff : ACPI Non-volatile Storage > df688c00-df68abff : ACPI Tables > df68ac00-dfffffff : reserved > e0000000-efffffff : 0000:00:02.0 > f0000000-ffffffff : PCI MMCONFIG 0 > fed00000-fed003ff : HPET 0 > Ok. Jeff, I have a patch in there that reserves the MMCONFIG space; however, it is marked as reserved during resource insertion. For some reason your MMCONFIG space is being reported as very large, thus reserving the range f0000000-ffffffff. That is why your PCI devices are bombing out on resource allocation. It looks like the MMCONFIG region should be: f0000000-f3ffffff. This range is marked as reserved in your e820 map; however, the MMCONFIG parsing thinks it is 256MB. This is not the right answer, but as a temporary fix you could patch your kernel to use the correct size. I am going to see if I can parse any other information from your logs and see if I can come up with a better solution. I just wanted to point you and others in the right direction. -Aaron From adurbin at google.com Tue Nov 7 09:50:54 2006 From: adurbin at google.com (Aaron Durbin) Date: Tue, 7 Nov 2006 09:50:54 -0800 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <20061107171143.GU27140@parisc-linux.org> References: <20061107171143.GU27140@parisc-linux.org> Message-ID: <8f95bb250611070950m3dc45674gbd370e3173b6168d@mail.gmail.com> On 11/7/06, Matthew Wilcox wrote: > On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote: > > 2) this fails ... > > > > e0000000-efffffff : 0000:00:02.0 > > f0000000-ffffffff : PCI MMCONFIG 0 > > fed00000-fed003ff : HPET 0 > > Heh, no kidding ...
> > num_buses = pci_mmcfg_config[i].end_bus_number - > pci_mmcfg_config[i].start_bus_number + 1; > res->start = pci_mmcfg_config[i].base_address; > res->end = res->start + (num_buses << 20) - 1; > res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; > insert_resource(&iomem_resource, res); > > So if we have 256 busses assigned, then we request 256MB and, well, > there's no room for anyone else. This code was added by Andi in commit > de09bddb9d6f96785be470c832b881e6d72d589f > > Hopefully he'll have a good idea how to restrict it. Given your "working" > resource map, it seems like it should be limited to 16MB (and thus 16 busses). > But how to figure that out? > Maybe Andi can shed some light on the reasoning for not checking e820 to see if the entire MMCONFIG region is reported as reserved in the e820 map. I can patch up the pci_mmcfg_insert_resource to verify if the region that is exported by ACPI is reserved in e820 and printk an error message if it is not and skip the resource insertion. Does that seem like a good avenue to pursue? -Aaron From ebiederm at xmission.com Tue Nov 7 09:49:36 2006 From: ebiederm at xmission.com (ebiederm at xmission.com) Date: Tue, 07 Nov 2006 10:49:36 -0700 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <20061107171143.GU27140@parisc-linux.org> (Matthew Wilcox's message of "Tue, 7 Nov 2006 10:11:43 -0700") References: <20061107171143.GU27140@parisc-linux.org> Message-ID: Matthew Wilcox writes: > On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote: >> 2) this fails ... >> >> e0000000-efffffff : 0000:00:02.0 >> f0000000-ffffffff : PCI MMCONFIG 0 >> fed00000-fed003ff : HPET 0 > > Heh, no kidding ... 
> > num_buses = pci_mmcfg_config[i].end_bus_number - > pci_mmcfg_config[i].start_bus_number + 1; > res->start = pci_mmcfg_config[i].base_address; > res->end = res->start + (num_buses << 20) - 1; > res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; > insert_resource(&iomem_resource, res); > > So if we have 256 busses assigned, then we request 256MB and, well, > there's no room for anyone else. This code was added by Andi in commit > de09bddb9d6f96785be470c832b881e6d72d589f > > Hopefully he'll have a good idea how to restrict it. Given your "working" > resource map, it seems like it should be limited to 16MB (and thus 16 busses). > But how to figure that out? Sounds like you need to find the current maximum bus number in use. A slightly more sophisticated approach would look at where the next reserved region begins. ACPI might have some of that information as well. Although I'm not certain where you are coming from. If you don't have to worry about device hotplug, getting the current maximum bus number should be all you need. Eric From matthew at wil.cx Tue Nov 7 09:56:52 2006 From: matthew at wil.cx (Matthew Wilcox) Date: Tue, 7 Nov 2006 10:56:52 -0700 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <8f95bb250611070950m3dc45674gbd370e3173b6168d@mail.gmail.com> References: <20061107171143.GU27140@parisc-linux.org> <8f95bb250611070950m3dc45674gbd370e3173b6168d@mail.gmail.com> Message-ID: <20061107175651.GV27140@parisc-linux.org> On Tue, Nov 07, 2006 at 09:50:54AM -0800, Aaron Durbin wrote: > Maybe Andi can shed some light on the reasoning for not checking e820 > to see if the entire MMCONFIG region is reported as reserved in the > e820 map. I can patch up the pci_mmcfg_insert_resource to verify if > the region that is exported by ACPI is reserved in e820 and printk an > error message if it is not and skip the resource insertion. > > Does that seem like a good avenue to pursue?
Sounds much better than Eric's idea of maximum bus number currently in use (which was also my first thought). But rather than skipping the resource insertion, I believe you should limit its size to the largest multiple of 1MB that will fit within the reserved region. From earny at net4u.de Tue Nov 7 12:05:36 2006 From: earny at net4u.de (Ernst Herzberg) Date: Tue, 7 Nov 2006 21:05:36 +0100 Subject: [openib-general] [linux-pm] 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <200611070041.28008.len.brown@intel.com> References: <200611070317.42230.earny@net4u.de> <200611070041.28008.len.brown@intel.com> Message-ID: <200611072105.50178.earny@net4u.de> On Tuesday 07 November 2006 06:41, Len Brown wrote: > On Monday 06 November 2006 21:17, Ernst Herzberg wrote: > > On Sunday 05 November 2006 07:48, Adrian Bunk wrote: > > > ... > > > Subject : ThinkPad R50p: boot fail with (lapic && on_battery) > > > References : http://lkml.org/lkml/2006/10/31/333 > > > Submitter : Ernst Herzberg > > > Status : problem is being debugged First I made sure again that 2.6.18.2 works..... damn sure. > > Please test if booting with "processor.max_cstate=1" makes any > difference --> NO. > Please test if building with CONFIG_CPU_FREQ=n makes any difference. --> NO. > Also, please make sure that booting with "apm=off" makes no difference > -- there is a bug where the APM code is not currently disabled in ACPI > mode, and who knows what effect that may have... Ahem. All previous tests were done with CONFIG_APM=n. So I tested with CONFIG_APM=y. Does not help. It makes no difference booting with "apm=off" or not. Another check: The laptop must be powered on while on battery to trigger the problem. If I disconnect AC at the grub prompt the problem does _not_ occur. The laptop has two batteries. The UltraBay battery is nearly dead now, but it makes no difference if removed. This problem is not very important for me; I am just wondering why it only occurs when running on battery.
I would be happy if someone could explain this;-) The laptop itself is rock stable (if it boots), I have never seen any glitch or instability (with lapic). I don't know exactly when I started using lapic. It's a long time ago (last year?), I read the message that I can enable this, so I did. No problems until 2.6.19-rc1.... If nobody can reproduce this, I don't care about the problem. There is also a life without lapic:) But maybe it shows a problem somewhere else, timing or whatever, so I'm willing to test anything against it. So if someone is interested in reproducing the problem, I repeat the conditions that must be met: 1.: Laptop must be powered on with AC removed (on battery) 2.: BIOS-Setting must be power --> Intel Speedstep --> Mode on Battery --> "Max Battery" 3.: Kernel command line must have "lapic" 4.: Kernel must be >= 2.6.19-rc1 dmidecode: (maybe that helps?) # dmidecode 2.8 SMBIOS 2.33 present. 61 structures occupying 2127 bytes. Table at 0x000E0010. Handle 0x0000, DMI type 0, 20 bytes BIOS Information Vendor: IBM Version: 1RETDPWW (3.21 ) Release Date: 06/02/2006 Address: 0xDC000 Runtime Size: 144 kB ROM Size: 1024 kB Characteristics: PCI is supported PC Card (PCMCIA) is supported PNP is supported APM is supported BIOS is upgradeable BIOS shadowing is allowed ESCD support is available Boot from CD is supported Selectable boot is supported EDD is supported 3.5"/720 KB floppy services are supported (int 13h) Print screen service is supported (int 5h) 8042 keyboard services are supported (int 9h) Serial services are supported (int 14h) Printer services are supported (int 17h) CGA/mono video services are supported (int 10h) ACPI is supported USB legacy is supported AGP is supported BIOS boot specification is supported Handle 0x0001, DMI type 1, 25 bytes System Information Manufacturer: IBM Product Name: 183222G Version: ThinkPad R50p Serial Number: 99DR993 UUID: 5532DC80-466C-11CB-B373-95CD80E5548B Wake-up Type: Power Switch Handle 0x0002, DMI type 2, 8 bytes
Base Board Information Manufacturer: IBM Product Name: 183222G Version: Not Available Serial Number: J1V9545B13X Handle 0x0003, DMI type 3, 17 bytes Chassis Information Manufacturer: IBM Type: Notebook Lock: Not Present Version: Not Available Serial Number: Not Available Asset Tag: No Asset Information Boot-up State: Unknown Power Supply State: Unknown Thermal State: Unknown Security Status: Unknown OEM Information: 0x00000000 Handle 0x0004, DMI type 126, 17 bytes Inactive Handle 0x0005, DMI type 126, 17 bytes Inactive Handle 0x0006, DMI type 4, 35 bytes Processor Information Socket Designation: None Type: Central Processor Family: Pentium M Manufacturer: GenuineIntel ID: 95 06 00 00 BF F9 E9 A7 Signature: Type 0, Family 6, Model 9, Stepping 5 Flags: FPU (Floating-point unit on-chip) VME (Virtual mode extension) DE (Debugging extension) PSE (Page size extension) TSC (Time stamp counter) MSR (Model specific registers) MCE (Machine check exception) CX8 (CMPXCHG8 instruction supported) SEP (Fast system call) MTRR (Memory type range registers) PGE (Page global enable) MCA (Machine check architecture) CMOV (Conditional move instruction supported) PAT (Page attribute table) CLFSH (CLFLUSH instruction supported) DS (Debug store) ACPI (ACPI supported) MMX (MMX technology supported) FXSR (Fast floating-point save and restore) SSE (Streaming SIMD extensions) SSE2 (Streaming SIMD extensions 2) TM (Thermal monitor supported) PBE (Pending break enabled) Version: Intel(R) Pentium(R) M processor Voltage: 1.5 V External Clock: 400 MHz Max Speed: 1700 MHz Current Speed: 1700 MHz Status: Populated, Enabled Upgrade: None L1 Cache Handle: 0x000A L2 Cache Handle: 0x000B L3 Cache Handle: Not Provided Serial Number: Not Specified Asset Tag: Not Specified Part Number: Not Specified Handle 0x0007, DMI type 5, 20 bytes Memory Controller Information Error Detecting Method: None Error Correcting Capabilities: None Supported Interleave: One-way Interleave Current Interleave: One-way Interleave 
Maximum Memory Module Size: 1024 MB Maximum Total Memory Size: 2048 MB Supported Speeds: Other Supported Memory Types: DIMM SDRAM Memory Module Voltage: 2.9 V Associated Memory Slots: 2 0x0008 0x0009 Enabled Error Correcting Capabilities: None Handle 0x0008, DMI type 6, 12 bytes Memory Module Information Socket Designation: DIMM Slot 1 Bank Connections: 0 1 Current Speed: Unknown Type: DIMM SDRAM Installed Size: 512 MB (Double-bank Connection) Enabled Size: 512 MB (Double-bank Connection) Error Status: OK Handle 0x0009, DMI type 6, 12 bytes Memory Module Information Socket Designation: DIMM Slot 2 Bank Connections: 2 3 Current Speed: Unknown Type: DIMM SDRAM Installed Size: 512 MB (Double-bank Connection) Enabled Size: 512 MB (Double-bank Connection) Error Status: OK Handle 0x000A, DMI type 7, 19 bytes Cache Information Socket Designation: Internal L1 Cache Configuration: Enabled, Socketed, Level 1 Operational Mode: Write Back Location: Internal Installed Size: 32 KB Maximum Size: 32 KB Supported SRAM Types: Synchronous Installed SRAM Type: Synchronous Speed: Unknown Error Correction Type: Unknown System Type: Other Associativity: 8-way Set-associative Handle 0x000B, DMI type 7, 19 bytes Cache Information Socket Designation: Internal L2 Cache Configuration: Enabled, Socketed, Level 2 Operational Mode: Write Back Location: Internal Installed Size: 1024 KB Maximum Size: 1024 KB Supported SRAM Types: Burst Installed SRAM Type: Burst Speed: Unknown Error Correction Type: Multi-bit ECC System Type: Unified Associativity: 8-way Set-associative Handle 0x000C, DMI type 126, 9 bytes Inactive Handle 0x000D, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: Infrared External Connector Type: Infrared Port Type: Other Handle 0x000E, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference 
Designator: Parallel External Connector Type: DB-25 female Port Type: Parallel Port ECP/EPP Handle 0x000F, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: External Monitor External Connector Type: DB-15 female Port Type: Video Port Handle 0x0010, DMI type 126, 9 bytes Inactive Handle 0x0011, DMI type 126, 9 bytes Inactive Handle 0x0012, DMI type 126, 9 bytes Inactive Handle 0x0013, DMI type 126, 9 bytes Inactive Handle 0x0014, DMI type 126, 9 bytes Inactive Handle 0x0015, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: Microphone Jack External Connector Type: Mini Jack (headphones) Port Type: Audio Port Handle 0x0016, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: Headphone Jack External Connector Type: Mini Jack (headphones) Port Type: Audio Port Handle 0x0017, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: S-Video-Out External Connector Type: Other Port Type: Video Port Handle 0x0018, DMI type 126, 9 bytes Inactive Handle 0x0019, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: Modem External Connector Type: RJ-11 Port Type: Modem Port Handle 0x001A, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: Ethernet External Connector Type: RJ-45 Port Type: Network Port Handle 0x001B, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: USB 1 External 
Connector Type: Access Bus (USB) Port Type: USB Handle 0x001C, DMI type 8, 9 bytes Port Connector Information Internal Reference Designator: Not Available Internal Connector Type: None External Reference Designator: USB 2 External Connector Type: Access Bus (USB) Port Type: USB Handle 0x001D, DMI type 126, 9 bytes Inactive Handle 0x001E, DMI type 126, 9 bytes Inactive Handle 0x001F, DMI type 126, 9 bytes Inactive Handle 0x0020, DMI type 126, 9 bytes Inactive Handle 0x0021, DMI type 126, 9 bytes Inactive Handle 0x0022, DMI type 9, 13 bytes System Slot Information Designation: CardBus Slot 1 Type: 32-bit PC Card (PCMCIA) Current Usage: Available Length: Other ID: Adapter 0, Socket 0 Characteristics: 5.0 V is provided 3.3 V is provided PC Card-16 is supported Cardbus is supported Zoom Video is supported Modem ring resume is supported PME signal is supported Hot-plug devices are supported Handle 0x0023, DMI type 9, 13 bytes System Slot Information Designation: CardBus Slot 2 Type: 32-bit PC Card (PCMCIA) Current Usage: Available Length: Other ID: Adapter 1, Socket 0 Characteristics: 5.0 V is provided 3.3 V is provided PC Card-16 is supported Cardbus is supported Zoom Video is supported Modem ring resume is supported PME signal is supported Hot-plug devices are supported Handle 0x0024, DMI type 126, 13 bytes Inactive Handle 0x0025, DMI type 126, 13 bytes Inactive Handle 0x0026, DMI type 9, 13 bytes System Slot Information Designation: Mini-PCI Slot 1 Type: 32-bit PCI Current Usage: Available Length: Other ID: 1 Characteristics: 5.0 V is provided 3.3 V is provided PME signal is supported SMBus signal is supported Handle 0x0027, DMI type 126, 13 bytes Inactive Handle 0x0028, DMI type 10, 6 bytes On Board Device Information Type: Other Status: Enabled Description: IBM Embedded Security hardware Handle 0x0029, DMI type 11, 5 bytes OEM Strings String 1: IBM ThinkPad Embedded Controller -[1RHT71WW-3.04 ]- Handle 0x002A, DMI type 13, 22 bytes BIOS Language Information 
Installable Languages: 1 enUS Currently Installed Language: enUS Handle 0x002B, DMI type 15, 25 bytes System Event Log Area Length: 0 bytes Header Start Offset: 0x0000 Header Length: 16 bytes Data Start Offset: 0x0010 Access Method: General-purpose non-volatile data functions Access Address: 0x0000 Status: Invalid, Not Full Change Token: 0x00000004 Header Format: Type 1 Supported Log Type Descriptors: 1 Descriptor 1: POST error Data Format 1: POST results bitmap Handle 0x002C, DMI type 16, 15 bytes Physical Memory Array Location: System Board Or Motherboard Use: System Memory Error Correction Type: None Maximum Capacity: 1 GB Error Information Handle: Not Provided Number Of Devices: 2 Handle 0x002D, DMI type 17, 27 bytes Memory Device Array Handle: 0x002C Error Information Handle: No Error Total Width: 64 bits Data Width: 64 bits Size: 512 MB Form Factor: SODIMM Set: None Locator: DIMM 1 Bank Locator: Bank 0/1 Type: DDR Type Detail: Synchronous Speed: Unknown Manufacturer: Not Specified Serial Number: Not Specified Asset Tag: Not Specified Part Number: Not Specified Handle 0x002E, DMI type 17, 27 bytes Memory Device Array Handle: 0x002C Error Information Handle: No Error Total Width: 64 bits Data Width: 64 bits Size: 512 MB Form Factor: SODIMM Set: None Locator: DIMM 2 Bank Locator: Bank 2/3 Type: DDR Type Detail: Synchronous Speed: Unknown Manufacturer: Not Specified Serial Number: Not Specified Asset Tag: Not Specified Part Number: Not Specified Handle 0x002F, DMI type 18, 23 bytes 32-bit Memory Error Information Type: OK Granularity: Unknown Operation: Unknown Vendor Syndrome: Unknown Memory Array Address: Unknown Device Address: Unknown Resolution: Unknown Handle 0x0030, DMI type 19, 15 bytes Memory Array Mapped Address Starting Address: 0x00000000000 Ending Address: 0x0003FFFFFFF Range Size: 1 GB Physical Array Handle: 0x002C Partition Width: 0 Handle 0x0031, DMI type 20, 19 bytes Memory Device Mapped Address Starting Address: 0x00000000000 Ending Address: 
0x0001FFFFFFF Range Size: 512 MB Physical Device Handle: 0x002D Memory Array Mapped Address Handle: 0x0030 Partition Row Position: 1 Handle 0x0032, DMI type 20, 19 bytes Memory Device Mapped Address Starting Address: 0x00020000000 Ending Address: 0x0003FFFFFFF Range Size: 512 MB Physical Device Handle: 0x002E Memory Array Mapped Address Handle: 0x0030 Partition Row Position: 1 Handle 0x0033, DMI type 21, 7 bytes Built-in Pointing Device Type: Track Point Interface: PS/2 Buttons: 3 Handle 0x0034, DMI type 21, 7 bytes Built-in Pointing Device Type: Touch Pad Interface: PS/2 Buttons: 0 Handle 0x0035, DMI type 24, 5 bytes Hardware Security Power-On Password Status: Disabled Keyboard Password Status: Disabled Administrator Password Status: Disabled Front Panel Reset Status: Unknown Handle 0x0036, DMI type 32, 11 bytes System Boot Information Status: No errors detected Handle 0x0037, DMI type 131, 102 bytes OEM-specific Type Header and Data: 83 66 37 00 01 00 00 00 00 01 72 03 40 00 AE 80 00 02 00 00 00 00 00 2A 00 40 2A 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 16 00 80 16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Strings: IBMCFGDATA Handle 0x0038, DMI type 131, 17 bytes OEM-specific Type Header and Data: 83 11 38 00 01 02 03 FF FF 1F 00 00 00 00 00 02 00 Strings: BOOTINF 20h BOOTDEV 21h KEYPTRS 23h Handle 0x0039, DMI type 132, 7 bytes OEM-specific Type Header and Data: 84 07 39 00 01 D8 36 Handle 0x003A, DMI type 133, 5 bytes OEM-specific Type Header and Data: 85 05 3A 00 01 Strings: KHOIHGIUCCHHII Handle 0x003B, DMI type 126, 13 bytes Inactive Handle 0x003C, DMI type 127, 4 bytes End Of Table ------ Thx, From patrick at xentech.net Tue Nov 7 16:34:02 2006 From: patrick at xentech.net (Patrick (Xentech)) Date: Wed, 8 Nov 2006 01:34:02 +0100 Subject: [openib-general] dm_client_main.c compile error Message-ID: I'm trying to compile revision 
9087 on a 2.6.18.2 kernel. The following error occurs:

  LD      drivers/infiniband/client_query/built-in.o
  CC [M]  drivers/infiniband/client_query/client_query.o
  CC [M]  drivers/infiniband/client_query/client_query_export.o
  CC [M]  drivers/infiniband/client_query/client_query_main.o
  CC [M]  drivers/infiniband/client_query/dm_client_main.o
drivers/infiniband/client_query/dm_client_main.c:51: error: syntax error before string constant
drivers/infiniband/client_query/dm_client_main.c:51: warning: type defaults to `int' in declaration of `MODULE_PARM'
drivers/infiniband/client_query/dm_client_main.c:51: warning: function declaration isn't a prototype
drivers/infiniband/client_query/dm_client_main.c:51: warning: data definition has no type or storage class
make[3]: *** [drivers/infiniband/client_query/dm_client_main.o] Error 1
make[2]: *** [drivers/infiniband/client_query] Error 2
make[1]: *** [drivers/infiniband] Error 2
make: *** [drivers] Error 2

From sean.hefty at intel.com Tue Nov 7 19:08:25 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 7 Nov 2006 19:08:25 -0800
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <454FD71C.1010909@ichips.intel.com>
Message-ID: <000001c702e3$27e92110$43d0180a@amr.corp.intel.com>

Memo to me: read comments about missing functionality...

Fixed an issue with the previous patch not having the right pkey when
forwarding LAP messages to the user.

With this patch, I'm able to fail over between two paths, reload a new
path, and fail again repeatedly using my test program.

Venkatesh, if you can verify that this code works for you, I will request
that it be queued for 2.6.20.

Signed-off-by: Sean Hefty
---
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 1cf0d42..ed69573 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -147,12 +147,12 @@ struct cm_id_private {
 	__be32 rq_psn;
 	int timeout_ms;
 	enum ib_mtu path_mtu;
+	__be16 pkey;
 	u8 private_data_len;
 	u8 max_cm_retries;
 	u8 peer_to_peer;
 	u8 responder_resources;
 	u8 initiator_depth;
-	u8 local_ack_timeout;
 	u8 retry_count;
 	u8 rnr_retry_count;
 	u8 service_timeout;
@@ -691,7 +691,7 @@ static void cm_enter_timewait(struct cm_
 	 * timewait before notifying the user that we've exited timewait.
 	 */
 	cm_id_priv->id.state = IB_CM_TIMEWAIT;
-	wait_time = cm_convert_to_ms(cm_id_priv->local_ack_timeout);
+	wait_time = cm_convert_to_ms(cm_id_priv->av.packet_life_time + 1);
 	queue_delayed_work(cm.wq, &cm_id_priv->timewait_info->work.work,
 			   msecs_to_jiffies(wait_time));
 	cm_id_priv->timewait_info = NULL;
@@ -1010,6 +1010,7 @@ int ib_send_cm_req(struct ib_cm_id *cm_i
 	cm_id_priv->responder_resources = param->responder_resources;
 	cm_id_priv->retry_count = param->retry_count;
 	cm_id_priv->path_mtu = param->primary_path->mtu;
+	cm_id_priv->pkey = param->primary_path->pkey;
 	cm_id_priv->qp_type = param->qp_type;
 
 	ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
@@ -1024,8 +1025,6 @@ int ib_send_cm_req(struct ib_cm_id *cm_i
 
 	cm_id_priv->local_qpn = cm_req_get_local_qpn(req_msg);
 	cm_id_priv->rq_psn = cm_req_get_starting_psn(req_msg);
-	cm_id_priv->local_ack_timeout =
-				cm_req_get_primary_local_ack_timeout(req_msg);
 
 	spin_lock_irqsave(&cm_id_priv->lock, flags);
 	ret = ib_post_send_mad(cm_id_priv->msg, NULL);
@@ -1410,9 +1409,8 @@ static int cm_req_handler(struct cm_work
 	cm_id_priv->initiator_depth = cm_req_get_resp_res(req_msg);
 	cm_id_priv->responder_resources = cm_req_get_init_depth(req_msg);
 	cm_id_priv->path_mtu = cm_req_get_path_mtu(req_msg);
+	cm_id_priv->pkey = req_msg->pkey;
 	cm_id_priv->sq_psn = cm_req_get_starting_psn(req_msg);
-	cm_id_priv->local_ack_timeout =
-				cm_req_get_primary_local_ack_timeout(req_msg);
 	cm_id_priv->retry_count = cm_req_get_retry_count(req_msg);
 	cm_id_priv->rnr_retry_count = cm_req_get_rnr_retry_count(req_msg);
 	cm_id_priv->qp_type = cm_req_get_qp_type(req_msg);
@@ -1716,7 +1714,7 @@ static int cm_establish_handler(struct c
 	unsigned long flags;
 	int ret;
 
-	/* See comment in ib_cm_establish about lookup. */
+	/* See comment in cm_establish about lookup. */
 	cm_id_priv = cm_acquire_id(work->local_id, work->remote_id);
 	if (!cm_id_priv)
 		return -EINVAL;
@@ -2402,11 +2400,16 @@ int ib_send_cm_lap(struct ib_cm_id *cm_i
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
 	spin_lock_irqsave(&cm_id_priv->lock, flags);
 	if (cm_id->state != IB_CM_ESTABLISHED ||
-	    cm_id->lap_state != IB_CM_LAP_IDLE) {
+	    (cm_id->lap_state != IB_CM_LAP_UNINIT &&
+	     cm_id->lap_state != IB_CM_LAP_IDLE)) {
 		ret = -EINVAL;
 		goto out;
 	}
 
+	ret = cm_init_av_by_path(alternate_path, &cm_id_priv->alt_av);
+	if (ret)
+		goto out;
+
 	ret = cm_alloc_msg(cm_id_priv, &msg);
 	if (ret)
 		goto out;
@@ -2431,7 +2434,8 @@ out:	spin_unlock_irqrestore(&cm_id_priv-
 }
 EXPORT_SYMBOL(ib_send_cm_lap);
 
-static void cm_format_path_from_lap(struct ib_sa_path_rec *path,
+static void cm_format_path_from_lap(struct cm_id_private *cm_id_priv,
+				    struct ib_sa_path_rec *path,
 				    struct cm_lap_msg *lap_msg)
 {
 	memset(path, 0, sizeof *path);
@@ -2443,10 +2447,10 @@ static void cm_format_path_from_lap(stru
 	path->hop_limit = lap_msg->alt_hop_limit;
 	path->traffic_class = cm_lap_get_traffic_class(lap_msg);
 	path->reversible = 1;
-	/* pkey is same as in REQ */
+	path->pkey = cm_id_priv->pkey;
 	path->sl = cm_lap_get_sl(lap_msg);
 	path->mtu_selector = IB_SA_EQ;
-	/* mtu is same as in REQ */
+	path->mtu = cm_id_priv->path_mtu;
 	path->rate_selector = IB_SA_EQ;
 	path->rate = cm_lap_get_packet_rate(lap_msg);
 	path->packet_life_time_selector = IB_SA_EQ;
@@ -2472,7 +2476,7 @@ static int cm_lap_handler(struct cm_work
 
 	param = &work->cm_event.param.lap_rcvd;
 	param->alternate_path = &work->path[0];
-	cm_format_path_from_lap(param->alternate_path, lap_msg);
+	cm_format_path_from_lap(cm_id_priv, param->alternate_path, lap_msg);
 	work->cm_event.private_data = &lap_msg->private_data;
 
 	spin_lock_irqsave(&cm_id_priv->lock, flags);
@@ -2480,6 +2484,7 @@ static int cm_lap_handler(struct cm_work
 		goto unlock;
 
 	switch (cm_id_priv->id.lap_state) {
+	case IB_CM_LAP_UNINIT:
 	case IB_CM_LAP_IDLE:
 		break;
 	case IB_CM_MRA_LAP_SENT:
@@ -2502,6 +2507,10 @@ static int cm_lap_handler(struct cm_work
 
 	cm_id_priv->id.lap_state = IB_CM_LAP_RCVD;
 	cm_id_priv->tid = lap_msg->hdr.tid;
+	cm_init_av_for_response(work->port, work->mad_recv_wc->wc,
+				work->mad_recv_wc->recv_buf.grh,
+				&cm_id_priv->av);
+	cm_init_av_by_path(param->alternate_path, &cm_id_priv->alt_av);
 	ret = atomic_inc_and_test(&cm_id_priv->work_count);
 	if (!ret)
 		list_add_tail(&work->list, &cm_id_priv->work_list);
@@ -3040,7 +3049,7 @@ static void cm_work_handler(void *data)
 		cm_free_work(work);
 }
 
-int ib_cm_establish(struct ib_cm_id *cm_id)
+static int cm_establish(struct ib_cm_id *cm_id)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_work *work;
@@ -3088,7 +3097,44 @@ int ib_cm_establish(struct ib_cm_id *cm_
 out:
 	return ret;
 }
-EXPORT_SYMBOL(ib_cm_establish);
+
+static int cm_migrate(struct ib_cm_id *cm_id)
+{
+	struct cm_id_private *cm_id_priv;
+	unsigned long flags;
+	int ret = 0;
+
+	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+	spin_lock_irqsave(&cm_id_priv->lock, flags);
+	if (cm_id->state == IB_CM_ESTABLISHED &&
+	    (cm_id->lap_state == IB_CM_LAP_UNINIT ||
+	     cm_id->lap_state == IB_CM_LAP_IDLE)) {
+		cm_id->lap_state = IB_CM_LAP_IDLE;
+		cm_id_priv->av = cm_id_priv->alt_av;
+	} else
+		ret = -EINVAL;
+	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+
+	return ret;
+}
+
+int ib_cm_notify(struct ib_cm_id *cm_id, enum ib_event_type event)
+{
+	int ret;
+
+	switch (event) {
+	case IB_EVENT_COMM_EST:
+		ret = cm_establish(cm_id);
+		break;
+	case IB_EVENT_PATH_MIG:
+		ret = cm_migrate(cm_id);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(ib_cm_notify);
 
 static void cm_recv_handler(struct ib_mad_agent *mad_agent,
 			    struct ib_mad_recv_wc *mad_recv_wc)
@@ -3221,6 +3267,9 @@ static int cm_init_qp_rtr_attr(struct cm
 		if (cm_id_priv->alt_av.ah_attr.dlid) {
 			*qp_attr_mask |= IB_QP_ALT_PATH;
 			qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num;
+			qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index;
+			qp_attr->alt_timeout =
+					cm_id_priv->alt_av.packet_life_time + 1;
 			qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr;
 		}
 		ret = 0;
@@ -3247,19 +3296,31 @@ static int cm_init_qp_rts_attr(struct cm
 	case IB_CM_REP_SENT:
 	case IB_CM_MRA_REP_RCVD:
 	case IB_CM_ESTABLISHED:
-		*qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN;
-		qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn);
-		if (cm_id_priv->qp_type == IB_QPT_RC) {
-			*qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT |
-					 IB_QP_RNR_RETRY |
-					 IB_QP_MAX_QP_RD_ATOMIC;
-			qp_attr->timeout = cm_id_priv->local_ack_timeout;
-			qp_attr->retry_cnt = cm_id_priv->retry_count;
-			qp_attr->rnr_retry = cm_id_priv->rnr_retry_count;
-			qp_attr->max_rd_atomic = cm_id_priv->initiator_depth;
-		}
-		if (cm_id_priv->alt_av.ah_attr.dlid) {
-			*qp_attr_mask |= IB_QP_PATH_MIG_STATE;
+		if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) {
+			*qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN;
+			qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn);
+			if (cm_id_priv->qp_type == IB_QPT_RC) {
+				*qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT |
+						 IB_QP_RNR_RETRY |
+						 IB_QP_MAX_QP_RD_ATOMIC;
+				qp_attr->timeout =
+					cm_id_priv->av.packet_life_time + 1;
+				qp_attr->retry_cnt = cm_id_priv->retry_count;
+				qp_attr->rnr_retry = cm_id_priv->rnr_retry_count;
+				qp_attr->max_rd_atomic =
+					cm_id_priv->initiator_depth;
+			}
+			if (cm_id_priv->alt_av.ah_attr.dlid) {
+				*qp_attr_mask |= IB_QP_PATH_MIG_STATE;
+				qp_attr->path_mig_state = IB_MIG_REARM;
+			}
+		} else {
+			*qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE;
+			qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num;
+			qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index;
+			qp_attr->alt_timeout =
+				cm_id_priv->alt_av.packet_life_time + 1;
+			qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr;
 			qp_attr->path_mig_state = IB_MIG_REARM;
 		}
 		ret = 0;
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index ad4f4d5..e04f662 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -685,11 +685,11 @@ out:
 	return result;
 }
 
-static ssize_t ib_ucm_establish(struct ib_ucm_file *file,
-				const char __user *inbuf,
-				int in_len, int out_len)
+static ssize_t ib_ucm_notify(struct ib_ucm_file *file,
+			     const char __user *inbuf,
+			     int in_len, int out_len)
 {
-	struct ib_ucm_establish cmd;
+	struct ib_ucm_notify cmd;
 	struct ib_ucm_context *ctx;
 	int result;
 
@@ -700,7 +700,7 @@ static ssize_t ib_ucm_establish(struct i
 	if (IS_ERR(ctx))
 		return PTR_ERR(ctx);
 
-	result = ib_cm_establish(ctx->cm_id);
+	result = ib_cm_notify(ctx->cm_id, (enum ib_event_type) cmd.event);
 	ib_ucm_ctx_put(ctx);
 	return result;
 }
@@ -1107,7 +1107,7 @@ static ssize_t (*ucm_cmd_table[])(struct
 	[IB_USER_CM_CMD_DESTROY_ID] = ib_ucm_destroy_id,
 	[IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id,
 	[IB_USER_CM_CMD_LISTEN] = ib_ucm_listen,
-	[IB_USER_CM_CMD_ESTABLISH] = ib_ucm_establish,
+	[IB_USER_CM_CMD_NOTIFY] = ib_ucm_notify,
 	[IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req,
 	[IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep,
 	[IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu,
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index c9b4738..5c07017 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -60,6 +60,7 @@ enum ib_cm_state {
 };
 
 enum ib_cm_lap_state {
+	IB_CM_LAP_UNINIT,
 	IB_CM_LAP_IDLE,
 	IB_CM_LAP_SENT,
 	IB_CM_LAP_RCVD,
@@ -443,13 +444,20 @@ int ib_send_cm_drep(struct ib_cm_id *cm_
 		     u8 private_data_len);
 
 /**
- * ib_cm_establish - Forces a connection state to established.
+ * ib_cm_notify - Notifies the CM of an event reported to the consumer.
  * @cm_id: Connection identifier to transition to established.
+ * @event: Type of event.
  *
- * This routine should be invoked by users who receive messages on a
- * connected QP before an RTU has been received.
+ * This routine should be invoked by users to notify the CM of relevant
+ * communication events.  Events that should be reported to the CM and
+ * when to report them are:
+ *
+ * IB_EVENT_COMM_EST - Used when a message is received on a connected
+ *    QP before an RTU has been received.
+ * IB_EVENT_PATH_MIG - Notifies the CM that the connection has failed over
+ *   to the alternate path.
  */
-int ib_cm_establish(struct ib_cm_id *cm_id);
+int ib_cm_notify(struct ib_cm_id *cm_id, enum ib_event_type event);
 
 /**
  * ib_send_cm_rej - Sends a connection rejection message to the
diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h
index 066c20b..37650af 100644
--- a/include/rdma/ib_user_cm.h
+++ b/include/rdma/ib_user_cm.h
@@ -38,7 +38,7 @@
 #define IB_USER_CM_H
 
 #include
-#define IB_USER_CM_ABI_VERSION 4
+#define IB_USER_CM_ABI_VERSION 5
 
 enum {
 	IB_USER_CM_CMD_CREATE_ID,
@@ -46,7 +46,7 @@ enum {
 	IB_USER_CM_CMD_ATTR_ID,
 
 	IB_USER_CM_CMD_LISTEN,
-	IB_USER_CM_CMD_ESTABLISH,
+	IB_USER_CM_CMD_NOTIFY,
 
 	IB_USER_CM_CMD_SEND_REQ,
 	IB_USER_CM_CMD_SEND_REP,
@@ -117,8 +117,9 @@ struct ib_ucm_listen {
 	__u32 reserved;
 };
 
-struct ib_ucm_establish {
+struct ib_ucm_notify {
 	__u32 id;
+	__u32 event;
 };
 
 struct ib_ucm_private_data {

From mst at mellanox.co.il Tue Nov 7 19:19:49 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 8 Nov 2006 05:19:49 +0200
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <000001c702e3$27e92110$43d0180a@amr.corp.intel.com>
References: <454FD71C.1010909@ichips.intel.com> <000001c702e3$27e92110$43d0180a@amr.corp.intel.com>
Message-ID: <20061108031949.GC30144@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: [RFC] [PATCH v2] rdma/ib_cm: fix APM support
>
> Memo to me: read comments about missing functionality...
>
> Fixed an issue with the previous patch not having the right pkey
> when forwarding LAP messages to the user.
>
> With this patch, I'm able to fail over between two paths, reload
> a new path, and fail again repeatedly using my test program.

BTW, just to clarify, in CMA pieces are still missing for APM support.
Is that right?

-- 
MST

From sean.hefty at intel.com Tue Nov 7 21:33:51 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 7 Nov 2006 21:33:51 -0800
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <20061108031949.GC30144@mellanox.co.il>
Message-ID: <000001c702f7$798363a0$9bd0180a@amr.corp.intel.com>

>BTW, just to clarify, in CMA pieces are still missing for APM support.
>Is that right?

Correct - the CMA does not provide path failover capabilities.

Is this something that you're wanting in the CMA?

- Sean

From mst at mellanox.co.il Tue Nov 7 22:32:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 8 Nov 2006 08:32:23 +0200
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <000001c702f7$798363a0$9bd0180a@amr.corp.intel.com>
References: <000001c702f7$798363a0$9bd0180a@amr.corp.intel.com>
Message-ID: <20061108063223.GA30654@mellanox.co.il>

Quoting r. Sean Hefty :
> Subject: RE: [RFC] [PATCH v2] rdma/ib_cm: fix APM support
>
> >BTW, just to clarify, in CMA pieces are still missing for APM support.
> >Is that right?
>
> Correct - the CMA does not provide path failover capabilities.
>
> Is this something that you're wanting in the CMA?

Yes - I am hearing this is as a requirement from SDP.

-- 
MST

From ogerlitz at voltaire.com Tue Nov 7 23:46:18 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 08 Nov 2006 09:46:18 +0200
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <20061108063223.GA30654@mellanox.co.il>
References: <000001c702f7$798363a0$9bd0180a@amr.corp.intel.com> <20061108063223.GA30654@mellanox.co.il>
Message-ID: <45518B4A.9040608@voltaire.com>

Michael S. Tsirkin wrote:
> Quoting r. Sean Hefty :
>> Subject: RE: [RFC] [PATCH v2] rdma/ib_cm: fix APM support
>>
>>> BTW, just to clarify, in CMA pieces are still missing for APM support.
>>> Is that right?
>> Correct - the CMA does not provide path failover capabilities.
>>
>> Is this something that you're wanting in the CMA?
>
> Yes - I am hearing this is as a requirement from SDP.

A few comments on CMA / APM / fail-over:

To start with, IB/APM is limited to the case where both ports are on the
same HCA, hence it does not provide a general-purpose solution for IB ULP
HA. Specifically, customers want to be protected against IB port / HCA /
cable / switch failure, and APM does not address the HCA case.

Moreover, the CMA API addressing is based on **IP** addresses, so when
a client connects it provides the IP address of the server and
optionally its src IP as well. The IP2GID resolution is done by the
Linux netstack ARP API and neighboring subsystem. I don't see a robust
way within this framework to have a single destination IP resolved
to two destination GIDs. And I strongly disagree with stopping the use of
this framework.

Again, even with the second point being somehow solved, the apm nature
makes it a very limited in power feature of IB.

Or.

From mst at mellanox.co.il Wed Nov 8 00:31:59 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 8 Nov 2006 10:31:59 +0200
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <45518B4A.9040608@voltaire.com>
References: <45518B4A.9040608@voltaire.com>
Message-ID: <20061108083159.GA30872@mellanox.co.il>

Quoting Or Gerlitz :
> the CMA API addressing is based on **IP** addresses, so when
> a client connects it provides the IP address of the server and
> optionally its src IP as well.

Once we get the GID that matches the IP, we can locate an extra path and arm it.
Read up on how SDP does this in "A4.5.2 Automatic Path Migration": there are
several strategies there.

> Again, even with the second point being somehow solved, the apm nature
> makes it a very limited in power feature of IB.

Protocols that rely on RC ACK for reliability guarantees (like SDP), basically
do not make it possible to address the hca failure case: you got an ACK, but
remote hca could have failed without committing data to memory. So APM failover
is a requirement for these. It could be iser does not need APM, fine.

-- 
MST

From ogerlitz at voltaire.com Wed Nov 8 01:15:59 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 08 Nov 2006 11:15:59 +0200
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <20061108083159.GA30872@mellanox.co.il>
References: <45518B4A.9040608@voltaire.com> <20061108083159.GA30872@mellanox.co.il>
Message-ID: <4551A04F.2020307@voltaire.com>

Michael S. Tsirkin wrote:
> Quoting Or Gerlitz :
>> the CMA API addressing is based on **IP** addresses, so when
>> a client connects it provides the IP address of the server and
>> optionally its src IP as well.
>
> Once we get the GID that matches the IP, we can locate an extra path and arm it.
> Read up on how SDP does this in "A4.5.2 Automatic Path Migration": there are
> several strategies there.

Thanks for the pointer. I was somewhat aware of the method of locating
the NODE GUID from the GID/LID and then another GID associated with this
node, but the text provides a very good description of it. I think it
does not address how you get a second SGID, but this is trivial to
implement...

Basically I agree that if you are willing to do it out-of-cma-band, it's
possible to get an alt pair (it was even deployed in production in the
IB stack of another OS...).

>> Again, even with the second point being somehow solved, the apm nature
>> makes it a very limited in power feature of IB.

> Protocols that rely on RC ACK for reliability guarantees (like SDP), basically
> do not make it possible to address the hca failure case: you got an ACK, but
> remote hca could have failed without committing data to memory. So APM failover
> is a requirement for these. It could be iser does not need APM, fine.

This is news to me: does your HCA first send an ACK and only then do
the DMA transaction and, if needed, generate the CQE !?!?!?

If this is not the case (thank god), on what systems is there this issue
where you (i.e. the HCA) issue a DMA, get a "bus completion", generate
the CQE, and send an ACK, but the data somehow was not committed to
memory? And how come APM is the solution to this crazy problem?

Putting this aside, my basic assumption (and this is something you need
to check with the SDP customers) is that apps coded for RC
infrastructure (e.g. TCP, IB RC) are willing to ***reconnect*** when a
failure occurs. Moreover, this means that the infrastructure does not
need to take care of house-keeping for unACKED messages, and once the
reconnect succeeds the app retransmits the unacked data.

For those cma apps IPoIB failure is the only HA requirement, since the IB
ARP following the failure would return a valid DGID and from this point
business is as usual (the listener also must not bind its id to a
specific port, but this is trivial to do).

Or.

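Note on the patch discussed in this thread: the event dispatch that ib_cm_notify() adds, and the lap_state check in cm_migrate(), can be modelled in user space to see which transitions are accepted. The sketch below is only an illustration under stated assumptions. It is not kernel code, it takes no locks, all model_* names and the integer "address vectors" are hypothetical stand-ins, and model_establish() is a placeholder because the body of cm_establish() is not shown in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel enums named in the patch. */
enum model_cm_state { MODEL_CM_IDLE, MODEL_CM_ESTABLISHED };
enum model_lap_state { MODEL_LAP_UNINIT, MODEL_LAP_IDLE,
		       MODEL_LAP_SENT, MODEL_LAP_RCVD };
enum model_event { MODEL_EVENT_COMM_EST, MODEL_EVENT_PATH_MIG,
		   MODEL_EVENT_OTHER };

struct model_cm_id {
	enum model_cm_state state;
	enum model_lap_state lap_state;
	int av;		/* stand-in for the active address vector */
	int alt_av;	/* stand-in for the alternate address vector */
};

/* Placeholder only: the real cm_establish() body is not in the diff. */
static int model_establish(struct model_cm_id *id)
{
	id->state = MODEL_CM_ESTABLISHED;
	return 0;
}

/*
 * Mirrors the check in cm_migrate(): migration is accepted only on an
 * established connection whose LAP state is UNINIT or IDLE; the
 * alternate path then becomes the active one.
 */
static int model_migrate(struct model_cm_id *id)
{
	if (id->state == MODEL_CM_ESTABLISHED &&
	    (id->lap_state == MODEL_LAP_UNINIT ||
	     id->lap_state == MODEL_LAP_IDLE)) {
		id->lap_state = MODEL_LAP_IDLE;
		id->av = id->alt_av;
		return 0;
	}
	return -EINVAL;
}

/* Mirrors the switch in ib_cm_notify(): unknown events are rejected. */
int model_notify(struct model_cm_id *id, enum model_event event)
{
	assert(id != NULL);

	switch (event) {
	case MODEL_EVENT_COMM_EST:
		return model_establish(id);
	case MODEL_EVENT_PATH_MIG:
		return model_migrate(id);
	default:
		return -EINVAL;
	}
}
```

In this model, a path-migration event on an established connection in the UNINIT state succeeds and copies alt_av into av, while the same event arriving with a LAP outstanding (MODEL_LAP_SENT) is rejected with -EINVAL, which matches the condition shown in the diff.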
From diego.guella at sircomtech.com Wed Nov 8 01:21:01 2006 From: diego.guella at sircomtech.com (Diego Guella) Date: Wed, 8 Nov 2006 10:21:01 +0100 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails Message-ID: <003201c70317$36911f40$05c8a8c0@DIEGO> Hi, I already installed OFED-1.0 on a PowerEdge 2850 and a Dell Dimension 9100, both with SUSE Linux 9.3, and it runs without problems. Now I have a Dell PowerEdge 1950, and the only Linux distribution I can install correctly on this server seems to be SUSE 10.2, now in Beta1 state (but releasing on December). I tried to install OFED-1.1 on this server, but I get compilation errors while building the RPM. Here are the details: ----- server1950:~ # uname -a Linux server1950 2.6.18.1-13-default #1 SMP Mon Oct 30 14:26:03 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux ----- trying to install: ----- RPM packages: Install kernel-ib: [y/N]:y Kernel level modules: Install ib_verbs: [y/N]:y Install ib_mthca: [y/N]:y Install ib_ipoib: [y/N]:y Install ib_ipath: [y/N]:y Install ib_sdp: [y/N]:y Install ib_srp: [y/N]:y Install kernel-ib-devel: [y/N]:y User level libraries/applications: Install libibverbs: [y/N]:y Install libibverbs-devel: [y/N]:y Install libibverbs-utils: [y/N]:y Install libibcm: [y/N]:y Install libibcm-devel: [y/N]:y Install libmthca: [y/N]:y Install libmthca-devel: [y/N]:y Install perftest: [y/N]:y Install mstflint: [y/N]:y Install libipathverbs: [y/N]:y Install libipathverbs-devel: [y/N]:y Install ofed-docs: [y/N]:y Install ofed-scripts: [y/N]:y Install libsdp: [y/N]:y Install srptools: [y/N]:y Install ipoibtools: [y/N]:y Install tvflash: [y/N]:N Install libibcommon: [y/N]:y Install libibcommon-devel: [y/N]:y Install libibmad: [y/N]:y Install libibmad-devel: [y/N]:y Install libibumad: [y/N]:y Install libibumad-devel: [y/N]:y Install libopensm: [y/N]:y Install libopensm-devel: [y/N]:y Install opensm: [y/N]:y Install libosmcomp: [y/N]:y Install libosmcomp-devel: [y/N]:y Install libosmvendor: [y/N]:y 
Install libosmvendor-devel: [y/N]:y Install openib-diags: [y/N]:y Install librdmacm: [y/N]:y Install librdmacm-devel: [y/N]:y Install librdmacm-utils: [y/N]:y Install dapl: [y/N]:y Install dapl-devel: [y/N]:y Install mpi_osu: [y/N]:y Install openmpi: [y/N]:y Install mpitests: [y/N]:y Install ibutils: [y/N]:y WARNING: No compilers for mpi_osu were found WARNING: OSU MPI cannot be installed The following compiler(s) on your system can be used to build/install openmpi: gcc Do you wish to create/install an openmpi RPM with gcc? [Y/n]: The following compiler(s) will be used to install the openmpi RPM(s): gcc Following is the list of OFED packages that you have chosen (some may have been added by the installation program due to package dependencies): ib_ipath ib_ipoib ib_mthca ib_sdp ib_srp ib_verbs dapl dapl-devel ipoibtools kernel-ib kernel-ib-devel libibcm libibcm-devel libibcommon libibcommon-devel libibmad libibmad-devel libibumad libibumad-devel libibverbs libibverbs-devel libibverbs-utils libipathverbs libipathverbs-devel libmthca libmthca-devel libopensm libopensm-devel libosmcomp libosmcomp-devel libosmvendor libosmvendor-devel librdmacm librdmacm-devel librdmacm-utils libsdp mstflint openib-diags opensm perftest srptools ofed-docs ofed-scripts openmpi mpitests ibutils Preparing to build the OFED RPMs: Do you want to include IPoIB configuration files (ifcfg-ib*)? [Y/n]: RPM build process requires a temporary directory. Please enter the temporary directory [/var/tmp/OFED]: Please enter the OFED installation directory [/usr/local/ofed]: The following compiler(s) will be used to build the openmpi RPM(s): gcc Checking dependencies. Please wait ... Building InfiniBand Software RPMs. Please wait... Building openib RPMs. Please wait... 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm \ ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs 
--with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" See log file: /tmp/OFED.8057.log server1950:/opt/nfs_public/OFED-1.1 # ----- OFED.8057.log attached. The 'interesting' part of the log is: ----- gcc -Wp,-MD,/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/.ucma.o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.2/include -D__KERNEL__ -I/var/tmp/OFEDRPM/BUILD/openib-1.1/include \ -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/include \ -Iinclude \ -Iinclude2 -I/usr/src/linux-2.6.18.1-13/include \ -include include/linux/autoconf.h \ -include /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/autoconf.h \ -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -Os -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fomit-frame-pointer -fasynchronous-unwind-tables -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign -I/var/tmp/OFEDRPM/BUILD/openib-1.1/include -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/include -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/debug -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ucma)" -D"KBUILD_MODNAME=KBUILD_STR(rdma_ucm)" -c -o /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/.tmp_ucma.o /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c 
/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c: In function 'ucma_init': /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c:878: error: 'struct miscdevice' has no member named 'class' /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c: In function 'ucma_cleanup': /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c:892: error: 'struct miscdevice' has no member named 'class' make[5]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.o] Error 1 make[4]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core] Error 2 make[3]: *** [_module_/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband] Error 2 make[2]: *** [modules] Error 2 make[1]: *** [modules] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.18.1-13-obj/x86_64/default' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.13083 (%install) RPM build errors: user vlad does not exist - using root group mtl does not exist - using root user vlad does not exist - using root group mtl does not exist - using root Bad exit status from /var/tmp/rpm-tmp.13083 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags 
--with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" ----- Looking at ucma.c, it #includes Now, my /usr/src/linux-2.6.18.1-13/include/linux/miscdevice.h looks like this: ----- #ifndef _LINUX_MISCDEVICE_H #define _LINUX_MISCDEVICE_H #include #include #define PSMOUSE_MINOR 1 #define MS_BUSMOUSE_MINOR 2 #define ATIXL_BUSMOUSE_MINOR 3 /*#define AMIGAMOUSE_MINOR 4 FIXME OBSOLETE */ #define ATARIMOUSE_MINOR 5 #define SUN_MOUSE_MINOR 6 #define APOLLO_MOUSE_MINOR 7 #define PC110PAD_MINOR 9 /*#define ADB_MOUSE_MINOR 10 FIXME OBSOLETE */ #define WATCHDOG_MINOR 130 /* Watchdog timer */ #define TEMP_MINOR 131 /* Temperature Sensor */ #define RTC_MINOR 135 #define EFI_RTC_MINOR 136 /* EFI Time services */ #define SUN_OPENPROM_MINOR 139 #define DMAPI_MINOR 140 /* DMAPI */ #define NVRAM_MINOR 144 #define SGI_MMTIMER 153 #define STORE_QUEUE_MINOR 155 #define I2O_MINOR 166 #define MICROCODE_MINOR 184 #define MWAVE_MINOR 219 /* ACP/Mwave Modem */ #define MPT_MINOR 220 #define MISC_DYNAMIC_MINOR 255 #define TUN_MINOR 200 #define HPET_MINOR 228 struct device; struct miscdevice { int minor; const char *name; const struct file_operations *fops; struct list_head list; struct device *parent; struct device *this_device; }; extern int misc_register(struct miscdevice * misc); extern int misc_deregister(struct miscdevice * misc); #define MODULE_ALIAS_MISCDEV(minor) \ MODULE_ALIAS("char-major-" __stringify(MISC_MAJOR) \ "-" __stringify(minor)) #endif ----- and my /usr/include/linux/miscdevice.h looks like this: ----- #ifndef _LINUX_MISCDEVICE_H #define _LINUX_MISCDEVICE_H #include #include #define PSMOUSE_MINOR 1 #define MS_BUSMOUSE_MINOR 2 #define ATIXL_BUSMOUSE_MINOR 3 /*#define 
AMIGAMOUSE_MINOR 4 FIXME OBSOLETE */ #define ATARIMOUSE_MINOR 5 #define SUN_MOUSE_MINOR 6 #define APOLLO_MOUSE_MINOR 7 #define PC110PAD_MINOR 9 /*#define ADB_MOUSE_MINOR 10 FIXME OBSOLETE */ #define WATCHDOG_MINOR 130 /* Watchdog timer */ #define TEMP_MINOR 131 /* Temperature Sensor */ #define RTC_MINOR 135 #define EFI_RTC_MINOR 136 /* EFI Time services */ #define SUN_OPENPROM_MINOR 139 #define DMAPI_MINOR 140 /* DMAPI */ #define NVRAM_MINOR 144 #define SGI_MMTIMER 153 #define STORE_QUEUE_MINOR 155 #define I2O_MINOR 166 #define MICROCODE_MINOR 184 #define MWAVE_MINOR 219 /* ACP/Mwave Modem */ #define MPT_MINOR 220 #define MISC_DYNAMIC_MINOR 255 #define TUN_MINOR 200 #define HPET_MINOR 228 struct device; struct class_device; struct miscdevice { int minor; const char *name; const struct file_operations *fops; struct list_head list; struct device *dev; struct class_device *class; }; extern int misc_register(struct miscdevice * misc); extern int misc_deregister(struct miscdevice * misc); #define MODULE_ALIAS_MISCDEV(minor) \ MODULE_ALIAS("char-major-" __stringify(MISC_MAJOR) \ "-" __stringify(minor)) #endif ----- which seems what OFED-1.1 expects to have. 
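The difference between the two headers above can be illustrated with a small, self-contained sketch. This is not the OFED or kernel code: the struct definitions are cut-down stand-ins for the two layouts shown, `MISC_HAS_CLASS_MEMBER` is a hypothetical switch (for example, something a configure test could set), and the assumption that `this_device` plays the role formerly played by `class` is an editorial guess rather than anything confirmed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel types; only what the sketch needs. */
struct device { int id; };
struct class_device { int id; };

#ifdef MISC_HAS_CLASS_MEMBER            /* hypothetical switch, not a real kernel macro */
/* Layout from /usr/include/linux/miscdevice.h in the message above. */
struct miscdevice {
    int minor;
    const char *name;
    struct device *dev;
    struct class_device *class;
};
#define MISC_SYSFS_HANDLE(m) ((void *)(m)->class)
#else
/* Layout from the 2.6.18.1-13 kernel source tree in the message above. */
struct miscdevice {
    int minor;
    const char *name;
    struct device *parent;
    struct device *this_device;
};
/* Assumption: this_device takes over the role of the old class member. */
#define MISC_SYSFS_HANDLE(m) ((void *)(m)->this_device)
#endif

/* Hypothetical accessor so calling code never names the member directly. */
static void *misc_sysfs_handle(const struct miscdevice *m)
{
    assert(m != NULL);
    return MISC_SYSFS_HANDLE(m);
}
```

Code written against such an accessor would compile with either header layout, which is less intrusive than overwriting the kernel's own header.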
So, now i replace /usr/src/linux-2.6.18.1-13/include/linux/miscdevice.h with /usr/include/linux/miscdevice.h, launch the install again and see what happens: ----- ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" See log file: /tmp/OFED.29778.log ----- OFED.29778.log attached. 
The 'interesting' part of the log is: ----- gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include -DRESOLVE_HOSTNAMES -c -o utils.o utils.c utils.c: In function 'inet_addr_match': utils.c:333: warning: initialization discards qualifiers from pointer target type utils.c:334: warning: initialization discards qualifiers from pointer target type utils.c: In function '__get_hz': utils.c:368: error: 'HZ' undeclared (first use in this function) utils.c:368: error: (Each undeclared identifier is reported only once utils.c:368: error: for each function it appears in.) make[2]: *** [utils.o] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2/lib' make[1]: *** [lib] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2' make: *** [ipoibtools] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.4033 (%install) ----- Here, I give up: I don't understand where 'HZ' should be defined, or where it comes from. Where is the problem? Is there a solution to fix it? Thanks, Diego -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED.29778.log.tar.gz Type: application/octet-stream Size: 53887 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED.8057.log.tar.gz Type: application/octet-stream Size: 40704 bytes Desc: not available URL: From mst at mellanox.co.il Wed Nov 8 01:47:39 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 8 Nov 2006 11:47:39 +0200 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <4551A04F.2020307@voltaire.com> References: <4551A04F.2020307@voltaire.com> Message-ID: <20061108094739.GA32442@mellanox.co.il> Quoting Or Gerlitz : > > Protocols that rely on RC ACK for reliability guarantees (like SDP), basically > > do not make it possible to address the hca failure case: you got an ACK, but > > remote hca could have failed without committing data to memory. So APM failover > > is a requirement for these. It could be iser does not need APM, fine. > > This is news to me, does your HCA first sends an ACK and only then does > the DMA transaction and if needed generates the CQE !?!?!? I can't tell either way, but why not? Consider also that DMA write is a posted transaction - HCA gets no indication when it was committed to memory, so it can not delay the ACK until this occurs. > and how come APM is the solution to this crazy problem? If HCA failure is a crazy problem, then what is the sane problem APM does *not* solve? -- MST From ogerlitz at voltaire.com Wed Nov 8 02:10:15 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 08 Nov 2006 12:10:15 +0200 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <20061108094739.GA32442@mellanox.co.il> References: <4551A04F.2020307@voltaire.com> <20061108094739.GA32442@mellanox.co.il> Message-ID: <4551AD07.6040702@voltaire.com> Michael S. Tsirkin wrote: > Quoting Or Gerlitz : >>> Protocols that rely on RC ACK for reliability guarantees (like SDP), basically >>> do not make it possible to address the hca failure case: you got an ACK, but >>> remote hca could have failed without committing data to memory. So APM failover >>> is a requirement for these. It could be iser does not need APM, fine. >> This is news to me, does your HCA first sends an ACK and only then does >> the DMA transaction and if needed generates the CQE !?!?!? 
> I can't tell either way, but why not? > Consider also that DMA write is a posted transaction - HCA gets no indication > when it was committed to memory, so it can not delay the ACK until this occurs. OK, OK, I now see the IB spec piece below; I was somehow expecting too much from IB RC... Rethinking this, I now see that it is more problematic to support ACK-following-DMA-memory-write-success. 9.7.5.1.6 ACKNOWLEDGE MESSAGE SCHEDULING For SEND or RDMA WRITE requests, an ACK may be scheduled before data is actually written into the responder’s memory. The ACK simply indicates that the data has successfully reached the fault domain of the responding node. That is, the data has been received by the channel adapter and the channel adapter will write that data to the memory system of the responding node, or the responding application will at least be informed of the failure. So anyway, what is your HCA's behavior wrt this? >> and how come APM is the solution to this crazy problem? > If HCA failure is a crazy problem, then what is the sane problem APM does *not* solve? You misunderstood me; the "crazy problem" referred to my misconception of IB RC ACKs. My question is: how does APM solve the problem of transactions whose ACK was received but whose data was not written/committed to memory? I was thinking that once the HCA senses a path failure, APM makes the QP use the alternate path and retransmit all the unACKed messages, but you are referring to an ACKed message... Or. 
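The gap discussed above (a transport ACK does not prove that the data was committed to the responder's memory) can be made concrete with a small sender-side bookkeeping sketch. This is purely illustrative and is not part of SDP, iSER, APM or the IB spec: it assumes a hypothetical upper-layer protocol in which the peer application sends its own confirmation, and every name in it (`send_window`, `window_app_confirm`, and so on) is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WINDOW 8  /* hypothetical number of in-flight messages tracked */

struct tracked_msg {
    unsigned seq;
    bool in_use;
    bool transport_acked;  /* RC ACK seen: data reached the remote fault domain */
};

struct send_window {
    struct tracked_msg slot[WINDOW];
};

/* Remember a message when it is posted; returns false if the window is full. */
static bool window_post(struct send_window *w, unsigned seq)
{
    assert(w != NULL);
    for (size_t i = 0; i < WINDOW; i++) {
        if (!w->slot[i].in_use) {
            w->slot[i].seq = seq;
            w->slot[i].in_use = true;
            w->slot[i].transport_acked = false;
            return true;
        }
    }
    return false;
}

static struct tracked_msg *window_find(struct send_window *w, unsigned seq)
{
    for (size_t i = 0; i < WINDOW; i++)
        if (w->slot[i].in_use && w->slot[i].seq == seq)
            return &w->slot[i];
    return NULL;
}

/* A transport-level ACK is only recorded; it does NOT release the message. */
static void window_transport_ack(struct send_window *w, unsigned seq)
{
    struct tracked_msg *m = window_find(w, seq);
    if (m)
        m->transport_acked = true;
}

/* Only the (hypothetical) application-level confirmation releases it. */
static void window_app_confirm(struct send_window *w, unsigned seq)
{
    struct tracked_msg *m = window_find(w, seq);
    if (m)
        m->in_use = false;
}

/* After a failure, everything not yet confirmed by the peer application
 * is a replay candidate, whether or not the transport ACKed it. */
static size_t window_replay_set(const struct send_window *w,
                                unsigned *out, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < WINDOW && n < max; i++)
        if (w->slot[i].in_use)
            out[n++] = w->slot[i].seq;
    return n;
}
```

The point of the sketch is only the distinction between the two events: a message that was ACKed by the transport but never confirmed by the peer still appears in the replay set.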
From RAISCH at de.ibm.com Wed Nov 8 02:22:27 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Wed, 8 Nov 2006 11:22:27 +0100 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: Message-ID: Roland Dreier wrote on 07.11.2006 20:25:12: > > - *mapaddr = (u64)(ioremap(physaddr, EHCA_PAGESIZE)); > > + *mapaddr = (u64)ioremap((physaddr & PAGE_MASK), PAGE_SIZE) + > > + (physaddr & (~PAGE_MASK)); > > I'm confused -- shouldn't ioremap() do the right thing even if > physaddr isn't page-aligned? Why is this needed? > > - R. ioremap maps 4k pages on 4k kernels and 64k pages on 64k kernels. So much for the theory. This is true for memory. For mapped PCI or ebus registers things are a bit different. Some PCI adapters expect that a new area with a different meaning starts at every 4k page (ehca and mellanox adapters are definitely among them). The consequence is that you have to map only 4k instead of 64k; otherwise 15 other "access areas" would be mapped as well. On POWER the ebus memory is mapped by H_ENTER. The hypervisor checks for 4k page size on H_ENTER, for the reason given above. The nopage handler now does separate 4k H_ENTERs even for 64k pages in the ebus area; therefore we have to register a 64k page on a 64k boundary, and the nopage handler triggers the right H_ENTER as soon as we access the page at the right offset. We plan to change that as soon as the base kernel can handle mixed page sizes in a more official way. Christoph R. From dotanb at dev.mellanox.co.il Wed Nov 8 03:48:09 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 08 Nov 2006 13:48:09 +0200 Subject: [openib-general] [PATCH] 1/3 [core] Added support to IB_EVENT_GID_CHANGE async event Message-ID: <1162986489.12259.6.camel@mtls05.yok.mtl.com> Added support to IB_EVENT_GID_CHANGE async event in core. 
Signed-off-by: Dotan Barak --- Index: last_stable/drivers/infiniband/include/rdma/ib_verbs.h =================================================================== --- last_stable.orig/drivers/infiniband/include/rdma/ib_verbs.h 2006-10-24 17:06:06.000000000 +0200 +++ last_stable/drivers/infiniband/include/rdma/ib_verbs.h 2006-10-26 10:39:32.000000000 +0200 @@ -261,7 +261,8 @@ enum ib_event_type { IB_EVENT_SRQ_ERR, IB_EVENT_SRQ_LIMIT_REACHED, IB_EVENT_QP_LAST_WQE_REACHED, - IB_EVENT_CLIENT_REREGISTER + IB_EVENT_CLIENT_REREGISTER, + IB_EVENT_GID_CHANGE }; struct ib_event { From dotanb at dev.mellanox.co.il Wed Nov 8 03:48:27 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 08 Nov 2006 13:48:27 +0200 Subject: [openib-general] [PATCH] 3/3 [libibverbs] Added support to IBV_EVENT_GID_CHANGE async event Message-ID: <1162986507.12259.8.camel@mtls05.yok.mtl.com> Added support to IBV_EVENT_GID_CHANGE async event to libibverbs. Signed-off-by: Dotan Barak --- Index: last_stable/src/userspace/libibverbs/include/infiniband/verbs.h =================================================================== --- last_stable.orig/src/userspace/libibverbs/include/infiniband/verbs.h 2006-10-26 10:42:37.000000000 +0200 +++ last_stable/src/userspace/libibverbs/include/infiniband/verbs.h 2006-10-26 10:42:55.000000000 +0200 @@ -197,7 +197,8 @@ enum ibv_event_type { IBV_EVENT_SRQ_ERR, IBV_EVENT_SRQ_LIMIT_REACHED, IBV_EVENT_QP_LAST_WQE_REACHED, - IBV_EVENT_CLIENT_REREGISTER + IBV_EVENT_CLIENT_REREGISTER, + IBV_EVENT_GID_CHANGE }; struct ibv_async_event { Index: last_stable/src/userspace/libibverbs/examples/asyncwatch.c =================================================================== --- last_stable.orig/src/userspace/libibverbs/examples/asyncwatch.c 2006-10-26 10:44:52.000000000 +0200 +++ last_stable/src/userspace/libibverbs/examples/asyncwatch.c 2006-10-26 10:45:16.000000000 +0200 @@ -59,6 +59,9 @@ static const char *event_name_str(enum i return "IBV_EVENT_SM_CHANGE"; case 
IBV_EVENT_CLIENT_REREGISTER: return "IBV_EVENT_CLIENT_REREGISTER"; + case IBV_EVENT_GID_CHANGE: + return "IBV_EVENT_GID_CHANGE"; + case IBV_EVENT_CQ_ERR: case IBV_EVENT_QP_FATAL: From dotanb at dev.mellanox.co.il Wed Nov 8 03:48:23 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Wed, 08 Nov 2006 13:48:23 +0200 Subject: [openib-general] [PATCH] 2/3 [mthca] Added support to IB_EVENT_GID_CHANGE async event Message-ID: <1162986503.12259.7.camel@mtls05.yok.mtl.com> Added support to IB_EVENT_GID_CHANGE async event in mthca. Signed-off-by: Dotan Barak --- Index: last_stable/drivers/infiniband/hw/mthca/mthca_mad.c =================================================================== --- last_stable.orig/drivers/infiniband/hw/mthca/mthca_mad.c 2006-10-25 15:34:20.000000000 +0200 +++ last_stable/drivers/infiniband/hw/mthca/mthca_mad.c 2006-10-26 10:39:01.000000000 +0200 @@ -101,8 +101,8 @@ static void update_sm_ah(struct mthca_de } /* - * Snoop SM MADs for port info and P_Key table sets, so we can - * synthesize LID change and P_Key change events. + * Snoop SM MADs for port info, P_Key and GUID table sets, so we can + * synthesize LID change, P_Key change and GID change events. */ static void smp_snoop(struct ib_device *ibdev, u8 port_num, @@ -139,6 +139,13 @@ static void smp_snoop(struct ib_device * event.element.port_num = port_num; ib_dispatch_event(&event); } + + if (mad->mad_hdr.attr_id == IB_SMP_ATTR_GUID_INFO) { + event.device = ibdev; + event.event = IB_EVENT_GID_CHANGE; + event.element.port_num = port_num; + ib_dispatch_event(&event); + } } } From mst at mellanox.co.il Wed Nov 8 05:13:19 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 8 Nov 2006 15:13:19 +0200 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <4551AD07.6040702@voltaire.com> References: <4551AD07.6040702@voltaire.com> Message-ID: <20061108131319.GB478@mellanox.co.il> Quoting r. 
Or Gerlitz : > Subject: Re: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support > > Michael S. Tsirkin wrote: > > Quoting Or Gerlitz : > >>> Protocols that rely on RC ACK for reliability guarantees (like SDP), basically > >>> do not make it possible to address the hca failure case: you got an ACK, but > >>> remote hca could have failed without committing data to memory. So APM failover > >>> is a requirement for these. It could be iser does not need APM, fine. > >> This is news to me, does your HCA first sends an ACK and only then does > >> the DMA transaction and if needed generates the CQE !?!?!? > > > I can't tell either way, but why not? > > Consider also that DMA write is a posted transaction - HCA gets no indication > > when it was committed to memory, so it can not delay the ACK until this occurs. > > OK, OK, I see now the IB spec piece below, it was me expecting somehow > too much from IB RC... rethinking on this matter i see now its more > problematic to support this ack-following-dma-memory-write-success > > 9.7.5.1.6 ACKNOWLEDGE MESSAGE SCHEDULING > > For SEND or RDMA WRITE requests, an ACK may be scheduled before > data is actually written into the responder’s memory. The ACK simply > indicates that the data has successfully reached the fault domain of the > responding node. That is, the data has been received by the channel > adapter and the channel adapter will write that data to the memory > system of the responding node, or the responding application will at > least be informed of the failure. > > So anyway, what's your HCA behavior wrt this? The behavior matches the spec. I can't give you extra guarantees. > >> and how come APM is the solution to this crazy problem? > > > If HCA failure is a crazy problem, then what is the sane problem APM does *not* solve? > > you misunderstood me, the "crazy problem" was related to my > misconception of IB RC ACKs. 
> > My question is: how does APM solves the problem with transactions whose > ACK was received but their data was not written/committed to memory? APM does not solve it - I just say the problem as formulated is not solvable without protocol changes. So all we can solve for a generic RC protocol, is port/switch failure, and APM solves this elegantly and transparently. -- MST From ogerlitz at voltaire.com Wed Nov 8 05:14:15 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 08 Nov 2006 15:14:15 +0200 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0611051325p7546ef75qe5a6af74fe97b56f@mail.gmail.com> References: <1162506570.29948.567.camel@brick.pathscale.com> <20061102231404.GC15403@mellanox.co.il> <454DE183.80405@voltaire.com> <15ddcffd0611051325p7546ef75qe5a6af74fe97b56f@mail.gmail.com> Message-ID: <4551D827.3010600@voltaire.com> Or Gerlitz wrote: > On 11/5/06, Roland Dreier wrote: >> > I have mentioned this to Ralph in the past, just want to get ack/nak >> > on that from you: also on 64bit arch a block driver (eg SCSI LLD eg >> > SRP/iSER/etc) might get from higher level an SG whose pages are >> > **not** mapped into the kernel virtual address space. For example this >> > can happen with Direct I/O. >> >> No, I don't see how that could happen. Aren't all pages always mapped >> by the the kernel direct mapping on 64-bit architectures? > > I don't know exactly how this happens, but one of the comments i've > got from Christoph > on the iser code, is that one can't assume page_address(sg[i].page) > will not be NULL for SG passed to a SCSI LLD, i think Direct I/O is > one flow where this might happen. Christoph, can you clarify this point? Or. 
From tziporet at dev.mellanox.co.il Wed Nov 8 06:20:53 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 08 Nov 2006 16:20:53 +0200 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails In-Reply-To: <003201c70317$36911f40$05c8a8c0@DIEGO> References: <003201c70317$36911f40$05c8a8c0@DIEGO> Message-ID: <4551E7C5.8020700@dev.mellanox.co.il> Diego Guella wrote: > OFED.29778.log attached. > The 'interesting' part of the log is: > ----- > gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include > -DRESOLVE_HOSTNAMES -c -o utils.o utils.c > utils.c: In function ‘inet_addr_match’: > utils.c:333: warning: initialization discards qualifiers from pointer > target type > utils.c:334: warning: initialization discards qualifiers from pointer > target type > utils.c: In function ‘__get_hz’: > utils.c:368: error: ‘HZ’ undeclared (first use in this function) > utils.c:368: error: (Each undeclared identifier is reported only once > utils.c:368: error: for each function it appears in.) > make[2]: *** [utils.o] Error 1 > make[2]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2/lib' > make[1]: *** [lib] Error 2 > make[1]: Leaving directory > `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2' > make: *** [ipoibtools] Error 2 > error: Bad exit status from /var/tmp/rpm-tmp.4033 (%install) > ----- > Here, I give up: I don't understand where 'HZ' should be defined, or > where it comes from. > Where is the problem? > Is there a solution to fix it? The failing utility is used for IPoIB high availability. 
If you don't need to use them you can just change this line in ofed.conf: ipoibtools=n Tziporet From diego.guella at sircomtech.com Wed Nov 8 07:07:56 2006 From: diego.guella at sircomtech.com (Diego Guella) Date: Wed, 8 Nov 2006 16:07:56 +0100 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails References: <003201c70317$36911f40$05c8a8c0@DIEGO> <4551E7C5.8020700@dev.mellanox.co.il> Message-ID: <00a301c70347$accee4f0$05c8a8c0@DIEGO> From: "Tziporet Koren" > The failing is utility is used for IPoIB high availability. If you don't > need to use them you can just change this line in ofed.conf: > ipoibtools=n > > Tziporet > Thanks Tziporet for your answer. Tried just right now, i disabled ipoibtools. I get another, more strange error: (attached OFED.3816.log) ----- /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs Running: ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib CPPFLAGS="-I../libibverbs/include" configure: creating cache /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu checking for style of include used by make... GNU checking for gcc... gcc checking for C compiler default output file name... configure: error: C compiler cannot create executables See `config.log' for more details. 
Failed to execute: ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib CPPFLAGS="-I../libibverbs/include" error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) ----- Am I right? It says my C compiler cannot create executables???? Is it joking me???? In the log file, line 6393, it says: ----- checking for C compiler default output file name... a.out ----- I don't understand....! Is there something I can do to fix this? Thanks, Diego -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED.3816.log.tar.gz Type: application/octet-stream Size: 93119 bytes Desc: not available URL: From jlentini at netapp.com Wed Nov 8 07:23:22 2006 From: jlentini at netapp.com (James Lentini) Date: Wed, 8 Nov 2006 10:23:22 -0500 (EST) Subject: [openib-general] dm_client_main.c compile error In-Reply-To: References: Message-ID: On Wed, 8 Nov 2006, Patrick (Xentech) wrote: > I'm trying to compile revision 9087 on a 2.6.18.2 kernel. 
The > following error occurs: > > LD drivers/infiniband/client_query/built-in.o > CC [M] drivers/infiniband/client_query/client_query.o > CC [M] drivers/infiniband/client_query/client_query_export.o > CC [M] drivers/infiniband/client_query/client_query_main.o > CC [M] drivers/infiniband/client_query/dm_client_main.o > drivers/infiniband/client_query/dm_client_main.c:51: error: syntax error before string constant > drivers/infiniband/client_query/dm_client_main.c:51: warning: type defaults to `int' in declaration of `MODULE_PARM' > drivers/infiniband/client_query/dm_client_main.c:51: warning: function declaration isn't a prototype > drivers/infiniband/client_query/dm_client_main.c:51: warning: data definition has no type or storage class > make[3]: *** [drivers/infiniband/client_query/dm_client_main.o] Error 1 > make[2]: *** [drivers/infiniband/client_query] Error 2 > make[1]: *** [drivers/infiniband] Error 2 > make: *** [drivers] Error 2 Nobody is maintaining that part of the tree. The maintained kernel code is located at: https://openfabrics.org/svn/gen2/trunk/src/linux-kernel/infiniband/ However, the kernel code repository is moving from svn to git as we speak. From dotanb at mellanox.co.il Wed Nov 8 08:00:23 2006 From: dotanb at mellanox.co.il (Dotan Barak) Date: Wed, 8 Nov 2006 18:00:23 +0200 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails Message-ID: <6C2C79E72C305246B504CBA17B5500C91BD834@mtlexch01.mtl.com> Hi Diego. You got the following output: utils.c: In function '__get_hz': utils.c:368: error: 'HZ' undeclared (first use in this function) utils.c:368: error: (Each undeclared identifier is reported only once utils.c:368: error: for each function it appears in.) because the macro HZ wasn't found by the compiler. 
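A guard for a missing HZ macro in user-space code could look roughly like the sketch below. It is an illustration only, not the iproute2 source: `ticks_per_second` is a hypothetical helper, and falling back first to `sysconf(_SC_CLK_TCK)` and then to 100 is an assumption about what a reasonable default would be.

```c
#include <assert.h>
#include <unistd.h>

/* Fallback used only when no included header has already provided HZ. */
#ifndef HZ
#define HZ sysconf(_SC_CLK_TCK)
#endif

/* Hypothetical helper: returns a usable ticks-per-second value. */
static long ticks_per_second(void)
{
    long hz = HZ;

    /* sysconf() returns -1 on error; assume 100 as a last resort. */
    if (hz <= 0)
        hz = 100;
    assert(hz > 0);
    return hz;
}
```

Because the definition sits behind `#ifndef HZ`, it does not interfere on distributions whose headers already define the macro.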
On my machine HZ is defined in two files: 1) asm/param.h: #define HZ sysconf(_SC_CLK_TCK) 2) asm-x86_64/param.h, where I noticed the following code (this define is used in the compilation process on my 1000 MHz machine): #ifndef HZ #define HZ 100 #endif I believe that an include is missing in this distribution. I think that if you add the latter lines, everything will work ... Dotan -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Diego Guella Sent: Wednesday, November 08, 2006 11:21 AM To: openib-general at openib.org Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails Hi, I already installed OFED-1.0 on a PowerEdge 2850 and a Dell Dimension 9100, both with SUSE Linux 9.3, and it runs without problems. Now I have a Dell PowerEdge 1950, and the only Linux distribution I can install correctly on this server seems to be SUSE 10.2, now in Beta1 state (but releasing on December). I tried to install OFED-1.1 on this server, but I get compilation errors while building the RPM. 
Here are the details: ----- server1950:~ # uname -a Linux server1950 2.6.18.1-13-default #1 SMP Mon Oct 30 14:26:03 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux ----- trying to install: ----- RPM packages: Install kernel-ib: [y/N]:y Kernel level modules: Install ib_verbs: [y/N]:y Install ib_mthca: [y/N]:y Install ib_ipoib: [y/N]:y Install ib_ipath: [y/N]:y Install ib_sdp: [y/N]:y Install ib_srp: [y/N]:y Install kernel-ib-devel: [y/N]:y User level libraries/applications: Install libibverbs: [y/N]:y Install libibverbs-devel: [y/N]:y Install libibverbs-utils: [y/N]:y Install libibcm: [y/N]:y Install libibcm-devel: [y/N]:y Install libmthca: [y/N]:y Install libmthca-devel: [y/N]:y Install perftest: [y/N]:y Install mstflint: [y/N]:y Install libipathverbs: [y/N]:y Install libipathverbs-devel: [y/N]:y Install ofed-docs: [y/N]:y Install ofed-scripts: [y/N]:y Install libsdp: [y/N]:y Install srptools: [y/N]:y Install ipoibtools: [y/N]:y Install tvflash: [y/N]:N Install libibcommon: [y/N]:y Install libibcommon-devel: [y/N]:y Install libibmad: [y/N]:y Install libibmad-devel: [y/N]:y Install libibumad: [y/N]:y Install libibumad-devel: [y/N]:y Install libopensm: [y/N]:y Install libopensm-devel: [y/N]:y Install opensm: [y/N]:y Install libosmcomp: [y/N]:y Install libosmcomp-devel: [y/N]:y Install libosmvendor: [y/N]:y Install libosmvendor-devel: [y/N]:y Install openib-diags: [y/N]:y Install librdmacm: [y/N]:y Install librdmacm-devel: [y/N]:y Install librdmacm-utils: [y/N]:y Install dapl: [y/N]:y Install dapl-devel: [y/N]:y Install mpi_osu: [y/N]:y Install openmpi: [y/N]:y Install mpitests: [y/N]:y Install ibutils: [y/N]:y WARNING: No compilers for mpi_osu were found WARNING: OSU MPI cannot be installed The following compiler(s) on your system can be used to build/install openmpi: gcc Do you wish to create/install an openmpi RPM with gcc? 
[Y/n]: The following compiler(s) will be used to install the openmpi RPM(s): gcc Following is the list of OFED packages that you have chosen (some may have been added by the installation program due to package dependencies): ib_ipath ib_ipoib ib_mthca ib_sdp ib_srp ib_verbs dapl dapl-devel ipoibtools kernel-ib kernel-ib-devel libibcm libibcm-devel libibcommon libibcommon-devel libibmad libibmad-devel libibumad libibumad-devel libibverbs libibverbs-devel libibverbs-utils libipathverbs libipathverbs-devel libmthca libmthca-devel libopensm libopensm-devel libosmcomp libosmcomp-devel libosmvendor libosmvendor-devel librdmacm librdmacm-devel librdmacm-utils libsdp mstflint openib-diags opensm perftest srptools ofed-docs ofed-scripts openmpi mpitests ibutils Preparing to build the OFED RPMs: Do you want to include IPoIB configuration files (ifcfg-ib*)? [Y/n]: RPM build process requires a temporary directory. Please enter the temporary directory [/var/tmp/OFED]: Please enter the OFED installation directory [/usr/local/ofed]: The following compiler(s) will be used to build the openmpi RPM(s): gcc Checking dependencies. Please wait ... Building InfiniBand Software RPMs. Please wait... Building openib RPMs. Please wait... 
Running rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm \ ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs 
--with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" See log file: /tmp/OFED.8057.log server1950:/opt/nfs_public/OFED-1.1 # ----- OFED.8057.log attached. The 'interesting' part of the log is: ----- gcc -Wp,-MD,/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/.ucma. o.d -nostdinc -isystem /usr/lib64/gcc/x86_64-suse-linux/4.1.2/include -D__KERNEL__ -I/var/tmp/OFEDRPM/BUILD/openib-1.1/include \ -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/include \ -Iinclude \ -Iinclude2 -I/usr/src/linux-2.6.18.1-13/include \ -include include/linux/autoconf.h \ -include /var/tmp/OFEDRPM/BUILD/openib-1.1/include/linux/autoconf.h \ -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -Werror-implicit-function-declaration -fno-strict-aliasing -fno-common -Os -mtune=generic -m64 -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -funit-at-a-time -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fomit-frame-pointer -fasynchronous-unwind-tables -fno-stack-protector -Wdeclaration-after-statement -Wno-pointer-sign -I/var/tmp/OFEDRPM/BUILD/openib-1.1/include -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/include -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib -I/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/debug -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(ucma)" -D"KBUILD_MODNAME=KBUILD_STR(rdma_ucm)" -c -o /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/.tmp_ucma.o /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c 
/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c: In function 'ucma_init': /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c:878: error: 'struct miscdevice' has no member named 'class' /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c: In function 'ucma_cleanup': /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.c:892: error: 'struct miscdevice' has no member named 'class' make[5]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core/ucma.o] Error 1 make[4]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/core] Error 2 make[3]: *** [_module_/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband] Error 2 make[2]: *** [modules] Error 2 make[1]: *** [modules] Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.18.1-13-obj/x86_64/default' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.13083 (%install) RPM build errors: user vlad does not exist - using root group mtl does not exist - using root user vlad does not exist - using root group mtl does not exist - using root Bad exit status from /var/tmp/rpm-tmp.13083 (%install) ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags 
--with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm"
-----

Looking at ucma.c, it #includes <linux/miscdevice.h>.

Now, my /usr/src/linux-2.6.18.1-13/include/linux/miscdevice.h looks like this:

-----
#ifndef _LINUX_MISCDEVICE_H
#define _LINUX_MISCDEVICE_H
#include <linux/module.h>
#include <linux/major.h>

#define PSMOUSE_MINOR 1
#define MS_BUSMOUSE_MINOR 2
#define ATIXL_BUSMOUSE_MINOR 3
/*#define AMIGAMOUSE_MINOR 4 FIXME OBSOLETE */
#define ATARIMOUSE_MINOR 5
#define SUN_MOUSE_MINOR 6
#define APOLLO_MOUSE_MINOR 7
#define PC110PAD_MINOR 9
/*#define ADB_MOUSE_MINOR 10 FIXME OBSOLETE */
#define WATCHDOG_MINOR 130 /* Watchdog timer */
#define TEMP_MINOR 131 /* Temperature Sensor */
#define RTC_MINOR 135
#define EFI_RTC_MINOR 136 /* EFI Time services */
#define SUN_OPENPROM_MINOR 139
#define DMAPI_MINOR 140 /* DMAPI */
#define NVRAM_MINOR 144
#define SGI_MMTIMER 153
#define STORE_QUEUE_MINOR 155
#define I2O_MINOR 166
#define MICROCODE_MINOR 184
#define MWAVE_MINOR 219 /* ACP/Mwave Modem */
#define MPT_MINOR 220
#define MISC_DYNAMIC_MINOR 255

#define TUN_MINOR 200
#define HPET_MINOR 228

struct device;

struct miscdevice {
	int minor;
	const char *name;
	const struct file_operations *fops;
	struct list_head list;
	struct device *parent;
	struct device *this_device;
};

extern int misc_register(struct miscdevice * misc);
extern int misc_deregister(struct miscdevice * misc);

#define MODULE_ALIAS_MISCDEV(minor) \
	MODULE_ALIAS("char-major-" __stringify(MISC_MAJOR) \
	"-" __stringify(minor))
#endif
-----

and my /usr/include/linux/miscdevice.h looks like this:

-----
#ifndef _LINUX_MISCDEVICE_H
#define _LINUX_MISCDEVICE_H
#include <linux/module.h>
#include <linux/major.h>

#define PSMOUSE_MINOR 1
#define MS_BUSMOUSE_MINOR 2
#define ATIXL_BUSMOUSE_MINOR 3
/*#define AMIGAMOUSE_MINOR 4 FIXME OBSOLETE */
#define ATARIMOUSE_MINOR 5
#define SUN_MOUSE_MINOR 6
#define APOLLO_MOUSE_MINOR 7
#define PC110PAD_MINOR 9
/*#define ADB_MOUSE_MINOR 10 FIXME OBSOLETE */
#define WATCHDOG_MINOR 130 /* Watchdog timer */
#define TEMP_MINOR 131 /* Temperature Sensor */
#define RTC_MINOR 135
#define EFI_RTC_MINOR 136 /* EFI Time services */
#define SUN_OPENPROM_MINOR 139
#define DMAPI_MINOR 140 /* DMAPI */
#define NVRAM_MINOR 144
#define SGI_MMTIMER 153
#define STORE_QUEUE_MINOR 155
#define I2O_MINOR 166
#define MICROCODE_MINOR 184
#define MWAVE_MINOR 219 /* ACP/Mwave Modem */
#define MPT_MINOR 220
#define MISC_DYNAMIC_MINOR 255

#define TUN_MINOR 200
#define HPET_MINOR 228

struct device;
struct class_device;

struct miscdevice {
	int minor;
	const char *name;
	const struct file_operations *fops;
	struct list_head list;
	struct device *dev;
	struct class_device *class;
};

extern int misc_register(struct miscdevice * misc);
extern int misc_deregister(struct miscdevice * misc);

#define MODULE_ALIAS_MISCDEV(minor) \
	MODULE_ALIAS("char-major-" __stringify(MISC_MAJOR) \
	"-" __stringify(minor))
#endif
-----

which seems to be what OFED-1.1 expects.
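
To make the mismatch concrete, here is a minimal user-space sketch; it is not kernel code and not a fix. The two structs copy only some of the members from the two header listings above. The names miscdevice_kernel_src, miscdevice_usr_include, get_class() and get_this_device() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types; only pointers to them matter. */
struct device { int id; };
struct class_device { int id; };

/* Members as listed in /usr/src/linux-2.6.18.1-13/include/linux/miscdevice.h
 * (fops and list left out for brevity). */
struct miscdevice_kernel_src {
	int minor;
	const char *name;
	struct device *parent;
	struct device *this_device;
};

/* Members as listed in /usr/include/linux/miscdevice.h. */
struct miscdevice_usr_include {
	int minor;
	const char *name;
	struct device *dev;
	struct class_device *class;
};

/* Code written against the second layout can name ->class. */
static struct class_device *get_class(struct miscdevice_usr_include *m)
{
	assert(m != NULL);
	return m->class;
}

/* The first layout has no member called class; its last member is
 * this_device. Writing m->class here would be rejected with
 * "has no member named 'class'", as in the build log. */
static struct device *get_this_device(struct miscdevice_kernel_src *m)
{
	assert(m != NULL);
	return m->this_device;
}
```

Swapping the header only changes which member names the compiler accepts. Whether a module built that way agrees with the kernel it is loaded into is a separate question that this sketch does not answer.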
So now I replace /usr/src/linux-2.6.18.1-13/include/linux/miscdevice.h with /usr/include/linux/miscdevice.h, launch the install again, and see what happens:

-----
ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm"
See log file: /tmp/OFED.29778.log
-----

OFED.29778.log attached.
The 'interesting' part of the log is:

-----
gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include -DRESOLVE_HOSTNAMES -c -o utils.o utils.c
utils.c: In function 'inet_addr_match':
utils.c:333: warning: initialization discards qualifiers from pointer target type
utils.c:334: warning: initialization discards qualifiers from pointer target type
utils.c: In function '__get_hz':
utils.c:368: error: 'HZ' undeclared (first use in this function)
utils.c:368: error: (Each undeclared identifier is reported only once
utils.c:368: error: for each function it appears in.)
make[2]: *** [utils.o] Error 1
make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2/lib'
make[1]: *** [lib] Error 2
make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2'
make: *** [ipoibtools] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.4033 (%install)
-----

Here, I give up: I don't understand where 'HZ' should be defined, or where it comes from.

Where is the problem? Is there a solution to fix it?

Thanks,
Diego

From bunk at stusta.de Wed Nov 8 08:28:32 2006
From: bunk at stusta.de (Adrian Bunk)
Date: Wed, 8 Nov 2006 17:28:32 +0100
Subject: [openib-general] infiniband/hw/amso1100/c2_provider.c: possible NULL dereference
Message-ID: <20061108162832.GB4729@stusta.de>

The Coverity checker noted the following in
drivers/infiniband/hw/amso1100/c2_provider.c:

<-- snip -->

...
int c2_register_device(struct c2_dev *dev)
{
        int ret;
        int i;

        /* Register pseudo network device */
        dev->pseudo_netdev = c2_pseudo_netdev_init(dev);
        if (dev->pseudo_netdev) {
                ret = register_netdev(dev->pseudo_netdev);
                if (ret) {
                        printk(KERN_ERR PFX
                                "Unable to register netdev, ret = %d\n", ret);
                        free_netdev(dev->pseudo_netdev);
                        return ret;
                }
        }

        pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
        strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX);
        dev->ibdev.owner = THIS_MODULE;
        dev->ibdev.uverbs_cmd_mask =
            (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
            (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
            (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
            (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
            (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
            (1ull << IB_USER_VERBS_CMD_REG_MR) |
            (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
            (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
            (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
            (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
            (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
            (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
            (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
            (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
            (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
            (1ull << IB_USER_VERBS_CMD_POST_SEND) |
            (1ull << IB_USER_VERBS_CMD_POST_RECV);

        dev->ibdev.node_type = RDMA_NODE_RNIC;
        memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
        memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6);
...                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

<-- snip -->

Above there's an "if (dev->pseudo_netdev)" check, but here it's
dereferenced without a check.

It seems instead of the "if (dev->pseudo_netdev)", there should be some
kind of

        if (!dev->pseudo_netdev)
                return -ESOME_ERROR;

cu
Adrian

--

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

From swise at opengridcomputing.com Wed Nov 8 08:31:47 2006
From: swise at opengridcomputing.com (Steve WIse)
Date: Wed, 08 Nov 2006 08:31:47 -0800
Subject: [openib-general] infiniband/hw/amso1100/c2_provider.c: possible NULL dereference
In-Reply-To: <20061108162832.GB4729@stusta.de>
References: <20061108162832.GB4729@stusta.de>
Message-ID: <1163003508.4142.0.camel@linux-q667.site>

yep. We'll fix this up asap...

Thanks,

Steve.

On Wed, 2006-11-08 at 17:28 +0100, Adrian Bunk wrote:
> The Coverity checker noted the following in
> drivers/infiniband/hw/amso1100/c2_provider.c:
>
> <-- snip -->
>
> ...
> int c2_register_device(struct c2_dev *dev)
> {
>         int ret;
>         int i;
>
>         /* Register pseudo network device */
>         dev->pseudo_netdev = c2_pseudo_netdev_init(dev);
>         if (dev->pseudo_netdev) {
>                 ret = register_netdev(dev->pseudo_netdev);
>                 if (ret) {
>                         printk(KERN_ERR PFX
>                                 "Unable to register netdev, ret = %d\n", ret);
>                         free_netdev(dev->pseudo_netdev);
>                         return ret;
>                 }
>         }
>
>         pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
>         strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX);
>         dev->ibdev.owner = THIS_MODULE;
>         dev->ibdev.uverbs_cmd_mask =
>             (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
>             (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
>             (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
>             (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
>             (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
>             (1ull << IB_USER_VERBS_CMD_REG_MR) |
>             (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
>             (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
>             (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
>             (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
>             (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
>             (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
>             (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
>             (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
>             (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
>             (1ull << IB_USER_VERBS_CMD_POST_SEND) |
>             (1ull << IB_USER_VERBS_CMD_POST_RECV);
>
>         dev->ibdev.node_type = RDMA_NODE_RNIC;
>         memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
>         memcpy(&dev->ibdev.node_guid, dev->pseudo_netdev->dev_addr, 6);
> ...                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> <-- snip -->
>
> Above there's an "if (dev->pseudo_netdev)" check, but here it's
> dereferenced without a check.
>
> It seems instead of the "if (dev->pseudo_netdev)", there should be some
> kind of
>
>         if (!dev->pseudo_netdev)
>                 return -ESOME_ERROR;
>
>
> cu
> Adrian
>

From mshefty at ichips.intel.com Wed Nov 8 11:52:21 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 08 Nov 2006 11:52:21 -0800
Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
In-Reply-To: <20061108063223.GA30654@mellanox.co.il>
References: <000001c702f7$798363a0$9bd0180a@amr.corp.intel.com> <20061108063223.GA30654@mellanox.co.il>
Message-ID: <45523575.8020906@ichips.intel.com>

> Yes - I am hearing this is as a requirement from SDP.

We will need to discuss the best approach to adding this support to the
CMA. However, I should mention that this is not a requirement for any of
the path forward work that I'm doing, and since path forward work sets
most of my development schedule, I'm not sure when it will get done.

- Sean

From tom at opengridcomputing.com Wed Nov 8 12:23:22 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Wed, 08 Nov 2006 14:23:22 -0600
Subject: [openib-general] [PATCH 1/1] Unitialized pseudo_netdev accessed in c2_register_device
Message-ID: <1163017402.8753.13.camel@trinity.ogc.int>

Roland/Adrian:

Reworked some load-time error handling. c2_register_device leaked when
it failed and the function that called it didn't check the return code.
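
The rework follows the usual acquire-in-order, release-in-reverse pattern with goto labels. A rough user-space sketch of that pattern, not the driver code, might look like this. The res_state flags, try_step() and register_sketch() are hypothetical names, and the fail_at parameter exists only to inject a failure at a chosen step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flags standing in for the resources the driver holds. */
struct res_state {
	int netdev_allocated;
	int netdev_registered;
	int ibdev_registered;
};

/* Returns -EIO when this step is the one chosen to fail, 0 otherwise. */
static int try_step(int step, int fail_at)
{
	return step == fail_at ? -EIO : 0;
}

/*
 * Run four steps that acquire three resources in order. On failure,
 * jump to the label that releases everything acquired so far, in
 * reverse order, and return the error. fail_at == 0 means no failure.
 */
static int register_sketch(struct res_state *st, int fail_at)
{
	int ret = -ENOMEM;

	assert(st != NULL);

	/* Step 1: allocation; a failure here leaves nothing to undo. */
	if (fail_at == 1)
		goto out;
	st->netdev_allocated = 1;

	/* Step 2: register the net device. */
	ret = try_step(2, fail_at);
	if (ret)
		goto err_free_netdev;
	st->netdev_registered = 1;

	/* Step 3: register the IB device. */
	ret = try_step(3, fail_at);
	if (ret)
		goto err_unregister_netdev;
	st->ibdev_registered = 1;

	/* Step 4: create attribute files; no extra state to track. */
	ret = try_step(4, fail_at);
	if (ret)
		goto err_unregister_ibdev;

	return 0;

err_unregister_ibdev:
	st->ibdev_registered = 0;
err_unregister_netdev:
	st->netdev_registered = 0;
err_free_netdev:
	st->netdev_allocated = 0;
out:
	return ret;
}
```

Whichever step fails, the caller sees a negative return value and no resource stays held. The caller still has to check that return value, which is the other half of the fix described above.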

Signed-off-by: Tom Tucker
---

 drivers/infiniband/hw/amso1100/c2.c          |    3 +-
 drivers/infiniband/hw/amso1100/c2_provider.c |   39 +++++++++++++-------------
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c
index 9e7bd94..27fe242 100644
--- a/drivers/infiniband/hw/amso1100/c2.c
+++ b/drivers/infiniband/hw/amso1100/c2.c
@@ -1155,7 +1155,8 @@ static int __devinit c2_probe(struct pci
 		goto bail10;
 	}
 
-	c2_register_device(c2dev);
+	if (c2_register_device(c2dev))
+		goto bail10;
 
 	return 0;
 
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index da98d9f..fef9727 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -757,20 +757,17 @@ #endif
 
 int c2_register_device(struct c2_dev *dev)
 {
-	int ret;
+	int ret = -ENOMEM;
 	int i;
 
 	/* Register pseudo network device */
 	dev->pseudo_netdev = c2_pseudo_netdev_init(dev);
-	if (dev->pseudo_netdev) {
-		ret = register_netdev(dev->pseudo_netdev);
-		if (ret) {
-			printk(KERN_ERR PFX
-				"Unable to register netdev, ret = %d\n", ret);
-			free_netdev(dev->pseudo_netdev);
-			return ret;
-		}
-	}
+	if (!dev->pseudo_netdev)
+		goto out3;
+
+	ret = register_netdev(dev->pseudo_netdev);
+	if (ret)
+		goto out2;
 
 	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
 	strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX);
@@ -848,21 +845,25 @@ int c2_register_device(struct c2_dev *de
 
 	ret = ib_register_device(&dev->ibdev);
 	if (ret)
-		return ret;
+		goto out1;
 
 	for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) {
 		ret = class_device_create_file(&dev->ibdev.class_dev,
 					       c2_class_attributes[i]);
-		if (ret) {
-			unregister_netdev(dev->pseudo_netdev);
-			free_netdev(dev->pseudo_netdev);
-			ib_unregister_device(&dev->ibdev);
-			return ret;
-		}
+		if (ret)
+			goto out0;
 	}
+	goto out3;
 
-	pr_debug("%s:%u\n", __FUNCTION__, __LINE__);
-	return 0;
+out0:
+	ib_unregister_device(&dev->ibdev);
+out1:
+	unregister_netdev(dev->pseudo_netdev);
+out2:
+	free_netdev(dev->pseudo_netdev);
+out3:
+	pr_debug("%s:%u ret=%d\n", __FUNCTION__, __LINE__, ret);
+	return ret;
 }
 
 void c2_unregister_device(struct c2_dev *dev)

From rdreier at cisco.com Wed Nov 8 14:06:42 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 08 Nov 2006 14:06:42 -0800
Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode
In-Reply-To: (Christoph Raisch's message of "Wed, 8 Nov 2006 11:22:27 +0100")
References:
Message-ID:

> We plan to change that as soon as the base kernel can handle mixed
> pagesizes in a more official way.

OK, so this is just a temporary workaround for powerpc's broken
ioremap()? I'll apply for 2.6.19, and I hope we can back this out in
2.6.20.

- R.

From chris_youb at yahoo.ca Wed Nov 8 14:06:58 2006
From: chris_youb at yahoo.ca (chris_youb at yahoo.ca)
Date: Thu, 9 Nov 2006 06:06:58 +0800
Subject: [openib-general] OFED 1.1 on Debian based system?
In-Reply-To: <20061023131535.78bd0681@marvin.local>
Message-ID: <31128398.1163023618172.JavaMail.websites@opensubscriber>

Trying to install OFED on a Debian system; Ubuntu 5.10 w/ Xen (2.6.16-xen).

1. Running install from OFED-1.1/SOURCES/openib-1.1 gives the following error:

Running configure ...
ERROR: Failed to execute: env CONFIG_INFINIBAND=m CONFIG_INFINIBAND_USER_MAD=m CONFIG_INFINIBAND_USER_ACCESS=m CONFIG_INFINIBAND_ADDR_TRANS=y CONFIG_INFINIBAND_MTHCA=m CONFIG_INFINIBAND_MTHCA_DEBUG=y CONFIG_INFINIBAND_IPOIB=m CONFIG_INFINIBAND_IPOIB_DEBUG=y CONFIG_INFINIBAND_SDP=m CONFIG_INFINIBAND_SDP_DEBUG=y CONFIG_INFINIBAND_SRP=m ./configure --prefix=/usr/local/ofed --libdir=/usr/local/ofed/lib
See /tmp/openib_gen2/configure.log for more details
Please open an issue in the http://openib.org/bugzilla and attach /tmp/openib_gen2/debug_info.tgz

That .tgz contains a lot of failure messages like this:

(Stripping trailing CRs from patch.)
patching file drivers/infiniband/hw/ipath/Kconfig
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file drivers/infiniband/hw/ipath/Kconfig.rej
(Stripping trailing CRs from patch.)
patching file drivers/infiniband/hw/ipath/Makefile
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file drivers/infiniband/hw/ipath/Makefile.rej

2. I had some luck from here: OFED-1.1/SOURCES/openib-1.1/src/userspace/management

2a. ./configure used to complain as follows: "configure:20603: error: cannot compute sizeof (long), 77". This was actually a library problem: "./conftest: error while loading shared libraries: libibverbs.so.1: cannot open shared object file: No such file or directory"

2b. Explicitly setting "LD_LIBRARY_PATH=/usr/local/lib" worked. I could "configure" and "make" / "make install" opensm and its libraries.

2c. Opensm compiles and runs, but of course I need the kernel modules.

# opensm
-------------------------------------------------
OpenSM Rev:openib-2.0.5 Based on OpenIB svn Exported revision
Command Line Arguments:
 Log File: /var/log/osm.log
-------------------------------------------------
OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision

ibwarn: [500] umad_init: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?

3. I tried compiling the kernel modules by themselves: "make kernel", "make install_modules". They all run and complete without errors, yet "/lib/modules/2.6.16-xen/kernel/drivers/" has nothing and there isn't a *.ko to be found.

In order to at least run opensm I believe I need ib_core, ib_mthca, ib_mad, ib_umad and ib_uverbs. Has anyone gotten further than me with running opensm on a Debian system and has information to share?

-- This message was sent on behalf of chris_youb at yahoo.ca at openSubscriber.com http://www.opensubscriber.com/message/openib-general at openib.org/5189364.html From bos at pathscale.com Wed Nov 8 13:20:51 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 08 Nov 2006 14:20:51 -0700 Subject: [openib-general] [PATCH] Fix patch Message-ID: <5011f2a1dbc9446539c1.1163024451@localhost.localdomain> diff -r b610f87d01e2 -r 5011f2a1dbc9 ipath-htirq.patch --- a/ipath-htirq.patch Wed Nov 08 12:08:13 2006 -0800 +++ b/ipath-htirq.patch Wed Nov 08 14:20:04 2006 -0800 @@ -1,15 +1,16 @@ IB/ipath - program interrupt control reg -IB/ipath - program interrupt control register using new htirq hook +IB/ipath - program intconfig register using new HT irq hook Eric's changes to the htirq infrastructure require corresponding modifications to the ipath HT driver code so that interrupts are still delivered properly. Signed-off-by: Bryan O'Sullivan -Cc: Eric W. Biedermann +Cc: Eric W. Biederman +Cc: Roland Dreier -diff -r bb12c8d85f7c drivers/infiniband/hw/ipath/ipath_driver.c ---- a/drivers/infiniband/hw/ipath/ipath_driver.c Tue Nov 07 11:35:24 2006 -0800 -+++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 11:25:18 2006 -0800 +diff -r 69779e2890e3 drivers/infiniband/hw/ipath/ipath_driver.c +--- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 14:17:04 2006 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 14:17:04 2006 -0800 @@ -304,7 +304,7 @@ static int __devinit ipath_init_one(stru } addr = pci_resource_start(pdev, 0); @@ -54,9 +55,9 @@ diff -r bb12c8d85f7c drivers/infiniband/ } else ipath_dbg("irq is 0, not doing free_irq " "for unit %u\n", dd->ipath_unit); -diff -r bb12c8d85f7c drivers/infiniband/hw/ipath/ipath_iba6110.c ---- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Tue Nov 07 11:35:24 2006 -0800 -+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Wed Nov 08 11:57:00 2006 -0800 +diff -r 69779e2890e3 
drivers/infiniband/hw/ipath/ipath_iba6110.c +--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Wed Nov 08 14:17:04 2006 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Wed Nov 08 14:17:04 2006 -0800 @@ -38,6 +38,7 @@ #include @@ -241,9 +242,9 @@ diff -r bb12c8d85f7c drivers/infiniband/ /* * initialize chip-specific variables -diff -r bb12c8d85f7c drivers/infiniband/hw/ipath/ipath_iba6120.c ---- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Tue Nov 07 11:35:24 2006 -0800 -+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Wed Nov 08 11:56:55 2006 -0800 +diff -r 69779e2890e3 drivers/infiniband/hw/ipath/ipath_iba6120.c +--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Wed Nov 08 14:17:04 2006 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Wed Nov 08 14:17:04 2006 -0800 @@ -851,6 +851,7 @@ static int ipath_setup_pe_config(struct int pos, ret; @@ -273,9 +274,9 @@ diff -r bb12c8d85f7c drivers/infiniband/ /* initialize chip-specific variables */ dd->ipath_f_tidtemplate = ipath_pe_tidtemplate; -diff -r bb12c8d85f7c drivers/infiniband/hw/ipath/ipath_intr.c ---- a/drivers/infiniband/hw/ipath/ipath_intr.c Tue Nov 07 11:35:24 2006 -0800 -+++ b/drivers/infiniband/hw/ipath/ipath_intr.c Wed Nov 08 11:16:41 2006 -0800 +diff -r 69779e2890e3 drivers/infiniband/hw/ipath/ipath_intr.c +--- a/drivers/infiniband/hw/ipath/ipath_intr.c Wed Nov 08 14:17:04 2006 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_intr.c Wed Nov 08 14:17:04 2006 -0800 @@ -710,14 +710,14 @@ static void ipath_bad_intr(struct ipath_ * linuxbios development work, and it may happen in * the future again. 
@@ -304,9 +305,9 @@ diff -r bb12c8d85f7c drivers/infiniband/ } else if (allbits > 2) { if ((allbits % 10000) == 0) printk("."); -diff -r bb12c8d85f7c drivers/infiniband/hw/ipath/ipath_kernel.h ---- a/drivers/infiniband/hw/ipath/ipath_kernel.h Tue Nov 07 11:35:24 2006 -0800 -+++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Nov 08 12:07:03 2006 -0800 +diff -r 69779e2890e3 drivers/infiniband/hw/ipath/ipath_kernel.h +--- a/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Nov 08 14:17:04 2006 -0800 ++++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Nov 08 14:17:04 2006 -0800 @@ -213,6 +213,8 @@ struct ipath_devdata { void (*ipath_f_setextled)(struct ipath_devdata *, u64, u64); /* fill out chip-specific fields */ From bos at pathscale.com Wed Nov 8 13:21:26 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 08 Nov 2006 14:21:26 -0700 Subject: [openib-general] [PATCH] IB/ipath - program intconfig register using new HT irq hook Message-ID: <545156d49f883c43af70.1163024486@localhost.localdomain> Eric's changes to the htirq infrastructure require corresponding modifications to the ipath HT driver code so that interrupts are still delivered properly. Signed-off-by: Bryan O'Sullivan Cc: Eric W. 
Biederman Cc: Roland Dreier diff -r 69779e2890e3 -r 545156d49f88 drivers/infiniband/hw/ipath/ipath_driver.c --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 14:17:04 2006 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 14:19:27 2006 -0800 @@ -304,7 +304,7 @@ static int __devinit ipath_init_one(stru } addr = pci_resource_start(pdev, 0); len = pci_resource_len(pdev, 0); - ipath_cdbg(VERBOSE, "regbase (0) %llx len %d irq %x, vend %x/%x " + ipath_cdbg(VERBOSE, "regbase (0) %llx len %d pdev->irq %d, vend %x/%x " "driver_data %lx\n", addr, len, pdev->irq, ent->vendor, ent->device, ent->driver_data); @@ -467,15 +467,15 @@ static int __devinit ipath_init_one(stru * check 0 irq after we return from chip-specific bus setup, since * that can affect this due to setup */ - if (!pdev->irq) + if (!dd->ipath_irq) ipath_dev_err(dd, "irq is 0, BIOS error? Interrupts won't " "work\n"); else { - ret = request_irq(pdev->irq, ipath_intr, IRQF_SHARED, + ret = request_irq(dd->ipath_irq, ipath_intr, IRQF_SHARED, IPATH_DRV_NAME, dd); if (ret) { ipath_dev_err(dd, "Couldn't setup irq handler, " - "irq=%u: %d\n", pdev->irq, ret); + "irq=%d: %d\n", dd->ipath_irq, ret); goto bail_iounmap; } } @@ -637,11 +637,10 @@ static void __devexit ipath_remove_one(s * free up port 0 (kernel) rcvhdr, egr bufs, and eventually tid bufs * for all versions of the driver, if they were allocated */ - if (pdev->irq) { - ipath_cdbg(VERBOSE, - "unit %u free_irq of irq %x\n", - dd->ipath_unit, pdev->irq); - free_irq(pdev->irq, dd); + if (dd->ipath_irq) { + ipath_cdbg(VERBOSE, "unit %u free irq %d\n", + dd->ipath_unit, dd->ipath_irq); + dd->ipath_f_free_irq(dd); } else ipath_dbg("irq is 0, not doing free_irq " "for unit %u\n", dd->ipath_unit); diff -r 69779e2890e3 -r 545156d49f88 drivers/infiniband/hw/ipath/ipath_iba6110.c --- a/drivers/infiniband/hw/ipath/ipath_iba6110.c Wed Nov 08 14:17:04 2006 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c Wed Nov 08 14:19:27 2006 -0800 @@ -38,6 
+38,7 @@ #include #include +#include #include "ipath_kernel.h" #include "ipath_registers.h" @@ -913,49 +914,40 @@ static void slave_or_pri_blk(struct ipat } } -static int set_int_handler(struct ipath_devdata *dd, struct pci_dev *pdev, - int pos) -{ - u32 int_handler_addr_lower; - u32 int_handler_addr_upper; - u64 ihandler; - u32 intvec; - - /* use indirection register to get the intr handler */ - pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, 0x10); - pci_read_config_dword(pdev, pos + 4, &int_handler_addr_lower); - pci_write_config_byte(pdev, pos + HT_INTR_REG_INDEX, 0x11); - pci_read_config_dword(pdev, pos + 4, &int_handler_addr_upper); - - ihandler = (u64) int_handler_addr_lower | - ((u64) int_handler_addr_upper << 32); - - /* - * kernels with CONFIG_PCI_MSI set the vector in the irq field of - * struct pci_device, so we use that to program the internal - * interrupt register (not config space) with that value. The BIOS - * must still have done the basic MSI setup. - */ - intvec = pdev->irq; - /* - * clear any vector bits there; normally not set but we'll overload - * this for some debug purposes (setting the HTC debug register - * value from software, rather than GPIOs), so it might be set on a - * driver reload. 
- */ - ihandler &= ~0xff0000; - /* x86 vector goes in intrinfo[23:16] */ - ihandler |= intvec << 16; - ipath_cdbg(VERBOSE, "ihandler lower %x, upper %x, intvec %x, " - "interruptconfig %llx\n", int_handler_addr_lower, - int_handler_addr_upper, intvec, - (unsigned long long) ihandler); - - /* can't program yet, so save for interrupt setup */ - dd->ipath_intconfig = ihandler; - /* keep going, so we find link control stuff also */ - - return ihandler != 0; +static int ipath_ht_intconfig(struct ipath_devdata *dd) +{ + int ret; + + if (dd->ipath_intconfig) { + ipath_write_kreg(dd, dd->ipath_kregs->kr_interruptconfig, + dd->ipath_intconfig); /* interrupt address */ + ret = 0; + } else { + ipath_dev_err(dd, "No interrupts enabled, couldn't setup " + "interrupt address\n"); + ret = -EINVAL; + } + + return ret; +} + +static void ipath_ht_irq_update(struct pci_dev *dev, int irq, + struct ht_irq_msg *msg) +{ + struct ipath_devdata *dd = pci_get_drvdata(dev); + u64 prev_intconfig = dd->ipath_intconfig; + + dd->ipath_intconfig = msg->address_lo; + dd->ipath_intconfig |= ((u64) msg->address_hi) << 32; + + /* + * If the previous value of dd->ipath_intconfig is zero, we're + * getting configured for the first time, and must not program the + * intconfig register here (it will be programmed later, when the + * hardware is ready). Otherwise, we should. + */ + if (prev_intconfig) + ipath_ht_intconfig(dd); } /** @@ -971,12 +963,19 @@ static int ipath_setup_ht_config(struct static int ipath_setup_ht_config(struct ipath_devdata *dd, struct pci_dev *pdev) { - int pos, ret = 0; - int ihandler = 0; - - /* - * Read the capability info to find the interrupt info, and also - * handle clearing CRC errors in linkctrl register if necessary. 
We + int pos, ret; + + ret = __ht_create_irq(pdev, 0, ipath_ht_irq_update); + if (ret < 0) { + ipath_dev_err(dd, "Couldn't create interrupt handler: " + "err %d\n", ret); + goto bail; + } + dd->ipath_irq = ret; + ret = 0; + + /* + * Handle clearing CRC errors in linkctrl register if necessary. We * do this early, before we ever enable errors or hardware errors, * mostly to avoid causing the chip to enter freeze mode. */ @@ -1000,16 +999,8 @@ static int ipath_setup_ht_config(struct } if (!(cap_type & 0xE0)) slave_or_pri_blk(dd, pdev, pos, cap_type); - else if (cap_type == HT_INTR_DISC_CONFIG) - ihandler = set_int_handler(dd, pdev, pos); } while ((pos = pci_find_next_capability(pdev, pos, PCI_CAP_ID_HT))); - - if (!ihandler) { - ipath_dev_err(dd, "Couldn't find interrupt handler in " - "config space\n"); - ret = -ENODEV; - } bail: return ret; @@ -1360,25 +1351,6 @@ static void ipath_ht_quiet_serdes(struct ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val); } -static int ipath_ht_intconfig(struct ipath_devdata *dd) -{ - int ret; - - if (!dd->ipath_intconfig) { - ipath_dev_err(dd, "No interrupts enabled, couldn't setup " - "interrupt address\n"); - ret = 1; - goto bail; - } - - ipath_write_kreg(dd, dd->ipath_kregs->kr_interruptconfig, - dd->ipath_intconfig); /* interrupt address */ - ret = 0; - -bail: - return ret; -} - /** * ipath_pe_put_tid - write a TID in chip * @dd: the infinipath device @@ -1575,6 +1547,14 @@ static int ipath_ht_get_base_info(struct return 0; } +static void ipath_ht_free_irq(struct ipath_devdata *dd) +{ + free_irq(dd->ipath_irq, dd); + ht_destroy_irq(dd->ipath_irq); + dd->ipath_irq = 0; + dd->ipath_intconfig = 0; +} + /** * ipath_init_iba6110_funcs - set up the chip-specific function pointers * @dd: the infinipath device @@ -1598,6 +1578,7 @@ void ipath_init_iba6110_funcs(struct ipa dd->ipath_f_cleanup = ipath_setup_ht_cleanup; dd->ipath_f_setextled = ipath_setup_ht_setextled; dd->ipath_f_get_base_info = ipath_ht_get_base_info; + 
dd->ipath_f_free_irq = ipath_ht_free_irq; /* * initialize chip-specific variables diff -r 69779e2890e3 -r 545156d49f88 drivers/infiniband/hw/ipath/ipath_iba6120.c --- a/drivers/infiniband/hw/ipath/ipath_iba6120.c Wed Nov 08 14:17:04 2006 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c Wed Nov 08 14:19:27 2006 -0800 @@ -851,6 +851,7 @@ static int ipath_setup_pe_config(struct int pos, ret; dd->ipath_msi_lo = 0; /* used as a flag during reset processing */ + dd->ipath_irq = pdev->irq; ret = pci_enable_msi(dd->pcidev); if (ret) ipath_dev_err(dd, "pci_enable_msi failed: %d, " @@ -1323,6 +1324,12 @@ done: return 0; } +static void ipath_pe_free_irq(struct ipath_devdata *dd) +{ + free_irq(dd->ipath_irq, dd); + dd->ipath_irq = 0; +} + /** * ipath_init_iba6120_funcs - set up the chip-specific function pointers * @dd: the infinipath device @@ -1349,6 +1356,7 @@ void ipath_init_iba6120_funcs(struct ipa dd->ipath_f_cleanup = ipath_setup_pe_cleanup; dd->ipath_f_setextled = ipath_setup_pe_setextled; dd->ipath_f_get_base_info = ipath_pe_get_base_info; + dd->ipath_f_free_irq = ipath_pe_free_irq; /* initialize chip-specific variables */ dd->ipath_f_tidtemplate = ipath_pe_tidtemplate; diff -r 69779e2890e3 -r 545156d49f88 drivers/infiniband/hw/ipath/ipath_intr.c --- a/drivers/infiniband/hw/ipath/ipath_intr.c Wed Nov 08 14:17:04 2006 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_intr.c Wed Nov 08 14:19:27 2006 -0800 @@ -710,14 +710,14 @@ static void ipath_bad_intr(struct ipath_ * linuxbios development work, and it may happen in * the future again. 
*/ - if (dd->pcidev && dd->pcidev->irq) { + if (dd->pcidev && dd->ipath_irq) { ipath_dev_err(dd, "Now %u unexpected " "interrupts, unregistering " "interrupt handler\n", *unexpectp); - ipath_dbg("free_irq of irq %x\n", - dd->pcidev->irq); - free_irq(dd->pcidev->irq, dd); + ipath_dbg("free_irq of irq %d\n", + dd->ipath_irq); + dd->ipath_f_free_irq(dd); } } if (ipath_read_kreg32(dd, dd->ipath_kregs->kr_intmask)) { @@ -753,7 +753,7 @@ static void ipath_bad_regread(struct ipa if (allbits == 2) { ipath_dev_err(dd, "Still bad interrupt status, " "unregistering interrupt\n"); - free_irq(dd->pcidev->irq, dd); + dd->ipath_f_free_irq(dd); } else if (allbits > 2) { if ((allbits % 10000) == 0) printk("."); diff -r 69779e2890e3 -r 545156d49f88 drivers/infiniband/hw/ipath/ipath_kernel.h --- a/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Nov 08 14:17:04 2006 -0800 +++ b/drivers/infiniband/hw/ipath/ipath_kernel.h Wed Nov 08 14:19:27 2006 -0800 @@ -213,6 +213,8 @@ struct ipath_devdata { void (*ipath_f_setextled)(struct ipath_devdata *, u64, u64); /* fill out chip-specific fields */ int (*ipath_f_get_base_info)(struct ipath_portdata *, void *); + /* free irq */ + void (*ipath_f_free_irq)(struct ipath_devdata *); struct ipath_ibdev *verbs_dev; struct timer_list verbs_timer; /* total dwords sent (summed from counter) */ @@ -328,6 +330,8 @@ struct ipath_devdata { /* so we can rewrite it after a chip reset */ u32 ipath_pcibar1; + /* interrupt number */ + int ipath_irq; /* HT/PCI Vendor ID (here for NodeInfo) */ u16 ipath_vendorid; /* HT/PCI Device ID (here for NodeInfo) */ From rdreier at cisco.com Wed Nov 8 14:20:41 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Nov 2006 14:20:41 -0800 Subject: [openib-general] OFED 1.1 on Debian based system? 
In-Reply-To: <31128398.1163023618172.JavaMail.websites@opensubscriber> ( chris youb's message of "Thu, 9 Nov 2006 06:06:58 +0800") References: <31128398.1163023618172.JavaMail.websites@opensubscriber> Message-ID: > In order to at least run opensm I believe I need ib_core, ib_mthca, ib_mad, ib_umad and ib_uverbs. Has anyone gotten further than me with running opensm on a Debian system and has information to share? I run on Debian systems all the time. But I just use the standard Debian kernel, which has all the IB modules built by default. I'm not sure about the Ubuntu Xen dom0 kernel, but I know the default Ubuntu Edgy kernel (2.6.17) has all the IB modules enabled. Probably the simplest thing to do would be just to reconfigure your Xen dom0 kernel and rebuild it with the modules you need if it doesn't have IB enabled. - R. From bos at pathscale.com Wed Nov 8 14:23:27 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 08 Nov 2006 14:23:27 -0800 Subject: [openib-general] [PATCH] Fix patch In-Reply-To: <5011f2a1dbc9446539c1.1163024451@localhost.localdomain> References: <5011f2a1dbc9446539c1.1163024451@localhost.localdomain> Message-ID: <455258DF.6010807@pathscale.com> Bryan O'Sullivan wrote: > diff -r b610f87d01e2 -r 5011f2a1dbc9 ipath-htirq.patch Duh, fumbly fingers. Sorry; proper patch follows. References: <545156d49f883c43af70.1163024486@localhost.localdomain> Message-ID: <20061108144402.0b6a7b23.akpm@osdl.org> On Wed, 08 Nov 2006 14:21:26 -0700 "Bryan O'Sullivan" wrote: > Eric's changes to the htirq infrastructure require corresponding > modifications to the ipath HT driver code so that interrupts are still > delivered properly. > > Signed-off-by: Bryan O'Sullivan > Cc: Eric W. 
Biederman > Cc: Roland Dreier > > diff -r 69779e2890e3 -r 545156d49f88 drivers/infiniband/hw/ipath/ipath_driver.c > --- a/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 14:17:04 2006 -0800 > +++ b/drivers/infiniband/hw/ipath/ipath_driver.c Wed Nov 08 14:19:27 2006 -0800 so... Is this: htirq-refactor-so-we-only-have-one-function-that-writes-to-the-chip.patch htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers.patch htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers-update.patch ib-ipath-program-intconfig-register-using-new-ht-irq-hook.patch considered 2.6.19 material? From bos at pathscale.com Wed Nov 8 15:08:30 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Wed, 08 Nov 2006 15:08:30 -0800 Subject: [openib-general] [PATCH] IB/ipath - program intconfig register using new HT irq hook In-Reply-To: <20061108144402.0b6a7b23.akpm@osdl.org> References: <545156d49f883c43af70.1163024486@localhost.localdomain> <20061108144402.0b6a7b23.akpm@osdl.org> Message-ID: <4552636E.3090809@pathscale.com> Andrew Morton wrote: > so... Is this: > > htirq-refactor-so-we-only-have-one-function-that-writes-to-the-chip.patch > htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers.patch You should drop the above patch from Eric... > htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers-update.patch ...in favour of this one, which is my rework of Eric's patch. > ib-ipath-program-intconfig-register-using-new-ht-irq-hook.patch > > considered 2.6.19 material? Yes, please. I might be able to simplify the ib-ipath patch (by a matter of a few lines), but it works fine as it stands. (Hoang-Nam Nguyen's message of "Sun, 5 Nov 2006 21:42:56 +0100") References: <200611052142.56722.hnguyen@de.ibm.com> Message-ID: Thanks, I've applied 1 through 4. From rdreier at cisco.com Wed Nov 8 15:16:44 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Nov 2006 15:16:44 -0800 Subject: [openib-general] OFED 1.1 on Debian based system? 
In-Reply-To: (Roland Dreier's message of "Wed, 08 Nov 2006 14:20:41 -0800")
References: <31128398.1163023618172.JavaMail.websites@opensubscriber>
Message-ID:

> Probably the simplest thing to do would be just to reconfigure your
> Xen dom0 kernel and rebuild it with the modules you need if it doesn't
> have IB enabled.

I just looked at my Xen hg tree checkout, and I see that Xen builds IB drivers for its kernel too. Does the Ubuntu Xen kernel really turn off the IB modules?

Hmm, no it does include them. I just checked:

$ dpkg -L xen-image-xen0-2.6.16-11.2-generic|grep infiniband
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_cm.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_core.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_mad.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_sa.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_ucm.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_umad.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/core/ib_uverbs.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/hw
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/hw/mthca
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/hw/mthca/ib_mthca.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/ulp
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/ulp/ipoib
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/ulp/srp
/lib/modules/2.6.16-11.2-generic/kernel/drivers/infiniband/ulp/srp/ib_srp.ko

So I would recommend just using what Ubuntu packaged for you, no need to mess around with the OFED kernel module build.

BTW, any reason not to use Ubuntu's 2.6.17 Xen kernel?
I honestly don't know much about Ubuntu's Xen support so maybe the 2.6.16 kernel really is better. - R. From akpm at osdl.org Wed Nov 8 15:18:21 2006 From: akpm at osdl.org (Andrew Morton) Date: Wed, 8 Nov 2006 15:18:21 -0800 Subject: [openib-general] [PATCH] IB/ipath - program intconfig register using new HT irq hook In-Reply-To: <4552636E.3090809@pathscale.com> References: <545156d49f883c43af70.1163024486@localhost.localdomain> <20061108144402.0b6a7b23.akpm@osdl.org> <4552636E.3090809@pathscale.com> Message-ID: <20061108151821.5a55fecd.akpm@osdl.org> On Wed, 08 Nov 2006 15:08:30 -0800 "Bryan O'Sullivan" wrote: > Andrew Morton wrote: > > > so... Is this: > > > > htirq-refactor-so-we-only-have-one-function-that-writes-to-the-chip.patch > > htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers.patch > > You should drop the above patch from Eric... > > > htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers-update.patch > > ...in favour of this one, which is my rework of Eric's patch. > > > ib-ipath-program-intconfig-register-using-new-ht-irq-hook.patch If you look, you'll see that the htirq-allow-buggy-drivers-of-buggy-hardware-to-write-the-registers-update.patch which I merged is the diff between Eric's patch and yours (ie: the diff which you should have sent ;)) > > considered 2.6.19 material? > > Yes, please. I might be able to simplify the ib-ipath patch (by a > matter of a few lines), but it works fine as it stands. > ho hum, OK. From ak at suse.de Tue Nov 7 23:39:44 2006 From: ak at suse.de (Andi Kleen) Date: Wed, 8 Nov 2006 08:39:44 +0100 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <20061107171143.GU27140@parisc-linux.org> References: <20061107171143.GU27140@parisc-linux.org> Message-ID: <200611080839.46670.ak@suse.de> On Tuesday 07 November 2006 18:11, Matthew Wilcox wrote: > On Wed, Nov 08, 2006 at 12:57:03AM +0800, Jeff Chua wrote: > > 2) this fails ... 
> >
> > e0000000-efffffff : 0000:00:02.0
> > f0000000-ffffffff : PCI MMCONFIG 0
> > fed00000-fed003ff : HPET 0
>
> Heh, no kidding ...
>
> num_buses = pci_mmcfg_config[i].end_bus_number -
> pci_mmcfg_config[i].start_bus_number + 1;
> res->start = pci_mmcfg_config[i].base_address;
> res->end = res->start + (num_buses << 20) - 1;
> res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> insert_resource(&iomem_resource, res);
>
> So if we have 256 busses assigned, then we request 256MB and, well,
> there's no room for anyone else. This code was added by Andi in commit
> de09bddb9d6f96785be470c832b881e6d72d589f
>
> Hopefully he'll have a good idea how to restrict it. Given your "working"
> resource map, it seems like it should be limited to 16MB (and thus 16
> busses). But how to figure that out?

ACPI knows the number of busses.

Just need to get the information there, which is an ordering issue (normally MCFG initialization is before this is known I think)

Len, ACPI folks, any ideas how to fix this cleanly?

-Andi

From matthew at wil.cx Wed Nov 8 04:22:37 2006
From: matthew at wil.cx (Matthew Wilcox)
Date: Wed, 8 Nov 2006 05:22:37 -0700
Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)
In-Reply-To: <200611080839.46670.ak@suse.de>
References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de>
Message-ID: <20061108122237.GF27140@parisc-linux.org>

On Wed, Nov 08, 2006 at 08:39:44AM +0100, Andi Kleen wrote:
> ACPI knows the number of busses.

But what if the number of busses increases later, eg by hotplugging a card with a PCI-PCI bridge on it? Or does it know the number of busses which can be supported by this machine's MMCONFIG region? If so, why isn't this information reported in the MCFG table properly instead of claiming to support 0-255?
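
For readers following the arithmetic in the quoted snippet, here is a rough standalone sketch (userspace C, not kernel code). The helper names `mmcfg_window_size` and `mmcfg_max_buses` are hypothetical and exist only for illustration. It relies on each PCI bus taking 1MB of MMCONFIG space (32 devices x 8 functions x 4KB each), which is where the `num_buses << 20` comes from.

```c
#include <assert.h>
#include <stdint.h>

/* 32 devices x 8 functions x 4KB of config space = 1MB (1 << 20) per bus */
#define MMCFG_BYTES_PER_BUS (UINT64_C(1) << 20)

/*
 * Hypothetical helper: bytes claimed for an inclusive bus range,
 * mirroring the (num_buses << 20) computation quoted above.
 */
static uint64_t mmcfg_window_size(unsigned int start_bus, unsigned int end_bus)
{
	assert(end_bus >= start_bus && end_bus <= 255);
	return (uint64_t)(end_bus - start_bus + 1) * MMCFG_BYTES_PER_BUS;
}

/* Hypothetical helper: how many busses fit in a window of the given size. */
static unsigned int mmcfg_max_buses(uint64_t window_bytes)
{
	return (unsigned int)(window_bytes / MMCFG_BYTES_PER_BUS);
}
```

With these definitions, a 0-255 bus range gives `mmcfg_window_size(0, 255) == 0x10000000` (256MB), while a 16MB window gives `mmcfg_max_buses(16 << 20) == 16`, matching the numbers discussed in the thread.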
> Just need to get the information there, which is a ordering issue > (normally MCFG initialization is before this is known I think) > > Len, ACPI folks, any ideas how to fix this cleanly? > > -Andi From ak at suse.de Wed Nov 8 07:14:01 2006 From: ak at suse.de (Andi Kleen) Date: Wed, 8 Nov 2006 16:14:01 +0100 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <8f95bb250611070950m3dc45674gbd370e3173b6168d@mail.gmail.com> References: <20061107171143.GU27140@parisc-linux.org> <8f95bb250611070950m3dc45674gbd370e3173b6168d@mail.gmail.com> Message-ID: <200611081614.03735.ak@suse.de> > I can patch up the pci_mmcfg_insert_resource to verify if > the region that is exported by ACPI is reserved in e820 and printk an > error message if it is not and skip the resource insertion. It probably should get its information from pci_mcfg_init() and only reserve what is used there instead of adding duplicate e820 checking code somewhere else. Or perhaps only reserve when the bus is discovered? -Andi From torvalds at osdl.org Wed Nov 8 08:05:18 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 8 Nov 2006 08:05:18 -0800 (PST) Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <20061108122237.GF27140@parisc-linux.org> References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de> <20061108122237.GF27140@parisc-linux.org> Message-ID: On Wed, 8 Nov 2006, Matthew Wilcox wrote: > > On Wed, Nov 08, 2006 at 08:39:44AM +0100, Andi Kleen wrote: > > ACPI knows the number of busses. > > But what if the number of busses increases later, eg by hotplugging > a card with a PCI-PCI bridge on it? Or does it know the number of > busses which can be supported by this machine's MMCONFIG region? ACPI will give the maximum number. 
However, in this case, the correct thing to do (always _has_ been) is to not use ACPI for _anything_, but just read the base and the size of the MMCONFIG region from the hardware itself. Anyway, I do not consider this a regression. MMCONFIG has _never_ worked reliably. It has always been a case of "we can make it work on some machines by making it break on others". Linus From ebiederm at xmission.com Wed Nov 8 09:38:27 2006 From: ebiederm at xmission.com (ebiederm at xmission.com) Date: Wed, 08 Nov 2006 10:38:27 -0700 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: (Linus Torvalds's message of "Wed, 8 Nov 2006 08:05:18 -0800 (PST)") References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de> <20061108122237.GF27140@parisc-linux.org> Message-ID: Linus Torvalds writes: > On Wed, 8 Nov 2006, Matthew Wilcox wrote: >> >> On Wed, Nov 08, 2006 at 08:39:44AM +0100, Andi Kleen wrote: >> > ACPI knows the number of busses. >> >> But what if the number of busses increases later, eg by hotplugging >> a card with a PCI-PCI bridge on it? Or does it know the number of >> busses which can be supported by this machine's MMCONFIG region? > > ACPI will give the maximum number. > > However, in this case, the correct thing to do (always _has_ been) is to > not use ACPI for _anything_, but just read the base and the size of the > MMCONFIG region from the hardware itself. > > Anyway, I do not consider this a regression. MMCONFIG has _never_ worked > reliably. It has always been a case of "we can make it work on some > machines by making it break on others". The implementations I have seen, I believe have all been on bridges and the maximum size is actually generated from the bus number below the bridge. 
Eric From torvalds at osdl.org Wed Nov 8 10:52:05 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 8 Nov 2006 10:52:05 -0800 (PST) Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de> <20061108122237.GF27140@parisc-linux.org> Message-ID: On Wed, 8 Nov 2006, Eric W. Biederman wrote: > > The implementations I have seen, I believe have all been on bridges and > the maximum size is actually generated from the bus number below the bridge. Hmm. It might be possible to first set up the MMCONFIG thing for the minimum range, then read the bus numbers from the host bridge on that bus, and then expand the mmconfig range if necessary. Because pretty much ANYTHING is better than trusting the BIOS tables. That said, I'd really be a _lot_ more confident about it if we were to be able to read the values from the hardware itself some way. There's obviously a chicken-and-egg issue on mmcfg configuration, but it's one that the BIOS startup code also has, so I assume that there is a solution to that somewhere. Linus From adurbin at google.com Wed Nov 8 11:10:13 2006 From: adurbin at google.com (Aaron Durbin) Date: Wed, 8 Nov 2006 11:10:13 -0800 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de> <20061108122237.GF27140@parisc-linux.org> Message-ID: <8f95bb250611081110x4260f95di6620e9ba7a00e58f@mail.gmail.com> On 11/8/06, Linus Torvalds wrote: > > > On Wed, 8 Nov 2006, Eric W. Biederman wrote: > > > > The implementations I have seen, I believe have all been on bridges and > > the maximum size is actually generated from the bus number below the bridge. > > Hmm. 
It might be possible to first set up the MMCONFIG thing for the
> minimum range, then read the bus numbers from the host bridge on that bus,
> and then expand the mmconfig range if necessary.
>
> Because pretty much ANYTHING is better than trusting the BIOS tables.
>
> That said, I'd really be a _lot_ more confident about it if we were to be
> able to read the values from the hardware itself some way. There's
> obviously a chicken-and-egg issue on mmcfg configuration, but it's one
> that the BIOS startup code also has, so I assume that there is a solution
> to that somewhere.
>

I agree that the original patch was stupid in relying on the MCFG table reported in ACPI; however, like you said, without the actual knowledge of the MCFG region being pulled out of the hardware even the e820 check is not valid. It is close, but not entirely correct.

For instance, if the MCFG region is being reported in ACPI land as 256 buses and the e820 has a reservation at the MCFG base address of 18MB, that does not necessarily mean the MCFG region allows for PCI config access on 18 buses. It could be that it only allows 16 buses w/ another piece of hardware on that last 2MB.

So what is the proper scenario? One needs to know the actual upper limit of the MCFG region. Otherwise when detecting unreachable devices one could be poking something else in the process of trying to discover these unreachable devices.

I am open to ideas and am willing to rework some of the code, but I do like the idea of having the region being reported in the resource table.
-Aaron From torvalds at osdl.org Wed Nov 8 11:25:40 2006 From: torvalds at osdl.org (Linus Torvalds) Date: Wed, 8 Nov 2006 11:25:40 -0800 (PST) Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: <8f95bb250611081110x4260f95di6620e9ba7a00e58f@mail.gmail.com> References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de> <20061108122237.GF27140@parisc-linux.org> <8f95bb250611081110x4260f95di6620e9ba7a00e58f@mail.gmail.com> Message-ID: On Wed, 8 Nov 2006, Aaron Durbin wrote: > > For instance, if the MCFG region is being reported in ACPI land as 256 > buses and the e820 has a reservation at the MCFG base address of 18MB > that does not necessarily mean the MCFG region allows for PCI config > access on 18 buses. It could be that it only allows 16 buses w/ another > piece of hardware on that last 2MB. Oh, I agree. You'd _hope_ that the BIOS reports that as a separate region, and we could use that as a hint, but it's never going to be fool-proof. It's just much much better to try to figure out what the hardware itself thinks it is doing, rather than relying on a firmware engineer filling out the table to match what he _thinks_ the hardware is doing (or, more accurately, randomly scribbling values until Windows boots, at which point it's not his problem any more, and people ship the crap). Some misguided people used to think that we shouldn't do our own PCI probing, but use ACPI instead. This is the same thing, except on a smaller scale. MAYBE the scale ends up being so small that we can figure out some reliable way without actually asking the hardware itself, but I kind of doubt it. Especially judging by the current situation. > So what is the proper scenario? One needs to know the actual upper limit of > MCFG region. Otherwise when detecting unreachable devices one could be poking > something else in the process of trying to discover these unreachable devices. 
> I am open to ideas and am willing to rework some of the code, but I do like
> the idea of having the region being reported in the resource table.

Absolutely. I'd _love_ to have the region reported in the resource table. It's just that right now it doesn't seem practical, since the downsides are bigger than the upsides (and the upsides aren't _that_ big, since we require the thing to be marked reserved in the e820 tables anyway, so the resource tables do know about it, about as well as they currently can).

In the absence of a way to actually ask the hardware, we could perhaps modify the thing so that it does request the regions in the resource table, but _only_ if the e820 entries aren't there (ie the "config type 1 didn't even work" case).

Alternatively, we might choose to request just the known smallest region, because that should be relatively "safer". It's better than not reporting the regions at all, and while it's not perfect, it at least shouldn't have huge potential downsides from getting the size totally wrong...

Hmm?

Linus

From ebiederm at xmission.com Wed Nov 8 11:24:54 2006
From: ebiederm at xmission.com (ebiederm at xmission.com)
Date: Wed, 08 Nov 2006 12:24:54 -0700
Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3)
In-Reply-To: (Linus Torvalds's message of "Wed, 8 Nov 2006 10:52:05 -0800 (PST)")
References: <20061107171143.GU27140@parisc-linux.org> <200611080839.46670.ak@suse.de> <20061108122237.GF27140@parisc-linux.org>
Message-ID:

Linus Torvalds writes:

> On Wed, 8 Nov 2006, Eric W. Biederman wrote:
>>
>> The implementations I have seen, I believe have all been on bridges and
>> the maximum size is actually generated from the bus number below the bridge.
>
> Hmm. It might be possible to first set up the MMCONFIG thing for the
> minimum range, then read the bus numbers from the host bridge on that bus,
> and then expand the mmconfig range if necessary.
>
> Because pretty much ANYTHING is better than trusting the BIOS tables.
>
> That said, I'd really be a _lot_ more confident about it if we were to be
> able to read the values from the hardware itself some way. There's
> obviously a chicken-and-egg issue on mmcfg configuration, but it's one
> that the BIOS startup code also has, so I assume that there is a solution
> to that somewhere.

cfc and cf8 still work on x86. So you can start with the old path and then when you know mmconfig works you can upgrade.

In fact mmconfig doesn't necessarily allow access to the entire pci domain. On AMD systems currently you will get all of the subordinate busses but the cpus themselves will not show up in the mmconfig space. So we should have the infrastructure to only use mmconfig for some set of busses. If that interface is well described we can probably bootstrap sanely, only enabling what we know exists and likewise only reserving what we know is used.

For chipsets I know that there is quite a bit of information publicly available. For Intel chipsets I believe those are registers they make available in their public docs. For things like the Nvidia chipset the knowledge should be in the publicly available linuxbios code base.

Hopefully that is enough of a pointer to get people going. I might have enough time to write the patch but I don't have enough time to maintain it until mmconfig becomes boring.

Eric

From krkumar2 at in.ibm.com Wed Nov 8 20:00:34 2006
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Thu, 09 Nov 2006 09:30:34 +0530
Subject: [openib-general] [PATCH] RDMA/iwcm: Memory corruption bug in cm_work_handler
Message-ID: <20061109040034.7062.78130.sendpatchset@localhost.localdomain>

Possible memory corruption scenario: after putting the work entry back on the work_free_list, we call process_event() which dereferences work->event, which could have been modified to another value meanwhile.

Patches against 2.6.19-rc4 bits.
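
The ordering problem described in this patch can be illustrated with a minimal userspace sketch, assuming a much simplified free list. `struct work`, `reuse_free_entry` and `handle_one` are hypothetical stand-ins, and the `put_work` below only mimics the idea of the kernel helper of the same name; none of this is the iwcm code itself. The point is that the event is copied to a local variable while the entry is still owned, so recycling the entry afterwards cannot change what gets processed.

```c
#include <assert.h>
#include <stddef.h>

struct event {
	int type;
	int status;
};

struct work {
	struct event event;
	struct work *next_free;
};

static struct work *free_list;

/*
 * Return an entry to the free list. After this call another producer
 * may reuse the entry and overwrite work->event at any time.
 */
static void put_work(struct work *work)
{
	assert(work != NULL);
	work->next_free = free_list;
	free_list = work;
}

/* Simulates a producer grabbing the free entry and filling in a new event. */
static void reuse_free_entry(int type)
{
	struct work *work = free_list;

	if (!work)
		return;
	free_list = work->next_free;
	work->event.type = type;
}

/* Hypothetical handler: snapshot the event, release the entry, then process. */
static int handle_one(struct work *work)
{
	struct event levent = work->event;	/* copy while we still own it */

	put_work(work);
	reuse_free_entry(99);			/* entry recycled behind our back */
	return levent.type;			/* still the event we dequeued */
}
```

Reading `work->event` after `put_work()` instead of `levent` would observe the recycled value (99 in this sketch) rather than the event that was originally dequeued.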
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -830,7 +830,8 @@ static int process_event(struct iwcm_id_ */ static void cm_work_handler(void *arg) { - struct iwcm_work *work = arg, lwork; + struct iwcm_work *work = arg; + struct iw_cm_event levent; struct iwcm_id_private *cm_id_priv = work->cm_id; unsigned long flags; int empty; @@ -843,11 +844,11 @@ static void cm_work_handler(void *arg) struct iwcm_work, list); list_del_init(&work->list); empty = list_empty(&cm_id_priv->work_list); - lwork = *work; + levent = work->event; put_work(work); spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = process_event(cm_id_priv, &work->event); + ret = process_event(cm_id_priv, &levent); if (ret) { set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); destroy_cm_id(&cm_id_priv->id); From krkumar2 at in.ibm.com Wed Nov 8 20:00:41 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 09 Nov 2006 09:30:41 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Fix memory leak Message-ID: <20061109040041.7062.89283.sendpatchset@localhost.localdomain> If we get IW_CM_EVENT_CONNECT_REQUEST message and encounter an error (not in the LISTEN state, cannot create an id, cannot alloc work_entry, etc), then the memory allocated by cm_event_handler() in the event->private_data gets leaked. Since cm_work_handler has already put the event on the work_free_list, this allocated memory is leaked. High backlog value can allow DoS attacks. 
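
The leak described above comes from early returns that skip the cleanup at the end of the handler. A minimal userspace sketch of the single-exit pattern follows; `struct conn_event`, `handle_conn_req` and the `freed_buffers` counter are hypothetical names used only for this illustration, with `malloc`/`free` standing in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

struct conn_event {
	void *private_data;
	size_t private_data_len;
};

static int freed_buffers;	/* counts frees, for demonstration only */

/*
 * Hypothetical handler: returns 0 on success, -1 on any early failure.
 * Either way the private data copy is released exactly once, because
 * every path leaves through the "out" label.
 */
static int handle_conn_req(struct conn_event *ev, int listening, int id_ok)
{
	int ret = -1;

	assert(ev != NULL);
	if (!listening)
		goto out;	/* a bare return here would leak the buffer */
	if (!id_ok)
		goto out;
	ret = 0;		/* ... deliver the event to the client ... */
out:
	if (ev->private_data_len) {
		free(ev->private_data);
		ev->private_data = NULL;
		ev->private_data_len = 0;
		freed_buffers++;
	}
	return ret;
}
```

Both the failing and the successful path end up at the same cleanup code, so the buffer is freed whether or not the request is accepted.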
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -620,7 +620,7 @@ static void cm_conn_req_handler(struct i spin_lock_irqsave(&listen_id_priv->lock, flags); if (listen_id_priv->state != IW_CM_STATE_LISTEN) { spin_unlock_irqrestore(&listen_id_priv->lock, flags); - return; + goto out; } spin_unlock_irqrestore(&listen_id_priv->lock, flags); @@ -629,7 +629,7 @@ static void cm_conn_req_handler(struct i listen_id_priv->id.context); /* If the cm_id could not be created, ignore the request */ if (IS_ERR(cm_id)) - return; + goto out; cm_id->provider_data = iw_event->provider_data; cm_id->local_addr = iw_event->local_addr; @@ -642,7 +642,7 @@ static void cm_conn_req_handler(struct i if (ret) { iw_cm_reject(cm_id, NULL, 0); iw_destroy_cm_id(cm_id); - return; + goto out; } /* Call the client CM handler */ @@ -654,6 +654,7 @@ static void cm_conn_req_handler(struct i kfree(cm_id); } +out: if (iw_event->private_data_len) kfree(iw_event->private_data); } From krkumar2 at in.ibm.com Wed Nov 8 20:00:37 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 09 Nov 2006 09:30:37 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() Message-ID: <20061109040037.7062.26245.sendpatchset@localhost.localdomain> Get rid of extra call to list_empty(), and unnecessary variable. Has the side effect of sometimes resulting in faster processing of new events (like handling new connections, eg when cm_work_handler was processing the last entry) added to this list instead of cm_work_handler function exiting and re-entering when a new queue_work() is done. 
Doing the redundant queue_work() (if cm_work_handler is already running and processing the last entry) will not result in another call to cm_work_handler (run_workqueue) where no entry is found, since cm_work_handler will remove all entries from the list, even ones that are added late. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -834,22 +834,17 @@ static void cm_work_handler(void *arg) struct iw_cm_event levent; struct iwcm_id_private *cm_id_priv = work->cm_id; unsigned long flags; - int empty; - int ret = 0; spin_lock_irqsave(&cm_id_priv->lock, flags); - empty = list_empty(&cm_id_priv->work_list); - while (!empty) { + while (!list_empty(&cm_id_priv->work_list)) { work = list_entry(cm_id_priv->work_list.next, struct iwcm_work, list); list_del_init(&work->list); - empty = list_empty(&cm_id_priv->work_list); levent = work->event; put_work(work); spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = process_event(cm_id_priv, &levent); - if (ret) { + if (process_event(cm_id_priv, &levent)) { set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); destroy_cm_id(&cm_id_priv->id); } From krkumar2 at in.ibm.com Wed Nov 8 20:00:48 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 09 Nov 2006 09:30:48 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Rewrite comment for iwcm_deref_id() to match code. Message-ID: <20061109040048.7062.22004.sendpatchset@localhost.localdomain> In iwcm_deref_id(), the comment says : "If the last reference is being removed and iw_destroy_cm_id is waiting, wake up the waiting thread". 
The second part of the comment "and iw_destroy_cm_id is waiting" is wrong, since this function either wakes the waiter already waiting in iwcm_deref_id, or enables it (so that when wait_for_completion() is performed later, it will immediately return). Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -148,8 +148,9 @@ static int copy_private_data(struct iw_c } /* - * Release a reference on cm_id. If the last reference is being removed - * and iw_destroy_cm_id is waiting, wake up the waiting thread. + * Release a reference on cm_id. If the last reference is being + * released, enable the waiting thread (in iw_destroy_cm_id) to + * get woken up, and return 1 if a thread is already waiting. */ static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) { From krkumar2 at in.ibm.com Wed Nov 8 20:00:45 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 09 Nov 2006 09:30:45 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Remove unnecessary function argument. Message-ID: <20061109040045.7062.17865.sendpatchset@localhost.localdomain> Remove unnecessary function argument, and change text to reflect the code. Fix couple of typos. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -80,7 +80,7 @@ struct iwcm_work { * 1) in the event upcall, cm_event_handler(), for a listening cm_id. If * the backlog is exceeded, then no more connection request events will * be processed. cm_event_handler() returns -ENOMEM in this case. Its up - * to the provider to reject the connectino request. + * to the provider to reject the connection request. 
* 2) in the connection request workqueue handler, cm_conn_req_handler(). * If work elements cannot be allocated for the new connect request cm_id, * then IWCM will call the provider reject method. This is ok since @@ -131,12 +131,11 @@ static int alloc_work_entries(struct iwc } /* - * Save private data from incoming connection requests in the - * cm_id_priv so the low level driver doesn't have to. Adjust + * Save private data from incoming connection requests to + * iw_cm_event, so the low level driver doesn't have to. Adjust * the event ptr to point to the local copy. */ -static int copy_private_data(struct iwcm_id_private *cm_id_priv, - struct iw_cm_event *event) +static int copy_private_data(struct iw_cm_event *event) { void *p; @@ -243,7 +242,7 @@ static int iwcm_modify_qp_sqd(struct ib_ /* * CM_ID <-- CLOSING * - * Block if a passive or active connection is currenlty being processed. Then + * Block if a passive or active connection is currently being processed. Then * process the event as follows: * - If we are ESTABLISHED, move to CLOSING and modify the QP state * based on the abrupt flag @@ -903,7 +902,7 @@ static int cm_event_handler(struct iw_cm if ((work->event.event == IW_CM_EVENT_CONNECT_REQUEST || work->event.event == IW_CM_EVENT_CONNECT_REPLY) && work->event.private_data_len) { - ret = copy_private_data(cm_id_priv, &work->event); + ret = copy_private_data(&work->event); if (ret) { put_work(work); goto out; From krkumar2 at in.ibm.com Wed Nov 8 20:00:43 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Thu, 09 Nov 2006 09:30:43 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Remove un-required initializations. Message-ID: <20061109040043.7062.97875.sendpatchset@localhost.localdomain> Remove un-required initializations. 
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -408,7 +408,7 @@ int iw_cm_listen(struct iw_cm_id *cm_id, { struct iwcm_id_private *cm_id_priv; unsigned long flags; - int ret = 0; + int ret; cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); @@ -535,7 +535,7 @@ EXPORT_SYMBOL(iw_cm_accept); int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) { struct iwcm_id_private *cm_id_priv; - int ret = 0; + int ret; unsigned long flags; struct ib_qp *qp; @@ -675,7 +675,7 @@ static int cm_conn_est_handler(struct iw struct iw_cm_event *iw_event) { unsigned long flags; - int ret = 0; + int ret; spin_lock_irqsave(&cm_id_priv->lock, flags); @@ -705,7 +705,7 @@ static int cm_conn_rep_handler(struct iw struct iw_cm_event *iw_event) { unsigned long flags; - int ret = 0; + int ret; spin_lock_irqsave(&cm_id_priv->lock, flags); /* From bboas at systemfabricworks.com Wed Nov 8 20:09:00 2006 From: bboas at systemfabricworks.com (Bill Boas) Date: Wed, 8 Nov 2006 23:09:00 -0500 Subject: [openib-general] Reminder - OpenFabrics Developer Summit at SC06 - register now if you haven't already. Message-ID: <20061109041037.DKXM16607.rrcs-fep-10.hrndva.rr.com@telerio44fea95> Another reminder to those who have not registered yet: please do so now, as we are running out of room. We will have to cut off registration soon - so do it now or never!
To attend you must register at: http://www.acteva.com/booking.cfm?bevaID=120521
To see the agenda go to: http://openib.org/conference/nov2006sc/ofa_dev_agenda.pdf

Here's who has registered so far:

Customer Name          Customer Email
Jeff Squyres           jsquyres at cisco.com
Robert Pearson         rpearson at systemfabricworks.com
Jim Ryan               jim.ryan at intel.com
Tziporet Koren         tziporet at mellanox.co.il
David Fellinger        dfellinger at datadirectnet.com
Matt Leininger         mlleini at sandia.gov
Patrick Mullaney       pmullaney at novell.com
Eitan Zahavi           eitan at mellanox.co.il
Hal Rosenstock         hnrose at earthlink.net
Dhabaleswar Panda      panda at cse.ohio-state.edu
Johann George          johann.george at qlogic.com
Amit Krig              amitk at mellanox.co.il
James Lentini          jlentini at netapp.com
Makia Minich           minich at ornl.gov
Sonia Pignorel         soniapi at microsoft.com
Aviram Gutman          aviram at mellanox.co.il
Nimrod Gindi           nimrodg at mellanox.co.il
Asaf Somekh            asafs at voltaire.com
Sayantan Sur           surs at cse.ohio-state.edu
Matthew Koop           koop at cse.ohio-state.edu
William Boas           bboas at systemfabricworks.com
Sujal Das              sujal at mellanox.com
Sundeep Narravula      narravul at cse.ohio-state.edu
Lei Chai               chail at cse.ohio-state.edu
Sean Hefty             sean.hefty at intel.com
Matthew Koop           koop at cse.ohio-state.edu
Sayantan Sur           surs at cse.ohio-state.edu
Dhabaleswar Panda      panda at cse.ohio-state.edu
Raj Channa             raj.channa at credit-suisse.com
Head Bubba             Head.Bubba at credit.suisse.com
James Lentini          jlentini at netapp.com
Thad Omura             thad at mellanox.com
John Hagerman          jhagerman at neteffect.com
Chet Mehta             chetm at us.ibm.com
Or Gerlitz             ogerlitz at voltaire.com
Jeff Broughton         jeff.broughton at qlogic.com
Bob Woodruff           woody at jf.intel.com
Jamie Riotto           jriotto at cisco.com
Changqing Tang         ctang at rsn.hp.com
Betsy Zeller           betsy.zeller at qlogic.com
leonid grossman        leonid at neterion.com
Arkady Kanevsky        arkady at netapp.com
Richard Frank          richard.frank at oracle.com
Gilad Shainer          shainer at mellanox.com
Cheng Tang             ctang at xsigo.com
Madhu Lakshmanan       mlakshmanan at silverstorm.com
Glenn Grundstrom       ggrundstrom at neteffect.com
Asgeir Eiriksson       asgeir at chelsio.com
Helen Chen             hycsw at sandia.gov
David Cohen            d_cohen at ml.com
Doug Ledford           dledford at redhat.com
Jeremy Brown           jeremy.brown at qlogic.com

Thank you. Bill.

Bill Boas VP, Business Development | System Fabric Works bboas at systemfabricworks.com | 510-375-8840

From ebiederm at xmission.com Wed Nov 8 21:29:14 2006 From: ebiederm at xmission.com (ebiederm at xmission.com) Date: Wed, 08 Nov 2006 22:29:14 -0700 Subject: [openib-general] [PATCH] IB/ipath - program intconfig register using new HT irq hook In-Reply-To: <20061108151821.5a55fecd.akpm@osdl.org> (Andrew Morton's message of "Wed, 8 Nov 2006 15:18:21 -0800") References: <545156d49f883c43af70.1163024486@localhost.localdomain> <20061108144402.0b6a7b23.akpm@osdl.org> <4552636E.3090809@pathscale.com> <20061108151821.5a55fecd.akpm@osdl.org> Message-ID: Andrew Morton writes: > On Wed, 08 Nov 2006 15:08:30 -0800 > "Bryan O'Sullivan" wrote: > >> Andrew Morton wrote: > >> > considered 2.6.19 material? >> >> Yes, please. I might be able to simplify the ib-ipath patch (by a >> matter of a few lines), but it works fine as it stands. >> > > ho hum, OK. There is only one driver in the kernel that currently uses the htirq infrastructure so the chance of actually breaking something else is exactly 0. :) Thanks for collecting the patches up Andrew. I know of a couple out of tree drivers but I don't think their hardware has escaped the lab yet. Eric From michael at ellerman.id.au Wed Nov 8 22:40:42 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:42 +1100 Subject: [openib-general] [PATCH 0/6] Add pci_find_ht_capability() and fixup HT_CAPTYPE #defines Message-ID: <1163054442.964430.47244403003.qpush@cradle> OK, how's this look to people?
It encapsulates the logic to find a HT capability, and adds logic to cope with the fact that the HT_CAPTYPE fields actually sit in 3 or 5 bits of the top byte in the capability. Built for Powerpc and Intel, booted on Powerpc. cheers From michael at ellerman.id.au Wed Nov 8 22:40:43 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:43 +1100 Subject: [openib-general] [PATCH 2/6] Use pci_find_ht_capability() in drivers/pci/htirq.c In-Reply-To: <1163054442.964430.47244403003.qpush@cradle> Message-ID: <20061109064046.C67D467C78@ozlabs.org> Use pci_find_ht_capability() in drivers/pci/htirq.c Signed-off-by: Michael Ellerman --- drivers/pci/htirq.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) Index: msi/drivers/pci/htirq.c =================================================================== --- msi.orig/drivers/pci/htirq.c +++ msi/drivers/pci/htirq.c @@ -124,14 +124,7 @@ int ht_create_irq(struct pci_dev *dev, i int pos; int irq; - pos = pci_find_capability(dev, PCI_CAP_ID_HT); - while (pos) { - u8 subtype; - pci_read_config_byte(dev, pos + 3, &subtype); - if (subtype == HT_CAPTYPE_IRQ) - break; - pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); - } + pos = pci_find_ht_capability(dev, HT_CAPTYPE_IRQ); if (!pos) return -EINVAL; From michael at ellerman.id.au Wed Nov 8 22:40:43 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:43 +1100 Subject: [openib-general] [PATCH 1/6] Add pci_find_ht_capability() for finding Hypertransport capabilities In-Reply-To: <1163054442.964430.47244403003.qpush@cradle> Message-ID: <20061109064046.5726767C76@ozlabs.org> There are already several places in the kernel that want to search a PCI device for a given Hypertransport capability. Although this is possible using pci_find_capability() etc., it makes sense to encapsulate that logic in a helper - pci_find_ht_capability(). 
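As a rough illustration of the matching rule such a helper encapsulates, here is a user-space sketch that walks a capability list held in a plain byte array. The names `find_ht_cap()` and `ht_type_matches()`, the byte array standing in for PCI config space, and the hop limit are all assumptions made for this example; it is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_ID_HT        0x08 /* PCI capability ID for HyperTransport */
#define HT_3BIT_CAP_MASK 0xE0
#define HT_5BIT_CAP_MASK 0xF8
#define HT_CAPTYPE_SLAVE 0x00
#define HT_CAPTYPE_HOST  0x20
#define HT_CAPTYPE_IRQ   0x80
#define MAX_HOPS         48   /* hypothetical guard against looping lists */

/* Slave and host types are encoded in 3 bits, all other types in 5 bits. */
static int ht_type_matches(uint8_t type_byte, uint8_t ht_cap)
{
    uint8_t mask = (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST)
                       ? HT_3BIT_CAP_MASK : HT_5BIT_CAP_MASK;

    return (type_byte & mask) == ht_cap;
}

/*
 * cfg:   image of a 256-byte config space (an assumption of this sketch)
 * first: offset of the first entry in the capability list
 * Returns the offset of the first HT capability of type ht_cap, or 0.
 */
static int find_ht_cap(const uint8_t cfg[256], uint8_t first, uint8_t ht_cap)
{
    uint8_t pos = first;
    int ttl = MAX_HOPS;

    assert(cfg != NULL);

    while (pos && ttl--) {
        if (pos > 252)          /* type byte at pos + 3 must be in range */
            return 0;
        /* byte 0 is the capability ID, byte 3 holds the HT type bits */
        if (cfg[pos] == CAP_ID_HT && ht_type_matches(cfg[pos + 3], ht_cap))
            return pos;
        pos = cfg[pos + 1];     /* byte 1 is the next-capability pointer */
    }

    return 0;
}
```

Calling `find_ht_cap(cfg, first, HT_CAPTYPE_IRQ)` on such an image gives the offset of the first matching capability, or 0 when there is none.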
To cater for searching exhaustively for a capability, we also provide pci_find_next_ht_capability(). We also need to cater for the fact that the HT capability fields may be either 3 or 5 bits wide. pci_find_ht_capability() deals with this for you, but callers using the #defines directly must handle that themselves. Signed-off-by: Michael Ellerman --- drivers/pci/pci.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++ include/linux/pci.h | 2 + include/linux/pci_regs.h | 12 +++++++++- 3 files changed, 68 insertions(+), 1 deletion(-) Index: msi/drivers/pci/pci.c =================================================================== --- msi.orig/drivers/pci/pci.c +++ msi/drivers/pci/pci.c @@ -215,6 +215,61 @@ int pci_find_ext_capability(struct pci_d EXPORT_SYMBOL_GPL(pci_find_ext_capability); /** + * pci_find_next_ht_capability - query a device's Hypertransport capabilities + * @dev: PCI device to query + * @pos: Position from which to continue searching + * @ht_cap: Hypertransport capability code + * + * To be used in conjunction with pci_find_ht_capability() to search for + * all capabilities matching @ht_cap. @pos should always be a value returned + * from pci_find_ht_capability(). + */ +int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap) +{ + int rc; + u8 cap, mask; + + if (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST) + mask = HT_3BIT_CAP_MASK; + else + mask = HT_5BIT_CAP_MASK; + + while (pos) { + rc = pci_read_config_byte(dev, pos + 3, &cap); + if (rc != PCIBIOS_SUCCESSFUL) + return 0; + + if ((cap & mask) == ht_cap) + return pos; + + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); + } + + return 0; +} +EXPORT_SYMBOL_GPL(pci_find_next_ht_capability); + +/** + * pci_find_ht_capability - query a device's Hypertransport capabilities + * @dev: PCI device to query + * @ht_cap: Hypertransport capability code + * + * Tell if a device supports a given Hypertransport capability. 
+ * Returns an address within the device's PCI configuration space + * or 0 in case the device does not support the request capability. + * The address points to the PCI capability, of type PCI_CAP_ID_HT, + * which has a Hypertransport capability matching @ht_cap. + */ +int pci_find_ht_capability(struct pci_dev *dev, int ht_cap) +{ + int pos; + + pos = pci_find_capability(dev, PCI_CAP_ID_HT); + return pci_find_next_ht_capability(dev, pos, ht_cap); +} +EXPORT_SYMBOL_GPL(pci_find_ht_capability); + +/** * pci_find_parent_resource - return resource region of parent bus of given region * @dev: PCI device structure contains resources to be searched * @res: child resource record for which parent is sought Index: msi/include/linux/pci.h =================================================================== --- msi.orig/include/linux/pci.h +++ msi/include/linux/pci.h @@ -453,6 +453,8 @@ struct pci_dev *pci_find_slot (unsigned int pci_find_capability (struct pci_dev *dev, int cap); int pci_find_next_capability (struct pci_dev *dev, u8 pos, int cap); int pci_find_ext_capability (struct pci_dev *dev, int cap); +int pci_find_ht_capability (struct pci_dev *dev, int ht_cap); +int pci_find_next_ht_capability (struct pci_dev *dev, int pos, int ht_cap); struct pci_bus *pci_find_next_bus(const struct pci_bus *from); struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device, Index: msi/include/linux/pci_regs.h =================================================================== --- msi.orig/include/linux/pci_regs.h +++ msi/include/linux/pci_regs.h @@ -468,9 +468,19 @@ #define PCI_PWR_CAP 12 /* Capability */ #define PCI_PWR_CAP_BUDGET(x) ((x) & 1) /* Included in system budget */ -/* Hypertransport sub capability types */ +/* + * Hypertransport sub capability types + * + * Unfortunately there are both 3 bit and 5 bit capability types defined + * in the HT spec, catering for that is a little messy. 
You probably don't + * want to use these directly, just use pci_find_ht_capability() and it + * will do the right thing for you. + */ +#define HT_3BIT_CAP_MASK 0xE0 #define HT_CAPTYPE_SLAVE 0x00 /* Slave/Primary link configuration */ #define HT_CAPTYPE_HOST 0x20 /* Host/Secondary link configuration */ + +#define HT_5BIT_CAP_MASK 0xF8 #define HT_CAPTYPE_IRQ 0x80 /* IRQ Configuration */ #define HT_CAPTYPE_REMAPPING_40 0xA0 /* 40 bit address remapping */ #define HT_CAPTYPE_REMAPPING_64 0xA2 /* 64 bit address remapping */ From michael at ellerman.id.au Wed Nov 8 22:40:44 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:44 +1100 Subject: [openib-general] [PATCH 3/6] Only check the HT capability bits in mpic.c In-Reply-To: <1163054442.964430.47244403003.qpush@cradle> Message-ID: <20061109064047.4DA1067C79@ozlabs.org> Only compare the exact HT capability bits against HT_CAPTYPE_IRQ, this is a little paranoid, but doesn't hurt. Signed-off-by: Michael Ellerman --- arch/powerpc/sysdev/mpic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: msi/arch/powerpc/sysdev/mpic.c =================================================================== --- msi.orig/arch/powerpc/sysdev/mpic.c +++ msi/arch/powerpc/sysdev/mpic.c @@ -341,7 +341,7 @@ static void __init mpic_scan_ht_pic(stru u8 id = readb(devbase + pos + PCI_CAP_LIST_ID); if (id == PCI_CAP_ID_HT) { id = readb(devbase + pos + 3); - if (id == HT_CAPTYPE_IRQ) + if ((id & HT_5BIT_CAP_MASK) == HT_CAPTYPE_IRQ) break; } } From michael at ellerman.id.au Wed Nov 8 22:40:44 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:44 +1100 Subject: [openib-general] [PATCH 4/6] Add #defines for Hypertransport MSI fields In-Reply-To: <1163054442.964430.47244403003.qpush@cradle> Message-ID: <20061109064047.B94A767C7B@ozlabs.org> Add a few #defines for grabbing and working with the address fields in a HT_CAPTYPE_MSI_MAPPING capability. All from the HT spec v3.00. 
Signed-off-by: Michael Ellerman --- include/linux/pci_regs.h | 7 +++++++ 1 file changed, 7 insertions(+) Index: msi/include/linux/pci_regs.h =================================================================== --- msi.orig/include/linux/pci_regs.h +++ msi/include/linux/pci_regs.h @@ -487,6 +487,13 @@ #define HT_CAPTYPE_UNITID_CLUMP 0x90 /* Unit ID clumping */ #define HT_CAPTYPE_EXTCONF 0x98 /* Extended Configuration Space Access */ #define HT_CAPTYPE_MSI_MAPPING 0xA8 /* MSI Mapping Capability */ +#define HT_MSI_FLAGS 0x02 /* Offset to flags */ +#define HT_MSI_FLAGS_ENABLE 0x1 /* Mapping enable */ +#define HT_MSI_FLAGS_FIXED 0x2 /* Fixed mapping only */ +#define HT_MSI_FIXED_ADDR 0x00000000FEE00000ULL /* Fixed addr */ +#define HT_MSI_ADDR_LO 0x04 /* Offset to low addr bits */ +#define HT_MSI_ADDR_LO_MASK 0xFFF00000 /* Low address bit mask */ +#define HT_MSI_ADDR_HI 0x08 /* Offset to high addr bits */ #define HT_CAPTYPE_DIRECT_ROUTE 0xB0 /* Direct routing configuration */ #define HT_CAPTYPE_VCSET 0xB8 /* Virtual Channel configuration */ #define HT_CAPTYPE_ERROR_RETRY 0xC0 /* Retry on error configuration */ From michael at ellerman.id.au Wed Nov 8 22:40:45 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:45 +1100 Subject: [openib-general] [PATCH 6/6] Use pci_find_ht_capability() in drivers/infiniband/hw/ipath/ipath_iba6110.c In-Reply-To: <1163054442.964430.47244403003.qpush@cradle> Message-ID: <20061109064048.A012067C7D@ozlabs.org> Use pci_find_ht_capability() in drivers/infiniband/hw/ipath/ipath_iba6110.c The old code made no guarantees about whether slave_or_pri_blk() was called before or after set_int_handler() - so I assume they're order-independent. We now always call slave_or_pri_blk() first (if at all), followed by set_int_handler().
Signed-off-by: Michael Ellerman --- drivers/infiniband/hw/ipath/ipath_iba6110.c | 31 ++++++++-------------------- 1 file changed, 9 insertions(+), 22 deletions(-) Index: msi/drivers/infiniband/hw/ipath/ipath_iba6110.c =================================================================== --- msi.orig/drivers/infiniband/hw/ipath/ipath_iba6110.c +++ msi/drivers/infiniband/hw/ipath/ipath_iba6110.c @@ -750,7 +750,6 @@ static int ipath_setup_ht_reset(struct i return 0; } -#define HT_INTR_DISC_CONFIG 0x80 /* HT interrupt and discovery cap */ #define HT_INTR_REG_INDEX 2 /* intconfig requires indirect accesses */ /* @@ -971,8 +970,7 @@ static int set_int_handler(struct ipath_ static int ipath_setup_ht_config(struct ipath_devdata *dd, struct pci_dev *pdev) { - int pos, ret = 0; - int ihandler = 0; + int pos; /* * Read the capability info to find the interrupt info, and also @@ -980,14 +978,8 @@ static int ipath_setup_ht_config(struct * do this early, before we ever enable errors or hardware errors, * mostly to avoid causing the chip to enter freeze mode. 
*/ - pos = pci_find_capability(pdev, PCI_CAP_ID_HT); - if (!pos) { - ipath_dev_err(dd, "Couldn't find HyperTransport " - "capability; no interrupts\n"); - ret = -ENODEV; - goto bail; - } - do { + pos = pci_find_ht_capability(pdev, HT_CAPTYPE_SLAVE); + if (pos) { u8 cap_type; /* the HT capability type byte is 3 bytes after the @@ -996,23 +988,18 @@ static int ipath_setup_ht_config(struct if (pci_read_config_byte(pdev, pos + 3, &cap_type)) { dev_info(&pdev->dev, "Couldn't read config " "command @ %d\n", pos); - continue; - } - if (!(cap_type & 0xE0)) + } else slave_or_pri_blk(dd, pdev, pos, cap_type); - else if (cap_type == HT_INTR_DISC_CONFIG) - ihandler = set_int_handler(dd, pdev, pos); - } while ((pos = pci_find_next_capability(pdev, pos, - PCI_CAP_ID_HT))); + } - if (!ihandler) { + pos = pci_find_ht_capability(pdev, HT_CAPTYPE_IRQ); + if (!pos || !set_int_handler(dd, pdev, pos)) { ipath_dev_err(dd, "Couldn't find interrupt handler in " "config space\n"); - ret = -ENODEV; + return -ENODEV; } -bail: - return ret; + return 0; } /** From michael at ellerman.id.au Wed Nov 8 22:40:45 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 09 Nov 2006 17:40:45 +1100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <1163054442.964430.47244403003.qpush@cradle> Message-ID: <20061109064048.32E5E67C7C@ozlabs.org> Use pci_find_ht_capability() in drivers/pci/quirks.c. I'm pretty sure the logic is unchanged here, but someone please eye-ball it for me. 
I've changed the message to be a little shorter, it's now: PCI: Found (enabled|disabled) HT MSI mapping on xxxx:xx:xx.x Signed-off-by: Michael Ellerman --- drivers/pci/quirks.c | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) Index: msi/drivers/pci/quirks.c =================================================================== --- msi.orig/drivers/pci/quirks.c +++ msi/drivers/pci/quirks.c @@ -1724,19 +1724,22 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AM * return 1 if a HT MSI capability is found and enabled */ static int __devinit msi_ht_cap_enabled(struct pci_dev *dev) { - u8 pos; - int ttl; - for (pos = pci_find_capability(dev, PCI_CAP_ID_HT), ttl = 48; - pos && ttl; - pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT), ttl--) { - u32 cap_hdr; - /* MSI mapping section according to Hypertransport spec */ - if (pci_read_config_dword(dev, pos, &cap_hdr) == 0 - && (cap_hdr & 0xf8000000) == 0xa8000000 /* MSI mapping */) { - printk(KERN_INFO "PCI: Found HT MSI mapping on %s with capability %s\n", - pci_name(dev), cap_hdr & 0x10000 ? "enabled" : "disabled"); - return (cap_hdr & 0x10000) != 0; /* MSI mapping cap enabled */ + int pos; + + pos = pci_find_ht_capability(dev, HT_CAPTYPE_MSI_MAPPING); + while (pos) { + u8 flags; + + if (pci_read_config_byte(dev, + pos + HT_MSI_FLAGS, &flags) == 0) { + printk(KERN_INFO "PCI: Found %s HT MSI Mapping on %s\n", + flags & HT_MSI_FLAGS_ENABLE ? 
+ "enabled" : "disabled", pci_name(dev)); + return (flags & HT_MSI_FLAGS_ENABLE) != 0; } + + pos = pci_find_next_ht_capability(dev, pos, + HT_CAPTYPE_MSI_MAPPING); } return 0; } From segher at kernel.crashing.org Thu Nov 9 00:01:58 2006 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Thu, 9 Nov 2006 09:01:58 +0100 Subject: [openib-general] [PATCH 1/6] Add pci_find_ht_capability() for finding Hypertransport capabilities In-Reply-To: <20061109064046.5726767C76@ozlabs.org> References: <20061109064046.5726767C76@ozlabs.org> Message-ID: <363CCF5E-2CB8-4AFB-A620-5BB2F57458AB@kernel.crashing.org> > +int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int > ht_cap) > +{ > + int rc; > + u8 cap, mask; > + > + if (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST) > + mask = HT_3BIT_CAP_MASK; > + else > + mask = HT_5BIT_CAP_MASK; > + + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); or the caller will loop forever if a second same type HT cap is found. > + while (pos) { > + rc = pci_read_config_byte(dev, pos + 3, &cap); > + if (rc != PCIBIOS_SUCCESSFUL) > + return 0; > + > + if ((cap & mask) == ht_cap) > + return pos; > + > + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); > + } > + > + return 0; > +} Segher From diego.guella at sircomtech.com Thu Nov 9 00:31:35 2006 From: diego.guella at sircomtech.com (Diego Guella) Date: Thu, 9 Nov 2006 09:31:35 +0100 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails References: <6C2C79E72C305246B504CBA17B5500C91BD834@mtlexch01.mtl.com> Message-ID: <009501c703d9$790cff70$05c8a8c0@DIEGO> Dotan, thanks for your reply. In my distribution I found ----- #define USER_HZ 100 ----- so I added this line ----- #define HZ 100 ----- in utils.h. Now, I get another error! I tried also to not include 'ipoibtools', as Tziporet told me, and this is the same error I got.
This is my previous message, with the error I got: ----- > > From: "Tziporet Koren" >> The failing is utility is used for IPoIB high availability. If you don't >> need to use them you can just change this line in ofed.conf: >> ipoibtools=n >> >> Tziporet >> > Thanks Tziporet for your answer. > > > Tried just right now, i disabled ipoibtools. I get another, more strange > error: > (attached OFED.3816.log) > ----- > /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples > cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs > Running: > ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck > --prefix /usr/local/ofed --libdir /usr/local/ofed/lib > CPPFLAGS="-I../libibverbs/include" > configure: creating cache /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > checking for a BSD-compatible install... /usr/bin/install -c > checking whether build environment is sane... yes > checking for gawk... gawk > checking whether make sets $(MAKE)... yes > checking build system type... x86_64-unknown-linux-gnu > checking host system type... x86_64-unknown-linux-gnu > checking for style of include used by make... GNU > checking for gcc... gcc > checking for C compiler default output file name... configure: error: C > compiler cannot create executables > See `config.log' for more details. > Failed to execute: > ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache --disable-libcheck > --prefix /usr/local/ofed --libdir /usr/local/ofed/lib > CPPFLAGS="-I../libibverbs/include" > error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) > ----- > > Am I right? It says my C compiler cannot create executables???? Is it joking > me???? > In the log file, line 6393, it says: > ----- > checking for C compiler default output file name... a.out > ----- > > I don't understand....! > Is there something I can do to fix this? 
> > > Thanks, > Diego > > ----- Why does line 6393 say my compiler's default output file name is a.out, while 5000 lines later it says my C compiler cannot create executables?? It seems this is the last error preventing me from getting OFED built! What can I do now? Thanks, Diego ----- Original Message ----- From: Dotan Barak To: Diego Guella ; openib-general at openib.org Sent: Wednesday, November 08, 2006 5:00 PM Subject: RE: [openib-general] Installation on openSUSE 10.2 Beta1 fails Hi Diego. You got the following output: utils.c: In function '__get_hz': utils.c:368: error: 'HZ' undeclared (first use in this function) utils.c:368: error: (Each undeclared identifier is reported only once utils.c:368: error: for each function it appears in.) because the macro HZ wasn't found by the compiler. On my machine HZ is defined in 2 files: 1) asm/param.h: #define HZ sysconf(_SC_CLK_TCK) 2) in asm-x86_64/param.h I noticed the following code (this define is being used in the compilation process in my 1000 MHz machine): #ifndef HZ #define HZ 100 #endif I believe that there is a missing include in this distribution. I think that if you add the latter lines everything will work ...
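A minimal sketch of that fallback, for reference. The `get_hz()` wrapper and the preference for `sysconf(_SC_CLK_TCK)` are assumptions made for this example; this is not the code from iproute2's utils.c.

```c
#include <assert.h>
#include <unistd.h>

/* Fallback for distributions whose headers do not provide HZ. */
#ifndef HZ
#define HZ 100
#endif

/* Hypothetical helper: ask the system first, fall back to HZ. */
static long get_hz(void)
{
    long ticks = sysconf(_SC_CLK_TCK);
    long hz = (ticks > 0) ? ticks : HZ;

    assert(hz > 0); /* the fallback itself must be a positive value */
    return hz;
}
```

With the guarded #define in place, code that refers to HZ compiles whether or not the system headers define it.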
Dotan -----Original Message----- From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Diego Guella Sent: Wednesday, November 08, 2006 11:21 AM To: openib-general at openib.org Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails <> ----- ERROR: Failed executing "rpmbuild --rebuild --define '_topdir /var/tmp/OFEDRPM' --define '_prefix /usr/local/ofed' --define 'build_root /var/tmp/OFED' --define 'configure_options --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools --with-mstflint --with-perftest --with-ipath_inf-mod --with-ipoib-mod --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod' --define 'configure_options32 --with-dapl --with-ipoibtools --with-libibcm --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm --with-libsdp --with-openib-diags --with-srptools ' --define 'KVERSION 2.6.18.1-13-default' --define 'KSRC /lib/modules/2.6.18.1-13-default/build' --define 'build_kernel_ib 1' --define 'build_kernel_ib_devel 1' --define 'NETWORK_CONF_DIR /etc/sysconfig/network' --define 'modprobe_update 1' --define 'include_ipoib_conf 1' --define 'build_32bit 1' /opt/nfs_public/OFED-1.1/SRPMS/openib-1.1-0.src.rpm" See log file: /tmp/OFED.29778.log ----- OFED.29778.log attached. 
The 'interesting' part of the log is: ----- gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include -DRESOLVE_HOSTNAMES -c -o utils.o utils.c utils.c: In function 'inet_addr_match': utils.c:333: warning: initialization discards qualifiers from pointer target type utils.c:334: warning: initialization discards qualifiers from pointer target type utils.c: In function '__get_hz': utils.c:368: error: 'HZ' undeclared (first use in this function) utils.c:368: error: (Each undeclared identifier is reported only once utils.c:368: error: for each function it appears in.) make[2]: *** [utils.o] Error 1 make[2]: Leaving directory `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2/lib' make[1]: *** [lib] Error 2 make[1]: Leaving directory `/var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/ipoibtools/iproute2' make: *** [ipoibtools] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.4033 (%install) ----- Here, I give up: I don't understand where 'HZ' should be defined, or where it comes from. Where is the problem? Is there a solution to fix it? Thanks, Diego -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OFED.3816.log.tar.gz Type: application/octet-stream Size: 93119 bytes Desc: not available URL: From brice at myri.com Thu Nov 9 01:05:18 2006 From: brice at myri.com (Brice Goglin) Date: Thu, 09 Nov 2006 10:05:18 +0100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <20061109064048.32E5E67C7C@ozlabs.org> References: <20061109064048.32E5E67C7C@ozlabs.org> Message-ID: <4552EF4E.4030408@myri.com> You don't have any TTL in the while loop below, neither in the while loop in pci_find_next_ht_capability(). 
It's paranoid, but I'd rather keep a TTL in both loops (a brain-damaged capability chain in the PCI config space could lead to an infinite loop without any clue of what's going on, not easy to find out...). Apart from that, I like the idea. Brice Michael Ellerman wrote: > Use pci_find_ht_capability() in drivers/pci/quirks.c. > > I'm pretty sure the logic is unchanged here, but someone please eye-ball it > for me. I've changed the message to be a little shorter, it's now: > > PCI: Found (enabled|disabled) HT MSI mapping on xxxx:xx:xx.x > > Signed-off-by: Michael Ellerman > --- > > drivers/pci/quirks.c | 27 +++++++++++++++------------ > 1 file changed, 15 insertions(+), 12 deletions(-) > > Index: msi/drivers/pci/quirks.c > =================================================================== > --- msi.orig/drivers/pci/quirks.c > +++ msi/drivers/pci/quirks.c > @@ -1724,19 +1724,22 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AM > * return 1 if a HT MSI capability is found and enabled */ > static int __devinit msi_ht_cap_enabled(struct pci_dev *dev) > { > - u8 pos; > - int ttl; > - for (pos = pci_find_capability(dev, PCI_CAP_ID_HT), ttl = 48; > - pos && ttl; > - pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT), ttl--) { > - u32 cap_hdr; > - /* MSI mapping section according to Hypertransport spec */ > - if (pci_read_config_dword(dev, pos, &cap_hdr) == 0 > - && (cap_hdr & 0xf8000000) == 0xa8000000 /* MSI mapping */) { > - printk(KERN_INFO "PCI: Found HT MSI mapping on %s with capability %s\n", > - pci_name(dev), cap_hdr & 0x10000 ? "enabled" : "disabled"); > - return (cap_hdr & 0x10000) != 0; /* MSI mapping cap enabled */ > + int pos; > + > + pos = pci_find_ht_capability(dev, HT_CAPTYPE_MSI_MAPPING); > + while (pos) { > + u8 flags; > + > + if (pci_read_config_byte(dev, > + pos + HT_MSI_FLAGS, &flags) == 0) { > + printk(KERN_INFO "PCI: Found %s HT MSI Mapping on %s\n", > + flags & HT_MSI_FLAGS_ENABLE ? 
> + "enabled" : "disabled", pci_name(dev)); > + return (flags & HT_MSI_FLAGS_ENABLE) != 0; > } > + > + pos = pci_find_next_ht_capability(dev, pos, > + HT_CAPTYPE_MSI_MAPPING); > } > return 0; > } > From segher at kernel.crashing.org Thu Nov 9 01:43:25 2006 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Thu, 9 Nov 2006 10:43:25 +0100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <4552EF4E.4030408@myri.com> References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> Message-ID: <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> > You don't have any TTL in the while loop below, neither in the while > loop in pci_find_next_ht_capability(). It's paranoid, but I'd rather > keep a TTL in both loops (a brain-damaged capability chain in the PCI > config space could lead to an infinite loop without any clue of what's > going on, not easy to find out...). There's so many other ways broken PCI headers can cause problems, it's just not funny. You can't catch all of them however hard you try. I always thought the super-over-the-top paranoia checks in the generic PCI capability list walkers were workarounds for problems actually observed in the field; can we do the same for the HT-specific walker? I.e., don't implement the workaround before we know we need it. Segher From mst at mellanox.co.il Thu Nov 9 02:02:29 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 9 Nov 2006 12:02:29 +0200 Subject: [openib-general] Fwd: IPoIB new multicast API patches oops Message-ID: <20061109100229.GF14960@mellanox.co.il> Following Sean's suggestion, I have let the nightly tests run with Roland's mad patch in addition to Sean's new multicast interface patches (v2), and got the following crash: Nov 9 10:59:43 sw084 kernel: NET: Unregistered protocol family 27 Nov 9 10:59:46 sw084 net.agent[25786]: remove event not handled Nov 9 10:59:46 sw084 net.agent[25787]: remove event not handled Nov 9 10:59:46 sw084 net.agent[25818]: remove event not handled Nov 9 10:59:46 sw084 net.agent[25814]: remove event not handled Nov 9 10:59:46 sw084 kernel: BUG: spinlock bad magic on CPU#1, ib_mad2/1588 Nov 9 10:59:46 sw084 kernel: general protection fault: 0000 [1] SMP Nov 9 10:59:46 sw084 kernel: CPU 1 Nov 9 10:59:46 sw084 kernel: Modules linked in: nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc vfat fat dm_mirror dm_mod button b attery ac ohci_hcd ehci_hcd i2c_nforce2 i2c_core ib_mthca ib_umad ib_sa ib_mad ib_core tg3 ext3 jbd sata_nv libata mptsas scsi_transport_sas sd_mod Nov 9 10:59:46 sw084 kernel: Pid: 1588, comm: ib_mad2 Not tainted 2.6.17.7 #3 Nov 9 10:59:46 sw084 kernel: RIP: 0010:[] {spin_bug+116} Nov 9 10:59:46 sw084 kernel: RSP: 0018:ffff81013c611ca8 EFLAGS: 00010002 Nov 9 10:59:46 sw084 kernel: RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffffffff8044c057 Nov 9 10:59:46 sw084 kernel: RDX: ffffffff804a7f18 RSI: 0000000000000046 RDI: ffffffff804a7f00 Nov 9 10:59:46 sw084 kernel: RBP: ffff81013c48a320 R08: 00000000ffffffff R09: 0000000000000000 Nov 9 10:59:46 sw084 kernel: R10: 0000000100000000 R11: ffffffff802160df R12: ffff81013c48a318 Nov 9 10:59:46 sw084 kernel: R13: 0000000000000212 R14: 0000000000000000 R15: ffffffff8808f2d8 Nov 9 10:59:46 sw084 kernel: FS: 00002af8d7bd8b00(0000) GS:ffff81013fc616d0(0000) knlGS:00000000f7f796c0 Nov 9 10:59:46 sw084 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 
000000008005003b Nov 9 10:59:46 sw084 kernel: CR2: 00000000005b719c CR3: 000000013ad5c000 CR4: 00000000000006e0 Nov 9 10:59:46 sw084 kernel: Process ib_mad2 (pid: 1588, threadinfo ffff81013c610000, task ffff81013f5c29f0) Nov 9 10:59:46 sw084 kernel: Stack: 0000000000000003 ffff81013c48a320 ffff81013c48a320 ffffffff802ddc8d Nov 9 10:59:46 sw084 kernel: ffff81013bb722f0 ffff81013c48a320 ffff81013c48a318 ffffffff80428b2b Nov 9 10:59:46 sw084 kernel: 0000000000000246 ffffffff88097eff Nov 9 10:59:46 sw084 kernel: Call Trace: {_raw_spin_lock+28} {_spin_lock_irqsave+11} Nov 9 10:59:46 sw084 kernel: {:ib_sa:release_group+27} {:ib_sa:mcast_work_handler+1280} Nov 9 10:59:46 sw084 kernel: {_spin_unlock_irq+7} {:ib_mad:timeout_sends+0} Nov 9 10:59:46 sw084 kernel: {:ib_sa:ib_sa_mcmember_rec_callback+64} Nov 9 10:59:46 sw084 kernel: {_spin_unlock_irq+7} {:ib_sa:send_handler+74} Nov 9 10:59:46 sw084 kernel: {:ib_mad:timeout_sends+397} {run_workqueue+161} Nov 9 10:59:46 sw084 kernel: {worker_thread+0} {keventd_create_kthread+0} Nov 9 10:59:46 sw084 kernel: {worker_thread+261} {default_wake_function+0} Nov 9 10:59:46 sw084 kernel: {keventd_create_kthread+0} {default_wake_function+0} Nov 9 10:59:46 sw084 kernel: {keventd_create_kthread+0} {kthread+200} Nov 9 10:59:46 sw084 kernel: {child_rip+8} {keventd_create_kthread+0} Nov 9 10:59:46 sw084 kernel: {kthread+0} {child_rip+0} Nov 9 10:59:46 sw084 kernel: Nov 9 10:59:46 sw084 kernel: Code: 44 8b 83 04 01 00 00 48 8d 8b a0 02 00 00 8b 55 04 41 89 c1 Nov 9 10:59:46 sw084 kernel: RIP {spin_bug+116} RSP Nov 9 10:59:46 sw084 kernel: <3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43 Nov 9 10:59:46 sw084 kernel: in_atomic():0, irqs_disabled():1 Nov 9 10:59:46 sw084 kernel: Nov 9 10:59:46 sw084 kernel: Call Trace: {__might_sleep+190} {blocking_notifier_call_chain+31} Nov 9 10:59:46 sw084 kernel: {do_exit+34} {_spin_lock_irqsave+11} Nov 9 10:59:46 sw084 kernel: {vgacon_set_cursor_size+51} 
{:ib_mad:timeout_sends+0} Nov 9 10:59:46 sw084 kernel: {do_divide_error+0} {do_general_protection+254} Nov 9 10:59:46 sw084 kernel: {error_exit+0} {:ib_mad:timeout_sends+0} Nov 9 10:59:46 sw084 kernel: {flat_send_IPI_mask+0} {spin_bug+116} Nov 9 10:59:46 sw084 kernel: {spin_bug+97} {_raw_spin_lock+28} Nov 9 10:59:46 sw084 kernel: {_spin_lock_irqsave+11} {:ib_sa:release_group+27} Nov 9 10:59:46 sw084 kernel: {:ib_sa:mcast_work_handler+1280} {_spin_unlock_irq+7} Nov 9 10:59:46 sw084 kernel: {:ib_mad:timeout_sends+0} {:ib_sa:ib_sa_mcmember_rec_callback+64} Nov 9 10:59:46 sw084 kernel: {_spin_unlock_irq+7} {:ib_sa:send_handler+74} Nov 9 10:59:46 sw084 kernel: {:ib_mad:timeout_sends+397} {run_workqueue+161} Nov 9 10:59:47 sw084 kernel: {worker_thread+0} {keventd_create_kthread+0} Nov 9 10:59:47 sw084 kernel: {worker_thread+261} {default_wake_function+0} Nov 9 10:59:47 sw084 kernel: {keventd_create_kthread+0} {default_wake_function+0} Nov 9 10:59:47 sw084 kernel: {keventd_create_kthread+0} {kthread+200} Nov 9 10:59:47 sw084 kernel: {child_rip+8} {keventd_create_kthread+0} Nov 9 10:59:47 sw084 kernel: {kthread+0} {child_rip+0} Nov 9 11:00:45 sw084 sshd(pam_unix)[25846]: session opened for user root by root(uid=0) -- MST From jsquyres at cisco.com Thu Nov 9 04:02:23 2006 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 9 Nov 2006 07:02:23 -0500 Subject: [openib-general] [mvapich-discuss] This is the last time I'm asking... In-Reply-To: <44892EBC-6E14-4AED-BEDF-4000BA51D569@cisco.com> References: <200611061453.kA6Er91h001869@xi.cse.ohio-state.edu> <44892EBC-6E14-4AED-BEDF-4000BA51D569@cisco.com> Message-ID: <15F67B2D-7592-4578-A642-3C8A2B4E2D4F@cisco.com> DK -- Are you going to answer my questions? On Nov 6, 2006, at 11:27 AM, Jeff Squyres wrote: > As I explained in my mail, no one had replied to any of the posts > containing my very directed and specific questions (not even you -- > and you still haven't), so I figured that no one cared. 
That's not > an unreasonable assumption given that I posted the same questions 3 > times and got silence in return. > > I am unaware of any special "right" required to make a motion. Are > there some protocols (perhaps a la Robert's Rules of Order) that > are typically used for making a motion? I haven't seen any...? > > The agenda for the SC Developer's Summit is already over-full. > This conversation is fine to begin in e-mail; a good start would be > answering my original questions. Thanks! > > > On Nov 6, 2006, at 9:53 AM, Dhabaleswar Panda wrote: > >> Jeff: >> >> May I know on with what `right' you are making this motion to remove >> the code. >> >> To have the code there was decided by the OpenIB community and the >> organizers. It needs to be decided by the community, not by an >> individual person. >> >> Let me suggest that we we discuss this at the Developers Summit at SC >> '06. If the Open Fabrics community no longer wants the code to be >> there and will prefer to download it from the OSU SVN site, we can >> proceed accordingly. >> >> Thanks, >> >> DK >> >> >>> Having received no replies for 2 weeks as to why it is useful to >>> have >>> MVAPICH in the OpenFabrics SVN, I can only conclude that no one >>> cares. If someone does care, please respond to my original >>> questions >>> included below ASAP (originally posted 23 Oct, 27 Oct, 1 Nov). >>> >>> I therefore make the motion to remove MVAPICH from the OpenFabrics >>> SVN (all the source is still available via the OSU SVN and other >>> distribution points). Specifically, I motion to do the following >>> around COB tomorrow (7 Nov 2006): >>> >>> svn rm https://openib.org/svn/gen2/trunk/src/userspace/mpi >>> >>> Any objections? >>> >>> >>> >>> On Nov 1, 2006, at 10:53 AM, Jeff Squyres wrote: >>> >>>> Forwarding this to the mvapich-discuss list because it has gotten >>>> zero replies on the openib-general list. If someone from OSU could >>>> reply, it would be most helpful. Thanks. 
>>>> >>>> >>>> Begin forwarded message: >>>> >>>>> From: Jeff Squyres >>>>> Date: October 27, 2006 11:05:17 AM EDT >>>>> To: openib >>>>> Subject: Re: [mvapich] Announcing the release of MVAPICH2 0.9.6 >>>>> with on-demand connection management, multi-core optimized shared >>>>> memory communication and memory hook support >>>>> >>>>> Any response from the OSU crew? >>>>> >>>>> Can someone provide a reason why MVAPICH is still in OpenIB's >>>>> Subversion repository? Please see my original mail, below, for >>>>> more detailed questions. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> On Oct 23, 2006, at 7:36 AM, Jeff Squyres wrote: >>>>> >>>>>> On Oct 22, 2006, at 11:53 PM, Dhabaleswar Panda wrote: >>>>>> >>>>>>> A stripped down version of this release is also available at the >>>>>>> OpenIB SVN. >>>>>> >>>>>> I see this statement in every MVAPICH release notice and it >>>>>> continues to puzzle me. >>>>>> >>>>>> I understand that there was a use for an alternate distribution >>>>>> source before MVAPICH became open source. But now that the >>>>>> MVAPICH code bases are freely available from OSU via multiple >>>>>> mechanisms (anonymous SVN, tarball download, etc.), why is a >>>>>> "stripped down version" maintained in the OpenIB SVN? >>>>>> >>>>>> 1. What, exactly, is the difference between the MVAPICH available >>>>>> from OSU and the "stripped down version" in the OpenIB SVN? >>>>>> >>>>>> 2. Why would someone choose to download the "stripped down >>>>>> version" from the OpenIB SVN? Have any real users/customers done >>>>>> so? >>>>>> >>>>>> 3. What is the point of maintaining yet more flavors of MVAPICH >>>>>> -- aren't there enough already (multiple versions from OSU, >>>>>> more versions available from each IB vendor)? >>>>>> >>>>>> DK -- can you please explain? Thanks. 
>>>>>> >>>>>> -- >>>>>> Jeff Squyres >>>>>> Server Virtualization Business Unit >>>>>> Cisco Systems >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Jeff Squyres >>>>> Server Virtualization Business Unit >>>>> Cisco Systems >>>>> >>>>> >>>> >>>> >>>> -- >>>> Jeff Squyres >>>> Server Virtualization Business Unit >>>> Cisco Systems >>>> >>>> >>> >>> >>> -- >>> Jeff Squyres >>> Server Virtualization Business Unit >>> Cisco Systems >>> >>> _______________________________________________ >>> mvapich-discuss mailing list >>> mvapich-discuss at cse.ohio-state.edu >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>> > > > -- > Jeff Squyres > Server Virtualization Business Unit > Cisco Systems > > -- Jeff Squyres Server Virtualization Business Unit Cisco Systems From greg at kroah.com Thu Nov 9 06:17:33 2006 From: greg at kroah.com (Greg KH) Date: Thu, 9 Nov 2006 23:17:33 +0900 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> Message-ID: <20061109141733.GA11499@kroah.com> On Thu, Nov 09, 2006 at 10:43:25AM +0100, Segher Boessenkool wrote: > >You don't have any TTL in the while loop below, neither in the while > >loop in pci_find_next_ht_capability(). It's paranoid, but I'd rather > >keep a TTL in both loops (a brain-damaged capability chain in the PCI > >config space could lead to an infinite loop without any clue of what's > >going on, not easy to find out...). > > There's so many other ways broken PCI headers can cause > problems, it's just not funny. You can't catch all of > them however hard you try. 
> > I always thought the super-over-the-top paranoia checks > in the generic PCI capability list walkers were workarounds > for problems actually observed in the field; can we do the > same for the HT-specific walker? I.e., don't implement > the workaround before we know we need it. While yes, we should not in general add new workarounds before we need them, for this quirk, you should keep the original functionality, unless you wrote the quirk, or unless you have the hardware that needs it and you can verify that the change works properly. Are any of these last two options true for you? If not, I suggest that you put the TTL logic back in just to be safe. thanks, greg k-h From segher at kernel.crashing.org Thu Nov 9 06:43:47 2006 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Thu, 9 Nov 2006 15:43:47 +0100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <20061109141733.GA11499@kroah.com> References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> Message-ID: > While yes, we should not in general add new workarounds before we need > them, for this quirk, you should keep the original functionality, > unless > you wrote the quirk, or unless you have the hardware that needs it and > you can verify that the change works properly. > > Are any of these last two options true for you? This new code only runs on HyperTransport devices and none of those _existed_ when the quirk was first written. I cannot claim I know for sure it is never needed there of course, but it's quite improbable at least. > If not, I suggest that you put the TTL logic back in just to be safe. 
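For readers following the TTL argument, here is a minimal sketch of a hop-bounded capability walk. It runs against a simulated 256-byte config space rather than real hardware, and it is not the kernel's walker: `find_cap_bounded`, `CFG_SPACE_SIZE`, `CAP_PTR_OFFSET` and `WALK_TTL` are names and values made up for this illustration. The only facts it leans on are that the first-capability pointer sits at offset 0x34 and that each capability stores its ID at `pos` and the next pointer at `pos + 1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE 256   /* size of the simulated config space */
#define CAP_PTR_OFFSET 0x34  /* where the first-capability pointer lives */
#define WALK_TTL       48    /* hop budget; value chosen for illustration */

/*
 * Walk a capability chain in a simulated config space (an array of
 * CFG_SPACE_SIZE bytes) and return the offset of the first capability
 * whose ID byte equals cap_id.  Returns 0 if the capability is not
 * found or if the hop budget runs out, which is what happens when a
 * broken chain loops back on itself.
 */
static int find_cap_bounded(const uint8_t *cfg, uint8_t cap_id)
{
    int ttl = WALK_TTL;
    uint8_t pos;

    assert(cfg != NULL);
    pos = cfg[CAP_PTR_OFFSET];

    while (ttl-- > 0) {
        if (pos < 0x40)          /* pointer into the header: end of list */
            break;
        pos &= 0xFC;             /* capabilities are dword aligned */
        if (cfg[pos] == 0xFF)    /* all-ones read: treat as absent */
            break;
        if (cfg[pos] == cap_id)
            return pos;
        pos = cfg[pos + 1];      /* follow the next pointer */
    }
    return 0;
}
```

Without the `ttl` counter, a chain whose next pointers form a cycle (0x40 -> 0x50 -> 0x40) would keep this loop spinning forever; with it, the walk simply gives up and reports "not found".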
I'm fine with that -- but I'm not writing the code here, Michael is, and I just hope he has more spine than I do ;-) Segher From ogerlitz at voltaire.com Thu Nov 9 07:20:47 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 09 Nov 2006 17:20:47 +0200 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <454A3476.6090402@ichips.intel.com> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> <454A3476.6090402@ichips.intel.com> Message-ID: <4553474F.6040701@voltaire.com> Sean Hefty wrote: > Or Gerlitz wrote: >> Can be very nice if you share with the community the IB stack issues >> revealed under scale-out testing... basically what was the testbed? > We have a 256 node (512 processors) cluster that we can test with on the > second Tuesday following the first Monday of any month with two full > moons. We're only now getting some time on the cluster, and our test > capabilities are limited. > The main issue that we saw was that the SA simply doesn't scale. I see. Thanks for the detailed response and sorry for the no reply on my side so far, i was too busy... Your email describes the problem under the all-to-all connection model. My thinking is that this design is the first one to be revisited, i understand that open mpi opens connections on demand (at this point of time it does not use the ib stack connection management services as well). Even in the all-to-all-conn model, a question to ask is if the connecting is done in N phases or for all ranks you just call in a loop for(j=i+1; j At 5000 queries per second, it will take the SA nearly 30 seconds to > respond to the first set of requests, most of which will have timed > out. By the time it reached the end of the first 130,000 requests, it > had hundreds of thousands of queued retries, most of which had also > already timed out. (E.g. 
even with an exponential backoff, you'd have > retries at 4 seconds, 12 seconds, and 28 seconds before the SA can > finish processing the first set of requests.) > To further complicate the issue, retried requests are given new > transaction IDs by the ib_sa module, which makes it impossible for the > SA to detect retries from original requests. It sees all requests as > new. On our largest run, we were never able to complete route resolution. OK, I recall some patch or RFC you posted which lets a response to the original request match a "pending retry"; basically it means that all the retries use the TID of the original request, correct? Am I dreaming, or is this indeed somewhere in the pipeline to the kernel? > We're still exploring possibilities in this area. Or. From dotanb at dev.mellanox.co.il Thu Nov 9 07:23:59 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Thu, 09 Nov 2006 17:23:59 +0200 Subject: [openib-general] what should happen in a completion event channel is being destroyed when there are several CQs associated to it? Message-ID: <4553480F.80000@dev.mellanox.co.il> Hi. What should happen if a completion event channel is destroyed while several CQs are still associated with it? Should this operation fail (return EBUSY)? Should this operation pass? Is it legal for a user to perform this operation? When I tried it and later waited for a completion on this event channel, I got a seg fault...
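To make the "fail with EBUSY" option concrete, here is a small self-contained sketch of reference counting between CQs and their channel. Every name in it (`demo_channel`, `demo_cq`, `demo_create_cq`, `demo_destroy_cq`, `demo_destroy_channel`) is hypothetical; it models one possible policy and makes no claim about what libibverbs or the kernel actually does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for a completion channel and a CQ. */
struct demo_channel {
    int refcnt;   /* number of CQs currently attached */
    int open;     /* 1 while the channel is usable */
};

struct demo_cq {
    struct demo_channel *channel;   /* may be NULL: CQ without a channel */
};

/* Attach a CQ to a channel and take a reference on the channel. */
static void demo_create_cq(struct demo_cq *cq, struct demo_channel *ch)
{
    assert(cq != NULL);
    cq->channel = ch;
    if (ch != NULL)
        ch->refcnt++;
}

/* Detach a CQ and drop its reference. */
static void demo_destroy_cq(struct demo_cq *cq)
{
    assert(cq != NULL);
    if (cq->channel != NULL) {
        cq->channel->refcnt--;
        cq->channel = NULL;
    }
}

/* Refuse to tear the channel down while any CQ still points at it. */
static int demo_destroy_channel(struct demo_channel *ch)
{
    assert(ch != NULL);
    if (ch->refcnt > 0)
        return EBUSY;
    ch->open = 0;
    return 0;
}
```

Under this policy the caller has to destroy (or detach) every CQ first; destroying the channel early returns EBUSY instead of leaving CQs with a dangling pointer that a later wait would dereference.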
thanks Dotan From vlad at dev.mellanox.co.il Thu Nov 9 07:48:32 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 09 Nov 2006 17:48:32 +0200 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails In-Reply-To: <00a301c70347$accee4f0$05c8a8c0@DIEGO> References: <003201c70317$36911f40$05c8a8c0@DIEGO> <4551E7C5.8020700@dev.mellanox.co.il> <00a301c70347$accee4f0$05c8a8c0@DIEGO> Message-ID: <45534DD0.3010803@dev.mellanox.co.il> Hello Diego, Check that you have libstdc++, libstdc++-devel and compat-libstdc++ RPMs installed. Regards, Vladimir Diego Guella wrote: > > From: "Tziporet Koren" >> The failing is utility is used for IPoIB high availability. If you >> don't need to use them you can just change this line in ofed.conf: >> ipoibtools=n >> >> Tziporet >> > Thanks Tziporet for your answer. > > > Tried just right now, i disabled ipoibtools. I get another, more > strange error: > (attached OFED.3816.log) > ----- > /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples > cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs > Running: ./configure > --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > --disable-libcheck --prefix /usr/local/ofed --libdir > /usr/local/ofed/lib CPPFLAGS="-I../libibverbs/include" > configure: creating cache > /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > checking for a BSD-compatible install... /usr/bin/install -c > checking whether build environment is sane... yes > checking for gawk... gawk > checking whether make sets $(MAKE)... yes > checking build system type... x86_64-unknown-linux-gnu > checking host system type... x86_64-unknown-linux-gnu > checking for style of include used by make... GNU > checking for gcc... gcc > checking for C compiler default output file name... configure: error: > C compiler cannot create executables > See `config.log' for more details. 
> Failed to execute: ./configure > --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > --disable-libcheck --prefix /usr/local/ofed --libdir > /usr/local/ofed/lib CPPFLAGS="-I../libibverbs/include" > error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) > ----- > > Am I right? It says my C compiler cannot create executables???? Is it > joking me???? > In the log file, line 6393, it says: > ----- > checking for C compiler default output file name... a.out > ----- > > I don't understand....! > Is there something I can do to fix this? > > > Thanks, > Diego > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Nov 9 07:54:17 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Nov 2006 07:54:17 -0800 Subject: [openib-general] scaling issues, was: uDAPL cma: add support for address and route retries, call disconnect when recving dreq In-Reply-To: <4553474F.6040701@voltaire.com> References: <000a01c6fe17$19b4f3b0$bb97070a@amr.corp.intel.com> <4549D3B7.1050208@voltaire.com> <454A3476.6090402@ichips.intel.com> <4553474F.6040701@voltaire.com> Message-ID: <45534F29.5000608@ichips.intel.com> Or Gerlitz wrote: > for(j=i+1; j dat_ep_connect(ep[j], ip-address of peer j) > > > and then > > while(there are more non established connections) > dat_evd_wait(...) I'm not overly familiar with the the MPI code, so I can't comment on the implementation. > OK, i recall some patch or rfc you have posted which enables a response > on original request match a "pending retry", basically it means that all > the retries use the TID of the original request, correct? am i dreaming > so this is indeed somewhere in the pipe to the kernel? 
I have a patch that exposed the mad layer retry count up through the SA query code. However, I'm not sure that it helps us all that much without additional changes. Detecting duplicate requests is left as a responsibility to the receiver, and retries are issued using a linear timeout. - Sean From thomas.bub at thomson.net Thu Nov 9 08:36:51 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Thu, 9 Nov 2006 17:36:51 +0100 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. Message-ID: As written before, I have to connect a gen2 client with a gen1 server using CM. The connection is established fine and both sides go into the connected state. However, I can't send any data from either side. As soon as I do so I get a 0x81 VAPI_RETRY_EXC_ERR when trying to send from the gen1 or a vendor_err 0x81 when trying to send from the gen2 side. What works so far with my gen1 and gen2 code is: Connecting gen1 client and gen1 server is no problem. Connecting gen2 client and gen2 server is no problem. Connecting gen1 client and gen2 server is no problem. Unfortunately the only usage scenario I have is to connect gen2 client and gen1 server. :-( Is there a way to diagnose the cause when a VAPI_SEND / IBV_WR_SEND returns that 0x81 VAPI_RETRY_EXC_ERR? The receiving side, which is definitely in receive mode, gets no notification at all, as if the sender never even starts sending. I already printed out the QP state of both the gen2 client and the gen1 server before the send fails. They look like this: Gen2 client QP state.
qp_state: 3
cur_qp_state: 3
path_mtu: 4
path_mig_state: 0
qkey: 16391
rq_psn: 5571589
sq_psn: 15025863
dest_qp_num: 10814473
qp_access_flags: 14
cap: 256
ah_attr: 256
alt_ah_attr: 29
pkey_index: 3
alt_pkey_index: 460
en_sqd_async_notify: 33022
sq_draining: 0
max_rd_atomic: 82905088
max_dest_rd_atomic: 219649539
min_rnr_timer: 0
port_num: 16384
timeout: 50331753
retry_cnt: 65795
rnr_retry: 0
alt_port_num: 0
alt_timeout: 0

Gen1 server QP state.

qp_state: 3
en_sqd_asyn_notif: 0
sq_draining: 0
qp_num: 10814473
remote_atomic_flags: 7
qkey: 0
path_mtu: 4
path_mig_state: 0
rq_psn: 15025863
sq_psn: 5571589
qp_ous_rd_atom: 4
ous_dst_rd_atom: 4
min_rnr_timer: 27
cap: 200
dest_qp_num: 200
sched_queue: 28
pkey_ix: 28
port: 460
av: 5571589
timeout: 0
retry_count: 0
rnr_retry: 1
alt_pkey_ix: 0
alt_port: 0
alt_av: 0
alt_timeout: 0

In the good cases described above, the QP states look similar to the ones shown here.

Any help welcome.

Thomas Bub

............................................................
Thomas Bub
Grass Valley Germany GmbH
Brunnenweg 9
64331 Weiterstadt, Germany
Tel: +49 6150 104 147
Fax: +49 6150 104 656
Email: Thomas.Bub at thomson.net
www.GrassValley.com
............................................................

From mshefty at ichips.intel.com Thu Nov 9 09:56:06 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 09 Nov 2006 09:56:06 -0800
Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server.
In-Reply-To:
References:
Message-ID: <45536BB6.9010100@ichips.intel.com>

Bub Thomas wrote:
> As soon as I do so I get a 0x81 VAPI_RETRY_EXC_ERR when trying to send
> from the gen1 or a vendor_err 0x81 when trying to send from the gen2 side.

It sounds like there's an issue with the QP configuration. Maybe the two sides differ in the byte order in which the QP attributes are specified?
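One quick sanity check on the byte-order question, offered as a sketch rather than a diagnosis: QP numbers and PSNs are 24-bit quantities, so a value that was stored in network order and then printed as a host-order 32-bit word on a little-endian machine would normally have bits set in its top byte. The helper names below (`swap32`, `plausible_24bit`, `check_reported_values`) are made up for this example; the constants are the PSN and QPN values from the dumps above.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* A QPN or PSN in the right byte order must fit in 24 bits. */
static int plausible_24bit(uint32_t v)
{
    return (v & 0xFF000000u) == 0;
}

/* Apply the check to the PSN and QPN values printed in the dumps. */
static void check_reported_values(void)
{
    assert(plausible_24bit(10814473u));   /* qp_num / dest_qp_num: 0xA50409 */
    assert(plausible_24bit(5571589u));    /* PSN: 0x550405 */
    assert(plausible_24bit(15025863u));   /* PSN: 0xE546C7 */
}
```

All three values fit in 24 bits, and their byte-swapped forms (for example `swap32(0x00A50409)` is `0x0904A500`) do not, which suggests those particular fields were printed in host order. This only rules out a plain 32-bit swap of the PSN and QPN fields; it says nothing about the other attributes in the dumps.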
> rq_psn: 5571589 > > sq_psn: 15025863 > > dest_qp_num: 10814473 Can you verify the byte order for the values above? I didn't see the local qp_num listed. > pkey_index: 3 This isn't necessarily a problem, but I usually see a pkey index of 0. Are you running with multiple partitions on your subnet? What pkey value does this equate to? (cat /sys/class/infiniband/mthca0/ports/1/pkeys/3) > port_num: 16384 Is the displayed port_num valid? > Gen1 server QP state. > > qp_num: 10814473 > > rq_psn: 15025863 > > sq_psn: 5571589 > > dest_qp_num: 200 Can you verify the qp_num on the remote side? > pkey_ix: 28 This index looks unlikely. > port: 460 Is this value valid? > retry_count: 0 Try increasing this. - Sean From mshefty at ichips.intel.com Thu Nov 9 10:06:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 09 Nov 2006 10:06:18 -0800 Subject: [openib-general] Fwd: IPoIB new multicast API patches oops In-Reply-To: <20061109100229.GF14960@mellanox.co.il> References: <20061109100229.GF14960@mellanox.co.il> Message-ID: <45536E1A.80106@ichips.intel.com> Michael S. Tsirkin wrote: > Following Sean's suggestion, I have let the nightly tests run with Roland's mad patch > in addition to Sean's new multicast interface patches (v2), and got the > following crash: I have time now to try reproducing this. Can you describe the test setup some? Was opensm running? Was this after loading/unloading ipoib? Was this while trying to unload ib_sa? - Sean From venkatesh.babu at 3leafnetworks.com Thu Nov 9 11:49:58 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 09 Nov 2006 11:49:58 -0800 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <000001c702e3$27e92110$43d0180a@amr.corp.intel.com> References: <000001c702e3$27e92110$43d0180a@amr.corp.intel.com> Message-ID: <45538666.4050300@3leafnetworks.com> Hi Sean, I have verified your changes and it is working fine. I have tried port failover on both Active and Passive nodes. 
It is working fine. Since you have not provided the ib_sa_serv_notice_hdlr() changes for the remote event notification I am still using my patch. What are your plans for updating that ? How did you tested the failover on the Passive node ? VBabu Sean Hefty wrote: >Memo to me: read comments about missing functionality... > >Fixed an issue with the previous patch not having the right pkey >when forwarding LAP messages to the user. > >With this patch, I'm able to fail over between two paths, reload >a new path, and fail again repeatedly using my test program. > >Venkatesh, if you can verify that this code works for you, I will >request that it be queued for 2.6.20. > >Signed-off-by: Sean Hefty >--- >diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >index 1cf0d42..ed69573 100644 >--- a/drivers/infiniband/core/cm.c >+++ b/drivers/infiniband/core/cm.c >@@ -147,12 +147,12 @@ struct cm_id_private { > __be32 rq_psn; > int timeout_ms; > enum ib_mtu path_mtu; >+ __be16 pkey; > u8 private_data_len; > u8 max_cm_retries; > u8 peer_to_peer; > u8 responder_resources; > u8 initiator_depth; >- u8 local_ack_timeout; > u8 retry_count; > u8 rnr_retry_count; > u8 service_timeout; >@@ -691,7 +691,7 @@ static void cm_enter_timewait(struct cm_ > * timewait before notifying the user that we've exited timewait. 
> */ > cm_id_priv->id.state = IB_CM_TIMEWAIT; >- wait_time = cm_convert_to_ms(cm_id_priv->local_ack_timeout); >+ wait_time = cm_convert_to_ms(cm_id_priv->av.packet_life_time + 1); > queue_delayed_work(cm.wq, &cm_id_priv->timewait_info->work.work, > msecs_to_jiffies(wait_time)); > cm_id_priv->timewait_info = NULL; >@@ -1010,6 +1010,7 @@ int ib_send_cm_req(struct ib_cm_id *cm_i > cm_id_priv->responder_resources = param->responder_resources; > cm_id_priv->retry_count = param->retry_count; > cm_id_priv->path_mtu = param->primary_path->mtu; >+ cm_id_priv->pkey = param->primary_path->pkey; > cm_id_priv->qp_type = param->qp_type; > > ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg); >@@ -1024,8 +1025,6 @@ int ib_send_cm_req(struct ib_cm_id *cm_i > > cm_id_priv->local_qpn = cm_req_get_local_qpn(req_msg); > cm_id_priv->rq_psn = cm_req_get_starting_psn(req_msg); >- cm_id_priv->local_ack_timeout = >- cm_req_get_primary_local_ack_timeout(req_msg); > > spin_lock_irqsave(&cm_id_priv->lock, flags); > ret = ib_post_send_mad(cm_id_priv->msg, NULL); >@@ -1410,9 +1409,8 @@ static int cm_req_handler(struct cm_work > cm_id_priv->initiator_depth = cm_req_get_resp_res(req_msg); > cm_id_priv->responder_resources = cm_req_get_init_depth(req_msg); > cm_id_priv->path_mtu = cm_req_get_path_mtu(req_msg); >+ cm_id_priv->pkey = req_msg->pkey; > cm_id_priv->sq_psn = cm_req_get_starting_psn(req_msg); >- cm_id_priv->local_ack_timeout = >- cm_req_get_primary_local_ack_timeout(req_msg); > cm_id_priv->retry_count = cm_req_get_retry_count(req_msg); > cm_id_priv->rnr_retry_count = cm_req_get_rnr_retry_count(req_msg); > cm_id_priv->qp_type = cm_req_get_qp_type(req_msg); >@@ -1716,7 +1714,7 @@ static int cm_establish_handler(struct c > unsigned long flags; > int ret; > >- /* See comment in ib_cm_establish about lookup. */ >+ /* See comment in cm_establish about lookup. 
*/ > cm_id_priv = cm_acquire_id(work->local_id, work->remote_id); > if (!cm_id_priv) > return -EINVAL; >@@ -2402,11 +2400,16 @@ int ib_send_cm_lap(struct ib_cm_id *cm_i > cm_id_priv = container_of(cm_id, struct cm_id_private, id); > spin_lock_irqsave(&cm_id_priv->lock, flags); > if (cm_id->state != IB_CM_ESTABLISHED || >- cm_id->lap_state != IB_CM_LAP_IDLE) { >+ (cm_id->lap_state != IB_CM_LAP_UNINIT && >+ cm_id->lap_state != IB_CM_LAP_IDLE)) { > ret = -EINVAL; > goto out; > } > >+ ret = cm_init_av_by_path(alternate_path, &cm_id_priv->alt_av); >+ if (ret) >+ goto out; >+ > ret = cm_alloc_msg(cm_id_priv, &msg); > if (ret) > goto out; >@@ -2431,7 +2434,8 @@ out: spin_unlock_irqrestore(&cm_id_priv- > } > EXPORT_SYMBOL(ib_send_cm_lap); > >-static void cm_format_path_from_lap(struct ib_sa_path_rec *path, >+static void cm_format_path_from_lap(struct cm_id_private *cm_id_priv, >+ struct ib_sa_path_rec *path, > struct cm_lap_msg *lap_msg) > { > memset(path, 0, sizeof *path); >@@ -2443,10 +2447,10 @@ static void cm_format_path_from_lap(stru > path->hop_limit = lap_msg->alt_hop_limit; > path->traffic_class = cm_lap_get_traffic_class(lap_msg); > path->reversible = 1; >- /* pkey is same as in REQ */ >+ path->pkey = cm_id_priv->pkey; > path->sl = cm_lap_get_sl(lap_msg); > path->mtu_selector = IB_SA_EQ; >- /* mtu is same as in REQ */ >+ path->mtu = cm_id_priv->path_mtu; > path->rate_selector = IB_SA_EQ; > path->rate = cm_lap_get_packet_rate(lap_msg); > path->packet_life_time_selector = IB_SA_EQ; >@@ -2472,7 +2476,7 @@ static int cm_lap_handler(struct cm_work > > param = &work->cm_event.param.lap_rcvd; > param->alternate_path = &work->path[0]; >- cm_format_path_from_lap(param->alternate_path, lap_msg); >+ cm_format_path_from_lap(cm_id_priv, param->alternate_path, lap_msg); > work->cm_event.private_data = &lap_msg->private_data; > > spin_lock_irqsave(&cm_id_priv->lock, flags); >@@ -2480,6 +2484,7 @@ static int cm_lap_handler(struct cm_work > goto unlock; > > switch 
(cm_id_priv->id.lap_state) { >+ case IB_CM_LAP_UNINIT: > case IB_CM_LAP_IDLE: > break; > case IB_CM_MRA_LAP_SENT: >@@ -2502,6 +2507,10 @@ static int cm_lap_handler(struct cm_work > > cm_id_priv->id.lap_state = IB_CM_LAP_RCVD; > cm_id_priv->tid = lap_msg->hdr.tid; >+ cm_init_av_for_response(work->port, work->mad_recv_wc->wc, >+ work->mad_recv_wc->recv_buf.grh, >+ &cm_id_priv->av); >+ cm_init_av_by_path(param->alternate_path, &cm_id_priv->alt_av); > ret = atomic_inc_and_test(&cm_id_priv->work_count); > if (!ret) > list_add_tail(&work->list, &cm_id_priv->work_list); >@@ -3040,7 +3049,7 @@ static void cm_work_handler(void *data) > cm_free_work(work); > } > >-int ib_cm_establish(struct ib_cm_id *cm_id) >+static int cm_establish(struct ib_cm_id *cm_id) > { > struct cm_id_private *cm_id_priv; > struct cm_work *work; >@@ -3088,7 +3097,44 @@ int ib_cm_establish(struct ib_cm_id *cm_ > out: > return ret; > } >-EXPORT_SYMBOL(ib_cm_establish); >+ >+static int cm_migrate(struct ib_cm_id *cm_id) >+{ >+ struct cm_id_private *cm_id_priv; >+ unsigned long flags; >+ int ret = 0; >+ >+ cm_id_priv = container_of(cm_id, struct cm_id_private, id); >+ spin_lock_irqsave(&cm_id_priv->lock, flags); >+ if (cm_id->state == IB_CM_ESTABLISHED && >+ (cm_id->lap_state == IB_CM_LAP_UNINIT || >+ cm_id->lap_state == IB_CM_LAP_IDLE)) { >+ cm_id->lap_state = IB_CM_LAP_IDLE; >+ cm_id_priv->av = cm_id_priv->alt_av; >+ } else >+ ret = -EINVAL; >+ spin_unlock_irqrestore(&cm_id_priv->lock, flags); >+ >+ return ret; >+} >+ >+int ib_cm_notify(struct ib_cm_id *cm_id, enum ib_event_type event) >+{ >+ int ret; >+ >+ switch (event) { >+ case IB_EVENT_COMM_EST: >+ ret = cm_establish(cm_id); >+ break; >+ case IB_EVENT_PATH_MIG: >+ ret = cm_migrate(cm_id); >+ break; >+ default: >+ ret = -EINVAL; >+ } >+ return ret; >+} >+EXPORT_SYMBOL(ib_cm_notify); > > static void cm_recv_handler(struct ib_mad_agent *mad_agent, > struct ib_mad_recv_wc *mad_recv_wc) >@@ -3221,6 +3267,9 @@ static int cm_init_qp_rtr_attr(struct cm > 
if (cm_id_priv->alt_av.ah_attr.dlid) { > *qp_attr_mask |= IB_QP_ALT_PATH; > qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; >+ qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index; >+ qp_attr->alt_timeout = >+ cm_id_priv->alt_av.packet_life_time + 1; > qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; > } > ret = 0; >@@ -3247,19 +3296,31 @@ static int cm_init_qp_rts_attr(struct cm > case IB_CM_REP_SENT: > case IB_CM_MRA_REP_RCVD: > case IB_CM_ESTABLISHED: >- *qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; >- qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn); >- if (cm_id_priv->qp_type == IB_QPT_RC) { >- *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT | >- IB_QP_RNR_RETRY | >- IB_QP_MAX_QP_RD_ATOMIC; >- qp_attr->timeout = cm_id_priv->local_ack_timeout; >- qp_attr->retry_cnt = cm_id_priv->retry_count; >- qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; >- qp_attr->max_rd_atomic = cm_id_priv->initiator_depth; >- } >- if (cm_id_priv->alt_av.ah_attr.dlid) { >- *qp_attr_mask |= IB_QP_PATH_MIG_STATE; >+ if (cm_id_priv->id.lap_state == IB_CM_LAP_UNINIT) { >+ *qp_attr_mask = IB_QP_STATE | IB_QP_SQ_PSN; >+ qp_attr->sq_psn = be32_to_cpu(cm_id_priv->sq_psn); >+ if (cm_id_priv->qp_type == IB_QPT_RC) { >+ *qp_attr_mask |= IB_QP_TIMEOUT | IB_QP_RETRY_CNT | >+ IB_QP_RNR_RETRY | >+ IB_QP_MAX_QP_RD_ATOMIC; >+ qp_attr->timeout = >+ cm_id_priv->av.packet_life_time + 1; >+ qp_attr->retry_cnt = cm_id_priv->retry_count; >+ qp_attr->rnr_retry = cm_id_priv->rnr_retry_count; >+ qp_attr->max_rd_atomic = >+ cm_id_priv->initiator_depth; >+ } >+ if (cm_id_priv->alt_av.ah_attr.dlid) { >+ *qp_attr_mask |= IB_QP_PATH_MIG_STATE; >+ qp_attr->path_mig_state = IB_MIG_REARM; >+ } >+ } else { >+ *qp_attr_mask = IB_QP_ALT_PATH | IB_QP_PATH_MIG_STATE; >+ qp_attr->alt_port_num = cm_id_priv->alt_av.port->port_num; >+ qp_attr->alt_pkey_index = cm_id_priv->alt_av.pkey_index; >+ qp_attr->alt_timeout = >+ cm_id_priv->alt_av.packet_life_time + 1; >+ qp_attr->alt_ah_attr = cm_id_priv->alt_av.ah_attr; 
> qp_attr->path_mig_state = IB_MIG_REARM; > } > ret = 0; >diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c >index ad4f4d5..e04f662 100644 >--- a/drivers/infiniband/core/ucm.c >+++ b/drivers/infiniband/core/ucm.c >@@ -685,11 +685,11 @@ out: > return result; > } > >-static ssize_t ib_ucm_establish(struct ib_ucm_file *file, >- const char __user *inbuf, >- int in_len, int out_len) >+static ssize_t ib_ucm_notify(struct ib_ucm_file *file, >+ const char __user *inbuf, >+ int in_len, int out_len) > { >- struct ib_ucm_establish cmd; >+ struct ib_ucm_notify cmd; > struct ib_ucm_context *ctx; > int result; > >@@ -700,7 +700,7 @@ static ssize_t ib_ucm_establish(struct i > if (IS_ERR(ctx)) > return PTR_ERR(ctx); > >- result = ib_cm_establish(ctx->cm_id); >+ result = ib_cm_notify(ctx->cm_id, (enum ib_event_type) cmd.event); > ib_ucm_ctx_put(ctx); > return result; > } >@@ -1107,7 +1107,7 @@ static ssize_t (*ucm_cmd_table[])(struct > [IB_USER_CM_CMD_DESTROY_ID] = ib_ucm_destroy_id, > [IB_USER_CM_CMD_ATTR_ID] = ib_ucm_attr_id, > [IB_USER_CM_CMD_LISTEN] = ib_ucm_listen, >- [IB_USER_CM_CMD_ESTABLISH] = ib_ucm_establish, >+ [IB_USER_CM_CMD_NOTIFY] = ib_ucm_notify, > [IB_USER_CM_CMD_SEND_REQ] = ib_ucm_send_req, > [IB_USER_CM_CMD_SEND_REP] = ib_ucm_send_rep, > [IB_USER_CM_CMD_SEND_RTU] = ib_ucm_send_rtu, >diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h >index c9b4738..5c07017 100644 >--- a/include/rdma/ib_cm.h >+++ b/include/rdma/ib_cm.h >@@ -60,6 +60,7 @@ enum ib_cm_state { > }; > > enum ib_cm_lap_state { >+ IB_CM_LAP_UNINIT, > IB_CM_LAP_IDLE, > IB_CM_LAP_SENT, > IB_CM_LAP_RCVD, >@@ -443,13 +444,20 @@ int ib_send_cm_drep(struct ib_cm_id *cm_ > u8 private_data_len); > > /** >- * ib_cm_establish - Forces a connection state to established. >+ * ib_cm_notify - Notifies the CM of an event reported to the consumer. > * @cm_id: Connection identifier to transition to established. >+ * @event: Type of event. 
> * >- * This routine should be invoked by users who receive messages on a >- * connected QP before an RTU has been received. >+ * This routine should be invoked by users to notify the CM of relevant >+ * communication events. Events that should be reported to the CM and >+ * when to report them are: >+ * >+ * IB_EVENT_COMM_EST - Used when a message is received on a connected >+ * QP before an RTU has been received. >+ * IB_EVENT_PATH_MIG - Notifies the CM that the connection has failed over >+ * to the alternate path. > */ >-int ib_cm_establish(struct ib_cm_id *cm_id); >+int ib_cm_notify(struct ib_cm_id *cm_id, enum ib_event_type event); > > /** > * ib_send_cm_rej - Sends a connection rejection message to the >diff --git a/include/rdma/ib_user_cm.h b/include/rdma/ib_user_cm.h >index 066c20b..37650af 100644 >--- a/include/rdma/ib_user_cm.h >+++ b/include/rdma/ib_user_cm.h >@@ -38,7 +38,7 @@ #define IB_USER_CM_H > > #include > >-#define IB_USER_CM_ABI_VERSION 4 >+#define IB_USER_CM_ABI_VERSION 5 > > enum { > IB_USER_CM_CMD_CREATE_ID, >@@ -46,7 +46,7 @@ enum { > IB_USER_CM_CMD_ATTR_ID, > > IB_USER_CM_CMD_LISTEN, >- IB_USER_CM_CMD_ESTABLISH, >+ IB_USER_CM_CMD_NOTIFY, > > IB_USER_CM_CMD_SEND_REQ, > IB_USER_CM_CMD_SEND_REP, >@@ -117,8 +117,9 @@ struct ib_ucm_listen { > __u32 reserved; > }; > >-struct ib_ucm_establish { >+struct ib_ucm_notify { > __u32 id; >+ __u32 event; > }; > > struct ib_ucm_private_data { > > > From rdreier at cisco.com Thu Nov 9 11:55:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 11:55:59 -0800 Subject: [openib-general] ANNOUNCE: libmthca 1.0.3 Message-ID: I've just made a new 1.0.3 release of libmthca and pushed it out to the relevant channel, which means that it should appear on http://openib.org/downloads/ shortly. Binary packages will also appear in Debian and Fedora Extras when the builds complete. Changes since 1.0.2 include: - Add Valgrind annotations, enabled with "configure --with-valgrind". 
- fork() support when built against the libibverbs development branch. - Various fixes and cleanups. See the ChangeLog in the package for full details. Thanks, Roland From khali at linux-fr.org Thu Nov 9 12:02:26 2006 From: khali at linux-fr.org (Jean Delvare) Date: Thu, 9 Nov 2006 21:02:26 +0100 Subject: [openib-general] [PATCH] amso1100: Fix && typo Message-ID: <20061109210226.605b4700.khali@linux-fr.org> Fix the AMSO1100 firmware version computation, which was broken due to "&&" being used where "&" should have. Signed-off-by: Jean Delvare --- drivers/infiniband/hw/amso1100/c2_rnic.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- linux-2.6.19-rc5.orig/drivers/infiniband/hw/amso1100/c2_rnic.c 2006-11-09 10:30:33.000000000 +0100 +++ linux-2.6.19-rc5/drivers/infiniband/hw/amso1100/c2_rnic.c 2006-11-09 20:50:28.000000000 +0100 @@ -157,8 +157,8 @@ props->fw_ver = ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | - ((be32_to_cpu(reply->fw_ver_minor) && 0xFFFF) << 16) | - (be32_to_cpu(reply->fw_ver_patch) && 0xFFFF); + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); props->max_mr_size = 0xFFFFFFFF; props->page_size_cap = ~(C2_MIN_PAGESIZE-1); -- Jean Delvare From rdreier at cisco.com Thu Nov 9 12:03:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 12:03:06 -0800 Subject: [openib-general] what should happen in a completion event channel is being destroyed when there are several CQs associated to it? In-Reply-To: <4553480F.80000@dev.mellanox.co.il> (Dotan Barak's message of "Thu, 09 Nov 2006 17:23:59 +0200") References: <4553480F.80000@dev.mellanox.co.il> Message-ID: > What should happen in a completion event channel is being destroyed > when there are several CQs associated to it? > Should this operation fail (return EBUSY)? 
I think that would be the most consistent thing, since we return EBUSY for example if a CQ is destroyed with QPs still attached. > When i tried to do it and later on try to wait for a completion on > this event channel i got seg fault... Does the destroy succeed? Anyway I'll look at this code to see if it seems OK. - R. From sean.hefty at intel.com Thu Nov 9 12:10:01 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Nov 2006 12:10:01 -0800 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <45538666.4050300@3leafnetworks.com> Message-ID: <000001c7043b$09dd7b40$8698070a@amr.corp.intel.com> >I have verified your changes and it is working fine. I have tried port >failover on both Active and Passive nodes. It is working fine. Good to hear. I will request these changes for 2.6.20. >Since you have not provided the ib_sa_serv_notice_hdlr() changes for the >remote event notification I am still using my patch. What are your plans >for updating that ? How did you tested the failover on the Passive node ? Support for InformInfo/Notice will be a separate set of patches. Work on that will follow this. I only tested failover on one side of the connection, because I didn't see a need to test more, and only one side of my test had path records. Once a connection has been established, there's no difference between the active and passive sides. 
- Sean From rdreier at cisco.com Thu Nov 9 12:39:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 12:39:46 -0800 Subject: [openib-general] [PATCH 1/1] Unitialized pseudo_netdev accessed in c2_register_device In-Reply-To: <1163017402.8753.13.camel@trinity.ogc.int> (Tom Tucker's message of "Wed, 08 Nov 2006 14:23:22 -0600") References: <1163017402.8753.13.camel@trinity.ogc.int> Message-ID: thanks, queued for 2.6.19 From rdreier at cisco.com Thu Nov 9 12:40:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 12:40:25 -0800 Subject: [openib-general] [PATCH] amso1100: Fix && typo In-Reply-To: <20061109210226.605b4700.khali@linux-fr.org> (Jean Delvare's message of "Thu, 9 Nov 2006 21:02:26 +0100") References: <20061109210226.605b4700.khali@linux-fr.org> Message-ID: Looks pretty obvious. Tom/Steve, I queued this for 2.6.19, so tell me if it's wrong and I should drop it. - R. From tom at opengridcomputing.com Thu Nov 9 12:41:26 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 09 Nov 2006 14:41:26 -0600 Subject: [openib-general] [PATCH] amso1100: Fix && typo In-Reply-To: <20061109210226.605b4700.khali@linux-fr.org> References: <20061109210226.605b4700.khali@linux-fr.org> Message-ID: <1163104886.19643.48.camel@trinity.ogc.int> Jean: Thanks. I gave this a whirl because I honestly couldn't remember exactly how these numbers were reported by the FW and it seems to work correctly. Roland, can you pull in Jean's patch? Thanks, Tom On Thu, 2006-11-09 at 21:02 +0100, Jean Delvare wrote: > Fix the AMSO1100 firmware version computation, which was broken > due to "&&" being used where "&" should have. 
> > Signed-off-by: Jean Delvare > --- > drivers/infiniband/hw/amso1100/c2_rnic.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > --- linux-2.6.19-rc5.orig/drivers/infiniband/hw/amso1100/c2_rnic.c 2006-11-09 10:30:33.000000000 +0100 > +++ linux-2.6.19-rc5/drivers/infiniband/hw/amso1100/c2_rnic.c 2006-11-09 20:50:28.000000000 +0100 > @@ -157,8 +157,8 @@ > > props->fw_ver = > ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | > - ((be32_to_cpu(reply->fw_ver_minor) && 0xFFFF) << 16) | > - (be32_to_cpu(reply->fw_ver_patch) && 0xFFFF); > + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | > + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); > memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); > props->max_mr_size = 0xFFFFFFFF; > props->page_size_cap = ~(C2_MIN_PAGESIZE-1); > > From rdreier at cisco.com Thu Nov 9 12:42:52 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 12:42:52 -0800 Subject: [openib-general] [PATCH] amso1100: Fix && typo In-Reply-To: <1163104886.19643.48.camel@trinity.ogc.int> (Tom Tucker's message of "Thu, 09 Nov 2006 14:41:26 -0600") References: <20061109210226.605b4700.khali@linux-fr.org> <1163104886.19643.48.camel@trinity.ogc.int> Message-ID: > Roland, can you pull in Jean's patch? Already done, thanks for the ACK though. - R. From ralph.campbell at qlogic.com Thu Nov 9 13:40:54 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Thu, 09 Nov 2006 13:40:54 -0800 Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband Message-ID: <1163108454.2508.29.camel@brick.pathscale.com> This patch adds a new file to the kernel infiniband documentation directory to briefly describe how to use memory regions. Note: I will be on vacation from Nov. 11 through Nov. 26. 
Signed-off-by: Ralph Campbell diff -r b9d92097f918 Documentation/infiniband/memory_regions.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/Documentation/infiniband/memory_regions.txt Wed Nov 08 18:35:46 2006 -0800 @@ -0,0 +1,110 @@ +INFINIBAND MEMORY REGIONS + + This is an overview of memory region usage for the user and kernel + verbs interface. The verbs API to send and receive data does not + specify memory addresses directly. Instead, a memory region + is constructed and a Lkey or Rkey is used to refer to the region. + +User Space Memory Regions + + User space memory regions are created by calling ibv_reg_mr(). + It returns a pointer to a struct ibv_mr which contains the + 'lkey' field and 'rkey' field. The lkey should be copied + into the 'lkey' field of struct ibv_sge when posting buffers + with ibv_post_send(), ibv_post_recv(), and ibv_post_srq_recv(). + The 'addr' field of the ibv_sge should be a user address between + the address and address + length passed to ibv_reg_mr(). + + The 'rkey' can be sent to another process and used by the + remote process in RDMA write, read, and atomic operations + to access the local process' memory region. + The 'remote_addr' field in the ibv_send_wr should be the local + process' address within the memory region. At some point in + the future, the interface may be extended to allow zero based + remote addresses which would mean the remote_addr would be + an offset within the local process' memory region. + + A memory region is destroyed by calling ibv_dereg_mr(). + + Note that creating and destroying memory regions results + in kernel system calls which lock the user's virtual memory + to physical memory. This means the system administrator must set + the RLIMIT memory lock limit high enough for processes to + be able to create memory regions of the desired size. + It is therefore best to limit the size of memory regions created. 
+ +Kernel Space Memory Regions + + ib_get_dma_mr() This function returns a pointer to struct ib_mr + which contains the 'lkey' and 'rkey' fields similar to user + memory regions. The memory region represents all of physical + memory so no base address or length is needed when creating it. + The addresses used for the 'addr' field of struct ib_sge need + to be hardware device addresses suitable for DMA. + Since this mapping may be device specific, there are a set + of kernel verbs functions corresponding to the DMA mapping + functions described in DMA-API.txt. Another useful reference + is the "Linux Device Drivers" book, 3rd edition, by Rubini and Corbet. + + ib_dma_mapping_error() + ib_dma_map_single() + ib_dma_unmap_single() + ib_dma_map_page() + ib_dma_unmap_page() + ib_dma_map_sg() + ib_dma_unmap_sg() + ib_sg_dma_address() + ib_sg_dma_len() + ib_dma_sync_single_for_cpu() + ib_dma_sync_single_for_device() + + Remote processes should use the same address for 'remote_addr' + as the local kernel's address as returned by the mapping functions + listed above. The only difference is the local kernel uses the + 'lkey' and the remote kernel uses the 'rkey'. + + Note that the mapped addresses need to be unmapped after they + are no longer needed. This may require the local and remote + kernels to pass messages at the middle or upper layers to + sychronize. + + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. + It takes an array of device DMA addresses and lengths which are used + to describe the memory region. These addresses are created by + calling the mapping functions listed for ib_get_dma_mr(). + The 'iova' argument is the starting address of the memory region + which should be used with the 'lkey' or 'rkey' returned in the + struct ib_mr. + + ib_dereg_mr() is used to destroy memory regions created by + either ib_get_dma_mr() or ib_reg_phys_mr(). + + ib_alloc_fmr() This returns a pointer to a struct ib_fmr. 
+ The struct ib_fmr_attr argument specifies the size of each + FMR "page" as a power of two in 'page_shift'. This size + is assumed by ib_map_phys_fmr() described below. + A FMR cannot be used until ib_map_phys_fmr() is called. + The 'lkey' and 'rkey' fields are defined in struct ib_fmr + and used the same way as the other memory regions. + + ib_map_phys_fmr() The function takes an array of u64 and a length + for the number of entries in the array. Each u64 value should be + a DMA address created with the mapping functions listed for + ib_get_dma_mr(). The length of each u64 address region is the + FMR page size set when ib_alloc_fmr() was called. + Note that this now defines the memory region to start at address + 'iova' and is the base address used for 'addr' and 'remote_addr'. + The size of the memory region is the array length times the + FMR page size. + + FMR memory regions should be unmapped by calling ib_unmap_fmr() + and then the ib_fmr destroyed by calling ib_dealloc_fmr(). + + See also ib_create_fmr_pool(), ib_fmr_pool_map_phys(), and + ib_fmr_pool_unmap() which are defined in the ib_core module + to assist in caching FMRs. This can help performance when + the same memory is mapped/unmapped frequently. + + Despite the name FMR, the memory region allocation and deallocation + functions perform very differently depending on device, processor, + and platform differences. From venkatesh.babu at 3leafnetworks.com Thu Nov 9 13:42:42 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 09 Nov 2006 13:42:42 -0800 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <000001c7043b$09dd7b40$8698070a@amr.corp.intel.com> References: <000001c7043b$09dd7b40$8698070a@amr.corp.intel.com> Message-ID: <4553A0D2.4080102@3leafnetworks.com> Sean Hefty wrote: >I only tested failover on one side of the connection, because I didn't see a >need to test more, and only one side of my test had path records. 
Once a >connection has been established, there's no difference between the active and >passive sides. > > Yes, only Active side will have the path records. But port may fail on the Passive side too. When a port is failed on the Passive node, active node also need to change the QP state to migrated. Only Active node can reload the alternate path. So if the path comes back on the Passive node, it has to send event notification to the active to reload the alternate path. With my ib_sa_serv_notice_hdlr() and your CM changes I have tested all possible combinations. It was working fine. VBabu >- Sean > > From sean.hefty at intel.com Thu Nov 9 13:48:10 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Nov 2006 13:48:10 -0800 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <4553A0D2.4080102@3leafnetworks.com> Message-ID: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> >Yes, only Active side will have the path records. But port may fail on >the Passive side too. >When a port is failed on the Passive node, active node also need to >change the QP state to migrated. The QP state will automatically change to migrated on both sides of the connection after a failure occurs. There's a delay before you'll see the IB_EVENT_PATH_MIGRATED event on the QP though, so a manual transition of the QP state may be faster, but isn't necessary. For my testing, I waited for both sides to process the IB_EVENT_PATH_MIGRATED event before having the original active side call ib_send_cm_lap(). - Sean From paulus at samba.org Wed Nov 8 22:36:39 2006 From: paulus at samba.org (Paul Mackerras) Date: Thu, 9 Nov 2006 17:36:39 +1100 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: References: Message-ID: <17746.52343.815568.368590@cargo.ozlabs.ibm.com> Christoph Raisch writes: > ioremap maps 4k pages on 4k kernels and on 64k pages on 64k kernels. So far > the theory. 
> > This is true for memory. And for I/O. :) ioremap updates the (Linux) page tables that map the vmalloc/ioremap area, and that is at page granularity. So there is in fact no difference in the end result in the page tables whether you ask to map a small amount inside a page, or the whole page. > On POWER the ebus memory is mapped by H_ENTER. > The hypervisor checks for 4k page size on H_ENTER, reason see above. The next part of the story is that the low-level MMU code on System-P (pSeries) machines only does the H_ENTER when you access an I/O mapping. It does H_ENTER for 4k pages for non-cacheable mappings, and it only does the H_ENTER for the 4k subpages of a 64k page that the kernel actually accesses. So Roland is correct in his comment about how ioremap is called. Regards, Paul. From rdreier at cisco.com Thu Nov 9 13:58:21 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 13:58:21 -0800 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: <17746.52343.815568.368590@cargo.ozlabs.ibm.com> (Paul Mackerras's message of "Thu, 9 Nov 2006 17:36:39 +1100") References: <17746.52343.815568.368590@cargo.ozlabs.ibm.com> Message-ID: > So Roland is correct in his comment about how ioremap is called. Umm, so is this patch really needed? Where did the patch come from -- is it needed to fix something actually seen, or was it written just based on some theoretical understanding? I'm confused... - R. From swise at opengridcomputing.com Thu Nov 9 14:43:08 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 14:43:08 -0800 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: <20061109040037.7062.26245.sendpatchset@localhost.localdomain> References: <20061109040037.7062.26245.sendpatchset@localhost.localdomain> Message-ID: <1163112188.4346.38.camel@linux-q667.site> Tom, can you review at this one? I remember this was sensitive code. 
I don't see the need for the 'empty' variable, but perhaps its plugging a race condition? On Thu, 2006-11-09 at 09:30 +0530, Krishna Kumar wrote: > Get rid of extra call to list_empty(), and unnecessary > variable. Has the side effect of sometimes resulting in > faster processing of new events (like handling new > connections, eg when cm_work_handler was processing the > last entry) added to this list instead of cm_work_handler > function exiting and re-entering when a new queue_work() > is done. > > Doing the redundant queue_work() (if cm_work_handler is > already running and processing the last entry) will not > result in another call to cm_work_handler (run_workqueue) > where no entry is found, since cm_work_handler will remove > all entries from the list, even ones that are added late. > > Signed-off-by: Krishna Kumar > --- > diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c > --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 > +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 > @@ -834,22 +834,17 @@ static void cm_work_handler(void *arg) > struct iw_cm_event levent; > struct iwcm_id_private *cm_id_priv = work->cm_id; > unsigned long flags; > - int empty; > - int ret = 0; > > spin_lock_irqsave(&cm_id_priv->lock, flags); > - empty = list_empty(&cm_id_priv->work_list); > - while (!empty) { > + while (!list_empty(&cm_id_priv->work_list)) { > work = list_entry(cm_id_priv->work_list.next, > struct iwcm_work, list); > list_del_init(&work->list); > - empty = list_empty(&cm_id_priv->work_list); > levent = work->event; > put_work(work); > spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > - ret = process_event(cm_id_priv, &levent); > - if (ret) { > + if (process_event(cm_id_priv, &levent)) { > set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); > destroy_cm_id(&cm_id_priv->id); > } > > _______________________________________________ > openib-general mailing list > openib-general at 
openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Thu Nov 9 14:46:49 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 14:46:49 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Memory corruption bug in cm_work_handler] Message-ID: <1163112409.4346.43.camel@linux-q667.site> Roland, this fix looks good to me. I don't think it is high severity, so perhaps it can just go into 2.6.20. Krishna, for future patches, please include netdev at vger.kernel.org since this code is now in linux proper. The module in svn is no longer being maintained in svn... Acked-by: Steve Wise -------- Forwarded Message -------- From: Krishna Kumar To: openib-general at openib.org Subject: [openib-general] [PATCH] RDMA/iwcm: Memory corruption bug in cm_work_handler Date: Thu, 09 Nov 2006 09:30:34 +0530 Possible memory corruption scenario : after putting the work entry back on the work_free_list, we call process_event() which dereferences work->event, which could have been modified to another value meanwhile. Patches against 2.6.19-rc4 bits. 
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -830,7 +830,8 @@ static int process_event(struct iwcm_id_ */ static void cm_work_handler(void *arg) { - struct iwcm_work *work = arg, lwork; + struct iwcm_work *work = arg; + struct iw_cm_event levent; struct iwcm_id_private *cm_id_priv = work->cm_id; unsigned long flags; int empty; @@ -843,11 +844,11 @@ static void cm_work_handler(void *arg) struct iwcm_work, list); list_del_init(&work->list); empty = list_empty(&cm_id_priv->work_list); - lwork = *work; + levent = work->event; put_work(work); spin_unlock_irqrestore(&cm_id_priv->lock, flags); - ret = process_event(cm_id_priv, &work->event); + ret = process_event(cm_id_priv, &levent); if (ret) { set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); destroy_cm_id(&cm_id_priv->id); From swise at opengridcomputing.com Thu Nov 9 14:53:21 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 14:53:21 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] Message-ID: <1163112801.4346.47.camel@linux-q667.site> Roland, This fix looks good. IMO it's not high priority for 2.6.19, so 2.6.20 is fine. If anyone thinks otherwise, holler... 
Acked-by: Steve Wise -------- Forwarded Message -------- From: Krishna Kumar To: openib-general at openib.org Subject: [openib-general] [PATCH] RDMA/iwcm: Fix memory leak Date: Thu, 09 Nov 2006 09:30:41 +0530 If we get an IW_CM_EVENT_CONNECT_REQUEST message and encounter an error (not in the LISTEN state, cannot create an id, cannot alloc work_entry, etc.), then the memory allocated by cm_event_handler() in event->private_data is leaked, because cm_work_handler has already put the event on the work_free_list. A high backlog value can allow DoS attacks. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -620,7 +620,7 @@ static void cm_conn_req_handler(struct i spin_lock_irqsave(&listen_id_priv->lock, flags); if (listen_id_priv->state != IW_CM_STATE_LISTEN) { spin_unlock_irqrestore(&listen_id_priv->lock, flags); - return; + goto out; } spin_unlock_irqrestore(&listen_id_priv->lock, flags); @@ -629,7 +629,7 @@ static void cm_conn_req_handler(struct i listen_id_priv->id.context); /* If the cm_id could not be created, ignore the request */ if (IS_ERR(cm_id)) - return; + goto out; cm_id->provider_data = iw_event->provider_data; cm_id->local_addr = iw_event->local_addr; @@ -642,7 +642,7 @@ static void cm_conn_req_handler(struct i if (ret) { iw_cm_reject(cm_id, NULL, 0); iw_destroy_cm_id(cm_id); - return; + goto out; } /* Call the client CM handler */ @@ -654,6 +654,7 @@ static void cm_conn_req_handler(struct i kfree(cm_id); } +out: if (iw_event->private_data_len) kfree(iw_event->private_data); } From 
sean.hefty at intel.com Thu Nov 9 14:56:48 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 9 Nov 2006 14:56:48 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: <1163112801.4346.47.camel@linux-q667.site> Message-ID: <000201c70452$5648cef0$8698070a@amr.corp.intel.com> > if (iw_event->private_data_len) > kfree(iw_event->private_data); Kfree checks for a null value, so is the private_data_len check necessary? - Sean From rdreier at cisco.com Thu Nov 9 14:59:00 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 14:59:00 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: <000201c70452$5648cef0$8698070a@amr.corp.intel.com> (Sean Hefty's message of "Thu, 9 Nov 2006 14:56:48 -0800") References: <000201c70452$5648cef0$8698070a@amr.corp.intel.com> Message-ID: > > if (iw_event->private_data_len) > > kfree(iw_event->private_data); > > Kfree checks for a null value, so is the private_data_len check necessary? Could private_data be a junk pointer if private_data_len == 0 ? - R. From swise at opengridcomputing.com Thu Nov 9 14:59:42 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 14:59:42 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Rewrite comment for iwcm_deref_id() to match code.] Message-ID: <1163113183.4346.49.camel@linux-q667.site> Roland, Comment cleanup for 2.6.20. Acked-by: Steve Wise -------- Forwarded Message -------- From: Krishna Kumar To: openib-general at openib.org Subject: [openib-general] [PATCH] RDMA/iwcm: Rewrite comment for iwcm_deref_id() to match code. Date: Thu, 09 Nov 2006 09:30:48 +0530 In iwcm_deref_id(), the comment says : "If the last reference is being removed and iw_destroy_cm_id is waiting, wake up the waiting thread". 
The second part of the comment "and iw_destroy_cm_id is waiting" is wrong, since this function either wakes the waiter already waiting in iwcm_deref_id, or enables it (so that when wait_for_completion() is performed later, it will immediately return). Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -148,8 +148,9 @@ static int copy_private_data(struct iw_c } /* - * Release a reference on cm_id. If the last reference is being removed - * and iw_destroy_cm_id is waiting, wake up the waiting thread. + * Release a reference on cm_id. If the last reference is being + * released, enable the waiting thread (in iw_destroy_cm_id) to + * get woken up, and return 1 if a thread is already waiting. */ static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) { From swise at opengridcomputing.com Thu Nov 9 15:01:09 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 15:01:09 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Remove unnecessary function argument.] Message-ID: <1163113269.4346.51.camel@linux-q667.site> Roland, Another small cleanup for 2.6.20. Acked-by: Steve Wise -------- Forwarded Message -------- From: Krishna Kumar To: openib-general at openib.org Subject: [openib-general] [PATCH] RDMA/iwcm: Remove unnecessary function argument. Date: Thu, 09 Nov 2006 09:30:45 +0530 Remove unnecessary function argument, and change text to reflect the code. Fix couple of typos. 
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -80,7 +80,7 @@ struct iwcm_work { * 1) in the event upcall, cm_event_handler(), for a listening cm_id. If * the backlog is exceeded, then no more connection request events will * be processed. cm_event_handler() returns -ENOMEM in this case. Its up - * to the provider to reject the connectino request. + * to the provider to reject the connection request. * 2) in the connection request workqueue handler, cm_conn_req_handler(). * If work elements cannot be allocated for the new connect request cm_id, * then IWCM will call the provider reject method. This is ok since @@ -131,12 +131,11 @@ static int alloc_work_entries(struct iwc } /* - * Save private data from incoming connection requests in the - * cm_id_priv so the low level driver doesn't have to. Adjust + * Save private data from incoming connection requests to + * iw_cm_event, so the low level driver doesn't have to. Adjust * the event ptr to point to the local copy. */ -static int copy_private_data(struct iwcm_id_private *cm_id_priv, - struct iw_cm_event *event) +static int copy_private_data(struct iw_cm_event *event) { void *p; @@ -243,7 +242,7 @@ static int iwcm_modify_qp_sqd(struct ib_ /* * CM_ID <-- CLOSING * - * Block if a passive or active connection is currenlty being processed. Then + * Block if a passive or active connection is currently being processed. 
Then * process the event as follows: * - If we are ESTABLISHED, move to CLOSING and modify the QP state * based on the abrupt flag @@ -903,7 +902,7 @@ static int cm_event_handler(struct iw_cm if ((work->event.event == IW_CM_EVENT_CONNECT_REQUEST || work->event.event == IW_CM_EVENT_CONNECT_REPLY) && work->event.private_data_len) { - ret = copy_private_data(cm_id_priv, &work->event); + ret = copy_private_data(&work->event); if (ret) { put_work(work); goto out; _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Thu Nov 9 15:03:13 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 15:03:13 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Remove un-required initializations.] Message-ID: <1163113393.4346.54.camel@linux-q667.site> Roland, Code cleanup for 2.6.20. Acked-by: Steve Wise -------- Forwarded Message -------- From: Krishna Kumar To: openib-general at openib.org Subject: [openib-general] [PATCH] RDMA/iwcm: Remove un-required initializations. Date: Thu, 09 Nov 2006 09:30:43 +0530 Remove un-required initializations. 
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -408,7 +408,7 @@ int iw_cm_listen(struct iw_cm_id *cm_id, { struct iwcm_id_private *cm_id_priv; unsigned long flags; - int ret = 0; + int ret; cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); @@ -535,7 +535,7 @@ EXPORT_SYMBOL(iw_cm_accept); int iw_cm_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *iw_param) { struct iwcm_id_private *cm_id_priv; - int ret = 0; + int ret; unsigned long flags; struct ib_qp *qp; @@ -675,7 +675,7 @@ static int cm_conn_est_handler(struct iw struct iw_cm_event *iw_event) { unsigned long flags; - int ret = 0; + int ret; spin_lock_irqsave(&cm_id_priv->lock, flags); @@ -705,7 +705,7 @@ static int cm_conn_rep_handler(struct iw struct iw_cm_event *iw_event) { unsigned long flags; - int ret = 0; + int ret; spin_lock_irqsave(&cm_id_priv->lock, flags); /* _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Thu Nov 9 15:05:20 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Thu, 09 Nov 2006 15:05:20 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: References: <000201c70452$5648cef0$8698070a@amr.corp.intel.com> Message-ID: <1163113521.4346.55.camel@linux-q667.site> I think the semantics are that the pointer is only used if private_data_len > 0. Otherwise, it is undefined. So I think we should keep the check. Plus I don't like calling kfree() with a NULL pointer. It just seems wrong... 
;-) On Thu, 2006-11-09 at 14:59 -0800, Roland Dreier wrote: > > > if (iw_event->private_data_len) > > > kfree(iw_event->private_data); > > > > Kfree checks for a null value, so is the private_data_len check necessary? > > Could private_data be a junk pointer if private_data_len == 0 ? > > - R. From rdreier at cisco.com Thu Nov 9 15:10:37 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 15:10:37 -0800 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: <1163113521.4346.55.camel@linux-q667.site> (Steve WIse's message of "Thu, 09 Nov 2006 15:05:20 -0800") References: <000201c70452$5648cef0$8698070a@amr.corp.intel.com> <1163113521.4346.55.camel@linux-q667.site> Message-ID: > I think the semantics are that the pointer is only used if > private_data_len > 0. Otherwise, it is undefined. So I think we should > keep the check. Plus I don't like calling kfree() with a NULL pointer. > It just seems wrong... Well, the first half definitely justifies leaving the check. However you're wrong about kfree(NULL) :) Every time you write if (foo) kfree(foo); a kitten is killed... Seriously, the check is pure bloat that wastes instruction cache, etc. - R. 
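The kfree(NULL) point above can be tried out in userspace, where free(NULL) is likewise defined to do nothing. What follows is only a minimal sketch under stated assumptions: struct demo_event and both helpers are hypothetical stand-ins that mimic the private_data/private_data_len pairing, not the iwcm code, and the sketch assumes the pointer is always either NULL or an owned allocation. It does not cover the other case raised in the thread, where a provider leaves the pointer uninitialized and only the length is trustworthy.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an event carrying an optional payload. */
struct demo_event {
	void *private_data;      /* NULL when there is no payload */
	size_t private_data_len; /* 0 when there is no payload */
};

/* Copy a payload into the event; the pointer stays NULL when len == 0. */
static int demo_event_set(struct demo_event *ev, const void *src, size_t len)
{
	ev->private_data = NULL;
	ev->private_data_len = 0;
	if (len == 0)
		return 0;
	ev->private_data = malloc(len);
	if (!ev->private_data)
		return -1;
	memcpy(ev->private_data, src, len);
	ev->private_data_len = len;
	return 0;
}

/* free(NULL) is a no-op, so neither an "if (ptr)" nor an "if (len)" guard is needed. */
static void demo_event_release(struct demo_event *ev)
{
	free(ev->private_data);
	ev->private_data = NULL;
	ev->private_data_len = 0;
}
```

Calling demo_event_release() twice in a row, or on an event that never had a payload, is harmless under the stated invariant. If the pointer may be junk whenever the length is zero, the length check has to stay.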
From hch at lst.de Thu Nov 9 15:50:09 2006 From: hch at lst.de (Christoph Hellwig) Date: Fri, 10 Nov 2006 00:50:09 +0100 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <15ddcffd0611051325p7546ef75qe5a6af74fe97b56f@mail.gmail.com> References: <1162506570.29948.567.camel@brick.pathscale.com> <20061102231404.GC15403@mellanox.co.il> <454DE183.80405@voltaire.com> <15ddcffd0611051325p7546ef75qe5a6af74fe97b56f@mail.gmail.com> Message-ID: <20061109235009.GA13264@lst.de> On Sun, Nov 05, 2006 at 11:25:27PM +0200, Or Gerlitz wrote: > On 11/5/06, Roland Dreier wrote: > > > I have mentioned this to Ralph in the past, just want to get ack/nak > > > on that from you: also on 64bit arch a block driver (eg SCSI LLD eg > > > SRP/iSER/etc) might get from higher level an SG whose pages are > > > **not** mapped into the kernel virtual address space. For example this > > > can happen with Direct I/O. > > > >No, I don't see how that could happen. Aren't all pages always mapped > >by the the kernel direct mapping on 64-bit architectures? > > I don't know exactly how this happens, but one of the comments i've > got from Christoph > on the iser code, is that one can't assume page_address(sg[i].page) > will not be NULL for SG passed to a SCSI LLD, i think Direct I/O is > one flow where this might happen. That statement is indeed true. Only for GFP_KERNEL allocations you can assume page_address is valid, and the scatterlist passed to a SCSI LLDD can contain any type of pages. Currently on all 64bit architectures page_address works on all pages, but that's an implementation detail that could change any time and that you should not rely on. From bos at pathscale.com Thu Nov 9 16:07:43 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Thu, 09 Nov 2006 16:07:43 -0800 Subject: [openib-general] amso1100 bug found by checker Message-ID: <4553C2CF.1030708@pathscale.com> I've filed the bug at kernel.org. 
It looks easy to fix. Please take ownership, if you will: http://bugzilla.kernel.org/show_bug.cgi?id=7478 We just had an "internal parity error" on a mellanox HCA. The HCA recovered. However, IPoIB did not fair as well. We are not sure of the details. What I have on the console is: 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: Catastrophic error detected: internal parity error 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[00]: 05000014 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[01]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[02]: 00196240 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[03]: 00126618 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[04]: 00206128 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[05]: 001d6ff8 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[06]: ffffffff 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[07]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[08]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[09]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0a]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0b]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0c]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0d]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0e]: 00000000 2006-11-09 15:20:05 ib_mthca 0000:07:00.0: buf[0f]: 00000000 2006-11-09 15:20:05 divert: no divert_blk to free, ib0 not ethernet 2006-11-09 15:20:05 divert: no divert_blk to free, ib1 not ethernet ifconfig showed ib0 as "gone" (as in not listed). We tried to ifup ib0 and got: # zeus64 /root > ifup ib0 ib_ipoib ib_ipoib device ib0 does not seem to be present, delaying initialization. I then tried to unload the ib_ipoib module and that has hung for the last 15 min. I have run ibv_rc_pingpong and ib_rdma_bw through the node fine. ibstat and ibstatus and the switch show the link to be up. So it appears as though the card recovered fine. What can we do? 
:-/ Thanks, Ira From boris at mellanox.com Thu Nov 9 17:27:41 2006 From: boris at mellanox.com (Boris Shpolyansky) Date: Thu, 9 Nov 2006 17:27:41 -0800 Subject: [openib-general] OFED 1.1 IPoIB did not recover after a mthca catas recovery. Message-ID: <1E3DCD1C63492545881FACB6063A57C16E3F59@mtiexch01.mti.com> Ira, I think our general recommendation is to reboot the machine once the HCA has reported a catastrophic error, since the device is in a fatal state and will not respond to any command from the host. However, the gen-2 driver, i.e. ib_mthca, resets the HCA when it starts, so restarting the driver may serve you just fine (unless you have a persistent HW failure). From what you reported, IPoIB doesn't seem to survive this, so it looks like you still have to reboot your machine. Regards, Boris Shpolyansky Application Engineer Mellanox Technologies Inc. 2900 Stender Way Santa Clara, CA 95054 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com From rdreier at cisco.com Thu Nov 9 17:29:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 17:29:46 -0800 Subject: [openib-general] OFED 1.1 IPoIB did not recover after a mthca catas recovery. In-Reply-To: <20061109164512.3d6e0372.weiny2@llnl.gov> (Ira Weiny's message of "Thu, 9 Nov 2006 16:45:12 -0800") References: <20061109164512.3d6e0372.weiny2@llnl.gov> Message-ID: > What can we do? Something wacky happened I guess. If you still have the system in the state where unloading ib_ipoib hung, could you do echo t > /proc/sysrq-trigger and then send the kernel log with that output? Thanks, Roland From venkatesh.babu at 3leafnetworks.com Thu Nov 9 17:18:19 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 09 Nov 2006 17:18:19 -0800 Subject: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support In-Reply-To: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> References: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> Message-ID: <4553D35B.2040100@3leafnetworks.com> Sean Hefty wrote: >The QP state will automatically change to migrated on both sides of the >connection after a failure occurs. There's a delay before you'll see the >IB_EVENT_PATH_MIGRATED event on the QP though, so a manual transition of the QP >state may be faster, but isn't necessary. > > > At least in OFED 1.0, the QP state was not automatically changing to migrated. I have to manually call ib_modify_qp() to do this. >For my testing, I waited for both sides to process the IB_EVENT_PATH_MIGRATED >event before having the original active side call ib_send_cm_lap(). > > That path might not have come back up when you load the alternate path.
I presume it is possible to load the alternate path even though it is down. If the failover happens before the alternate path comes up, failover fails. It is no different than if it is not loaded. So both your case and my case work the same. VBabu >- Sean > > From chris_youb at yahoo.ca Thu Nov 9 18:01:25 2006 From: chris_youb at yahoo.ca (chris_youb at yahoo.ca) Date: Fri, 10 Nov 2006 10:01:25 +0800 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated Message-ID: <4512144.1163124085768.JavaMail.websites@opensubscriber> Setups: A) Suse 10.0 w/ OFED 1.1 B) Ubuntu 6.10 (native drivers), self-compiled opensm from OFED 1.1 Suse - I've successfully set up opensm on the Suse system and it appears to be running fine (has been for days). Ubuntu - Thanks to Roland D. I set up the ib drivers on a new Ubuntu 6.10 box. I've also compiled opensm from OFED-1.1. It runs, allocates LIDs and otherwise appears OK. But it terminates after a minute as follows: *** stack smashing detected ***: opensm terminated Aborted (core dumped) Observations: On the Suse system I periodically get "MAD received with unsupported base version 0" in the console window but it continues on. On the Ubuntu box I never see them, except in /var/log/messages I get "ib_mad: MAD received with unsupported base version 0" around the same time it crashes. But that could be coincidence. Questions: I looked into the stack smashing message and it appears to be a safety check from gcc, which could be a false positive? Anyways, I am running gcc 4.1.2. Is there a way to: A) confirm if this is an error (what do I need to provide) B) turn off this check via a compiler flag (in the case of a false positive). Thanks.
-- This message was sent on behalf of chris_youb at yahoo.ca at openSubscriber.com http://www.opensubscriber.com/messages/openib-general at openib.org/topic.html From halr at voltaire.com Thu Nov 9 19:47:07 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Nov 2006 05:47:07 +0200 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated References: <4512144.1163124085768.JavaMail.websites@opensubscriber> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018943F4@taurus.voltaire.com> "MAD messages received with unsupported base version 0" means the MADs are somehow corrupted. Current (and only) base version is 1. What HCA are you using? What firmware? As far as the stack smashing goes, it would be nice to know the file and line number where this occurred. There have been several bugs recently fixed which might relate to this. Using the -fstack-protector argument (add to CFLAGS) to compile in automatic buffer overflow protection and seeing what errors or warnings are generated might be instructive. -- Hal From rdreier at cisco.com Thu Nov 9 19:51:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Nov 2006 19:51:46 -0800 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated In-Reply-To: <4512144.1163124085768.JavaMail.websites@opensubscriber> ( chris youb's message of "Fri, 10 Nov 2006 10:01:25 +0800") References: <4512144.1163124085768.JavaMail.websites@opensubscriber> Message-ID: > *** stack smashing detected ***: opensm terminated > Aborted (core dumped) Probably a bug in opensm. Running gdb on the core file and sending the backtrace (output of gdb command "bt") would be useful for fixing this I guess. - R.
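For background on the message discussed in this thread: "stack smashing detected" is what the runtime prints when gcc's stack protector finds that a function's stack canary was overwritten, which usually means something wrote past the end of a local buffer. The sketch below shows that class of bug and a bounded alternative. It is purely illustrative: the opensm defect had not been located at this point in the thread, and demo_copy_name and DEMO_NAME_LEN are hypothetical names, not opensm code.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_NAME_LEN 16

/*
 * Hypothetical helper: copy a description string into a fixed-size
 * buffer. An unbounded strcpy() or sprintf() into a local buffer of
 * this size is the kind of write that can overrun the stack canary;
 * snprintf() never writes more than dst_len bytes.
 * Returns 1 if the input had to be truncated, 0 otherwise.
 */
static int demo_copy_name(char *dst, size_t dst_len, const char *src)
{
	int n = snprintf(dst, dst_len, "%s", src);

	return n < 0 || (size_t)n >= dst_len;
}
```

A caller would declare char buf[DEMO_NAME_LEN], pass sizeof(buf) as dst_len, and treat a return value of 1 as a sign that the source string was longer than expected.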
From krkumar2 at in.ibm.com Thu Nov 9 20:41:46 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 10 Nov 2006 10:11:46 +0530 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: <1163113521.4346.55.camel@linux-q667.site> Message-ID: The amso driver (c2_ae_event) sets private_data and private_data_len together for connect request and connect result, so the check may not be necessary. But if the semantics prefer checking to make sure, we should follow that (esp. if other future drivers may also simply set private_data_len to zero without modifying private_data). I did it this way since cm_conn_rep_handler() had the same check :) thanks, - KK > I think the semantics are that the pointer is only used if > private_data_len > 0. Otherwise, it is undefined. So I think we should > keep the check. Plus I don't like calling kfree() with a NULL pointer. > It just seems wrong... > > ;-) > > > On Thu, 2006-11-09 at 14:59 -0800, Roland Dreier wrote: > > > > if (iw_event->private_data_len) > > > > kfree(iw_event->private_data); > > > > > > Kfree checks for a null value, so is the private_data_len check necessary? > > > > Could private_data be a junk pointer if private_data_len == 0 ? > > > > - R. From tom at opengridcomputing.com Thu Nov 9 20:50:18 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 09 Nov 2006 22:50:18 -0600 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: Message-ID: If it's truly NULL or a ptr, we don't need to (and shouldn't) check, just call kfree. If it's uninitialized, we can't tell anyway and it's a bug -- right? Am I missing something?
On 11/9/06 10:41 PM, "Krishna Kumar2" wrote: > Though the amso driver (c2_ae_event) is setting the private_data and > private_data_len together for connect request and connect result, so > the check may not be necessary. But if the semantics prefer checking > to make sure, we should follow that (esp if other future drivers may > also simply set private_data_len to zero without modifying > private_data). > > I did it this way since cm_conn_rep_handler() had the same check :) > > thanks, > > - KK From krkumar2 at in.ibm.com Thu Nov 9 21:11:07 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 10 Nov 2006 10:41:07 +0530 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: Message-ID: That is valid only if the drivers also comply. E.g., suppose a driver has two stack variables, private_data and private_data_len, and sets only private_data_len to zero. Then, when calling the upper layer, it sets event->private_data to its local private_data (uninitialized) and event->private_data_len to its local private_data_len (zero). Here we have to check private_data_len before touching private_data, or risk a bug/panic. thanks, - KK Tom Tucker wrote on 11/10/2006 10:20:18 AM: > If it's truly nul or a ptr, we don't need to (and shouldn't) check, just > call kfree. If it's unitialized, we can't tell anyway and it's a bug -- > right? > > Am I missing something? From tom at opengridcomputing.com Thu Nov 9 21:38:39 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 09 Nov 2006 23:38:39 -0600 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: Message-ID: Krishna: Maybe I'm missing something, but whether if (len) kfree(ptr) or if (ptr) kfree(ptr) is correct is contingent upon how you couple the two variables. But I don't think this has anything to do with Roland's point. I think "Pandora's box" was opened when Steve suggested that it's just good policy to check for NULL before calling free... and in general it is good defensive programming. However, Roland's point is that in the kernel, it's incumbent upon us all to know and leverage the error checking done by the services we use. If kfree checks for NULL, we don't have to... and shouldn't check it. Kittens are cute... really ... who can argue with that? What 'len' allows us to assume about 'ptr' is a little more ... well... fuzzy. On 11/9/06 11:11 PM, "Krishna Kumar2" wrote: > That is valid only if the drivers also comply. Eg if driver has two > stack variables private_data and private_data_len, and it sets > only private_data_len to zero. Then when calling the upper layer, > it sets the event->private_data to its local private_data (uninitialized) > and event->private_data_len to its local private_data_len (zero). > Here we have to check the private_data_len before touching > private_data or risk bug/panic. > > thanks, > > - KK From xma at us.ibm.com Thu Nov 9 21:41:18 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 9 Nov 2006 21:41:18 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: Message-ID: Roland Dreier wrote on 10/19/2006 09:10:35 PM: Roland, > I looked over my code again, and I don't see anything obviously wrong, > but it's quite possible I made a mistake that I just can't see right > now (like reversing a truth value somewhere). Someone who knows how > ehca works might be able to spot the error. > > - R. Your code is OK. I just found the problem here. + if (empty) { + netif_rx_complete(dev); + ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event); + if (unlikely(missed_event) && netif_rx_reschedule(dev, 0)) + goto repoll; + + return 0; + } netif_rx_complete() should be called right before return. This patch does improve non-scaling performance, but it reduces scaling performance. + if (empty) { + ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event); + if (unlikely(missed_event) && netif_rx_reschedule(dev, 0)) + goto repoll; + netif_rx_complete(dev); + + return 0; + } Is there any other reason for calling netif_rx_complete() while possibly still within NAPI? Thanks Shirley Ma IBM Linux Technology Center -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ak at suse.de Thu Nov 9 22:52:54 2006 From: ak at suse.de (Andi Kleen) Date: Fri, 10 Nov 2006 07:52:54 +0100 Subject: [openib-general] [discuss] Re: 2.6.19-rc4: known unfixed regressions (v3) In-Reply-To: References: Message-ID: <200611100752.55517.ak@suse.de> > So we should have the infrastructure to only use mmconfig for some set > of busses. If that interface is well described we can probably > bootstrap sanely, only enabling what we know exists and like wise > only reserving what we know is used. Unfortunately there is a chicken and egg problem on those few broken systems (like some x86 Macs) where only mcfg works. Without mcfg you won't be able to probe the bus. Ok you could trust ACPI when it says it's there, but I'm not sure Linus would like that. Still perhaps I guess only reserving when the bus is probed is probably a good idea. In most cases we only probe a small number of busses because ACPI tells us the number. This basically means pci_mcfg_init() should be split up. -Andi From RAISCH at de.ibm.com Thu Nov 9 23:49:48 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Fri, 10 Nov 2006 08:49:48 +0100 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: Message-ID: > Umm, so is this patch really needed? Where did the patch come from -- > is it needed to fix something actually seen, or was it written just > based on some theoretical understanding? > > I'm confused... > > - R. The patch is needed. We've seen it on the real system. We did fix it on the real system. ...and it conforms to theory... although theory is a bit confusing here. let me try to summarize: ioremap checks for 64k boundary (actually page boundary) nopage does H_ENTER in 4k granularity if it's configured like that for a certain type of POWER processor. so you have to adjust the ioremap to page boundary, and THEN access at the offset within the 64k. Took quite a while until we understood that code path.... 
;-) Christoph R. From diego.guella at sircomtech.com Fri Nov 10 00:25:00 2006 From: diego.guella at sircomtech.com (Diego Guella) Date: Fri, 10 Nov 2006 09:25:00 +0100 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails References: <003201c70317$36911f40$05c8a8c0@DIEGO> <4551E7C5.8020700@dev.mellanox.co.il> <00a301c70347$accee4f0$05c8a8c0@DIEGO> <45534DD0.3010803@dev.mellanox.co.il> Message-ID: <013601c704a1$b86d11f0$05c8a8c0@DIEGO> Hi Vladimir, Thanks for your answer. I have installed: compat-libstdc++ (version 5.0.7-35) libstdc++-32bit (version 4.1.2_20060705-2) libstdc++41 (version 4.1.2_20061024-3) libstdc++41-devel (version 4.1.2_20061024-3) libstdc++-devel (version 4.1.3-22) but remember that in the log file, first it says (line 6393): ----- checking for C compiler default output file name... a.out ----- and about 5000 lines below, it says my compiler can't create executables (of course this isn't true, because this is the machine on wich I compile all the programs I make) Have you got any other suggestion? Thanks, Diego ----- Original Message ----- From: "Vladimir Sokolovsky" To: "Diego Guella" Cc: "Tziporet Koren" ; Sent: Thursday, November 09, 2006 4:48 PM Subject: Re: [openib-general] Installation on openSUSE 10.2 Beta1 fails > Hello Diego, > Check that you have libstdc++, libstdc++-devel and compat-libstdc++ RPMs > installed. > > Regards, > Vladimir > > Diego Guella wrote: >> >> From: "Tziporet Koren" >>> The failing is utility is used for IPoIB high availability. If you don't >>> need to use them you can just change this line in ofed.conf: >>> ipoibtools=n >>> >>> Tziporet >>> >> Thanks Tziporet for your answer. >> >> >> Tried just right now, i disabled ipoibtools. 
I get another, more strange >> error: >> (attached OFED.3816.log) >> ----- >> /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs >> Running: >> ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib >> CPPFLAGS="-I../libibverbs/include" >> configure: creating cache >> /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> checking for a BSD-compatible install... /usr/bin/install -c >> checking whether build environment is sane... yes >> checking for gawk... gawk >> checking whether make sets $(MAKE)... yes >> checking build system type... x86_64-unknown-linux-gnu >> checking host system type... x86_64-unknown-linux-gnu >> checking for style of include used by make... GNU >> checking for gcc... gcc >> checking for C compiler default output file name... configure: error: C >> compiler cannot create executables >> See `config.log' for more details. >> Failed to execute: >> ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib >> CPPFLAGS="-I../libibverbs/include" >> error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) >> ----- >> >> Am I right? It says my C compiler cannot create executables???? Is it >> joking me???? >> In the log file, line 6393, it says: >> ----- >> checking for C compiler default output file name... a.out >> ----- >> >> I don't understand....! >> Is there something I can do to fix this? 
>> >> >> Thanks, >> Diego >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > From paulus at samba.org Fri Nov 10 00:46:10 2006 From: paulus at samba.org (Paul Mackerras) Date: Fri, 10 Nov 2006 19:46:10 +1100 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: References: Message-ID: <17748.15442.906060.210242@cargo.ozlabs.ibm.com> Christoph Raisch writes: > The patch is needed. We've seen it on the real system. We did fix it on the > real system. I disagree that the ioremap change is needed. > ...and it conforms to theory... although theory is a bit confusing here. > > let me try to summarize: > ioremap checks for 64k boundary (actually page boundary) Actually, ioremap itself already does the calculations that your patch adds - that is, it generates the offset within the page and the physical address of the start of the page, does the mapping using the latter, then adds on the offset to the virtual address of the page and returns that. Paul. From johnt1johnt2 at gmail.com Fri Nov 10 02:02:34 2006 From: johnt1johnt2 at gmail.com (john t) Date: Fri, 10 Nov 2006 15:32:34 +0530 Subject: [openib-general] BandWidth doubt Message-ID: Hi, I got following readings in one of my experiments: Single 64-bit xeon machine (2 dual-core 3.2 GHz Intel CPUs, linux FC4, OFED 1.0) with two Mellanox DDR (4x) HCAs (each having two ports and each connected to a PCI x8 interface) is connected to a switch (all the 4 DDR (4x) ports are connected to the switch). If I send data from mthca0-1 to mthca0-1 meaning from same port to the same port i.e. same port doing send/recv (also same cable doing send/recv) I get a BW of around 10 Gb/sec. 
Similarly, from mthca1-1 to mthca1-1 I get the same, i.e. around 10 Gb/sec. So, individual port-to-port gives 10 Gb/sec. But when I use them together, i.e. when I send the data from mthca0-1 to mthca0-1 AND from mthca1-1 to mthca1-1 at the same time (simultaneously), I get a BW of 6.7 Gb/sec on each port. This is less than the expected 10 Gb/sec. Note that mthca0 and mthca1 are connected to two different PCI-x8 interfaces, so there is no question of bandwidth splitting. What could be causing such behaviour? Just to add: if the same thing is done between two different hosts, i.e. if I send data from mthca0-0 and mthca1-1 of one host to mthca0-0 and mthca1-1 of the other host, I get the expected BW, i.e. 10 Gb/sec on each port/link. Regards, John From rdreier at cisco.com Fri Nov 10 07:00:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Nov 2006 07:00:46 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: (Shirley Ma's message of "Thu, 9 Nov 2006 21:41:18 -0800") References: Message-ID: I think it has to stay the way I wrote it. Your version: + if (empty) { + ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP, &missed_event); + if (unlikely(missed_event) && netif_rx_reschedule(dev, 0)) + goto repoll; + netif_rx_complete(dev); + + return 0; + } has a race: suppose missed_event is 0 but an event _is_ generated right before the call to netif_rx_complete(). Then the interrupt handler might run before the call to netif_rx_complete(), try to schedule the NAPI poll, but end up doing nothing because the poll routine is still running. Then the poll routine will call netif_rx_complete() and return 0, so it won't get called again ever (because the CQ event has already fired). And so the interface will hang and never make any more progress. I would really like to understand why ehca does worse with NAPI.
In my tests both mthca and ipath exhibit various degrees of improvement depending on the test -- but I've never seen performance get worse. This is the main thing holding back merging NAPI. Does the NAPI patch help mthca on pSeries? I wonder if it's not ehca, but rather that there's some ppc64 quirk that makes NAPI a lot more expensive. - R. From rdreier at cisco.com Fri Nov 10 07:08:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Nov 2006 07:08:06 -0800 Subject: [openib-general] [PATCH] 1/3 [core] Added support to IB_EVENT_GID_CHANGE async event In-Reply-To: <1162986489.12259.6.camel@mtls05.yok.mtl.com> (Dotan Barak's message of "Wed, 08 Nov 2006 13:48:09 +0200") References: <1162986489.12259.6.camel@mtls05.yok.mtl.com> Message-ID: So this GID change event would be an extension to what the verbs spec defines. What is the motivation for adding this? A GID change doesn't seem particularly important to consumers. - R. From jlentini at netapp.com Fri Nov 10 07:34:38 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 10 Nov 2006 10:34:38 -0500 (EST) Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: <1163108454.2508.29.camel@brick.pathscale.com> References: <1163108454.2508.29.camel@brick.pathscale.com> Message-ID: Ralph, Thanks for writing this! It will be a big help to ULP authors. A few suggestions below: On Thu, 9 Nov 2006, Ralph Campbell wrote: > This patch adds a new file to the kernel infiniband documentation > directory to briefly describe how to use memory regions. > > Note: I will be on vacation from Nov. 11 through Nov. 26. 
> > Signed-off-by: Ralph Campbell > > diff -r b9d92097f918 Documentation/infiniband/memory_regions.txt > --- /dev/null Thu Jan 01 00:00:00 1970 +0000 > +++ b/Documentation/infiniband/memory_regions.txt Wed Nov 08 18:35:46 2006 -0800 > @@ -0,0 +1,110 @@ > +Kernel Space Memory Regions > + > + ib_get_dma_mr() This function returns a pointer to struct ib_mr ^ a > + which contains the 'lkey' and 'rkey' fields similar to user > + memory regions. The memory region represents all of physical > + memory so no base address or length is needed when creating it. > + The addresses used for the 'addr' field of struct ib_sge need > + to be hardware device addresses suitable for DMA. ^ access by RDMA devices. > + Since this mapping may be device specific, there are a set > + of kernel verbs functions corresponding to the DMA mapping > + functions described in DMA-API.txt. Another useful reference > + is the "Linux Device Drivers" book, 3rd edition, by Rubini and Corbet. > + > + ib_dma_mapping_error() > + ib_dma_map_single() > + ib_dma_unmap_single() > + ib_dma_map_page() > + ib_dma_unmap_page() > + ib_dma_map_sg() > + ib_dma_unmap_sg() > + ib_sg_dma_address() > + ib_sg_dma_len() > + ib_dma_sync_single_for_cpu() > + ib_dma_sync_single_for_device() > + > + Remote processes should use the same address for 'remote_addr' > + as the local kernel's address as returned by the mapping functions > + listed above. The only difference is the local kernel uses the > + 'lkey' and the remote kernel uses the 'rkey'. I found the above paragraph difficult to understand on the first read. How about structuring the text similar to the user space explanation above: The addresses returned by these mapping functions should be used for a struct ib_send_wr's 'remote_addr' field with the appropriate rkey. > + Note that the mapped addresses need to be unmapped after they > + are no longer needed. 
This may require the local and remote > + kernels to pass messages at the middle or upper layers to > + sychronize. > + > + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. > + It takes an array of device DMA addresses and lengths which are used > + to describe the memory region. These addresses are created by > + calling the mapping functions listed for ib_get_dma_mr(). The iova is an in/out parameter. I recommend including a description of how to initialize it: The 'iova' argument can be used by the caller to request an address to associate with the first byte of the address region. Upon return, the 'iova' argument is the ... > + The 'iova' argument is the starting address of the memory region > + which should be used with the 'lkey' or 'rkey' returned in the > + struct ib_mr. > + > + ib_dereg_mr() is used to destroy memory regions created by > + either ib_get_dma_mr() or ib_reg_phys_mr(). > + > + ib_alloc_fmr() This returns a pointer to a struct ib_fmr. > + The struct ib_fmr_attr argument specifies the size of each > + FMR "page" as a power of two in 'page_shift'. This size > + is assumed by ib_map_phys_fmr() described below. > + A FMR cannot be used until ib_map_phys_fmr() is called. > + The 'lkey' and 'rkey' fields are defined in struct ib_fmr > + and used the same way as the other memory regions. > + > + ib_map_phys_fmr() The function takes an array of u64 and a length > + for the number of entries in the array. Each u64 value should be > + a DMA address created with the mapping functions listed for > + ib_get_dma_mr(). The length of each u64 address region is the > + FMR page size set when ib_alloc_fmr() was called. > + Note that this now defines the memory region to start at address > + 'iova' and is the base address used for 'addr' and 'remote_addr'. > + The size of the memory region is the array length times the > + FMR page size. 
> + > + FMR memory regions should be unmapped by calling ib_unmap_fmr() > + and then the ib_fmr destroyed by calling ib_dealloc_fmr(). > + > + See also ib_create_fmr_pool(), ib_fmr_pool_map_phys(), and > + ib_fmr_pool_unmap() which are defined in the ib_core module > + to assist in caching FMRs. This can help performance when > + the same memory is mapped/unmapped frequently. > + > + Despite the name FMR, the memory region allocation and deallocation > + functions perform very differently depending on device, processor, > + and platform differences. From tom at opengridcomputing.com Fri Nov 10 07:38:23 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Nov 2006 09:38:23 -0600 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: <20061109040037.7062.26245.sendpatchset@localhost.localdomain> References: <20061109040037.7062.26245.sendpatchset@localhost.localdomain> Message-ID: <1163173103.30738.14.camel@trinity.ogc.int> This patch adds a race that causes an event to be processed twice. Each cm_id has a list of pending work stuck on cm_id_priv->work_list. When new work is created, cm_event_handler needs to know whether it needs to schedule a call to cm_work_handler by enqueuing work on the iwcm_wq or whether it can just append it to the list of work already scheduled. The race is that the cm_work_handler removes the last element on the list (leaving it empty) and then releases the lock. The cm_event_handler checks the list, finds it empty and so schedules another call to cm_work_handler (which is still running).
cm_work_handler processes the work (previous end of list) and then takes the lock again, checks the list and voila, finds the work just added by cm_event_handler, so he processes it. But the work is still queued on the iwcm_wq. So when the cm_work_handler runs again, it processes that same work element a second time -- causing all kinds of misadventure. Tom On Thu, 2006-11-09 at 09:30 +0530, Krishna Kumar wrote: > Get rid of extra call to list_empty(), and unnecessary > variable. Has the side effect of sometimes resulting in > faster processing of new events (like handling new > connections, eg when cm_work_handler was processing the > last entry) added to this list instead of cm_work_handler > function exiting and re-entering when a new queue_work() > is done. > > Doing the redundant queue_work() (if cm_work_handler is > already running and processing the last entry) will not > result in another call to cm_work_handler (run_workqueue) > where no entry is found, since cm_work_handler will remove > all entries from the list, even ones that are added late. 
> > Signed-off-by: Krishna Kumar > --- > diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c > --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 > +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 > @@ -834,22 +834,17 @@ static void cm_work_handler(void *arg) > struct iw_cm_event levent; > struct iwcm_id_private *cm_id_priv = work->cm_id; > unsigned long flags; > - int empty; > - int ret = 0; > > spin_lock_irqsave(&cm_id_priv->lock, flags); > - empty = list_empty(&cm_id_priv->work_list); > - while (!empty) { > + while (!list_empty(&cm_id_priv->work_list)) { > work = list_entry(cm_id_priv->work_list.next, > struct iwcm_work, list); > list_del_init(&work->list); > - empty = list_empty(&cm_id_priv->work_list); > levent = work->event; > put_work(work); > spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > - ret = process_event(cm_id_priv, &levent); > - if (ret) { > + if (process_event(cm_id_priv, &levent)) { > set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); > destroy_cm_id(&cm_id_priv->id); > } > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From xma at us.ibm.com Fri Nov 10 09:23:49 2006 From: xma at us.ibm.com (Shirley Ma) Date: Fri, 10 Nov 2006 09:23:49 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: Message-ID: > I would really like to understand why ehca does worse with NAPI. In > my tests both mthca and ipath exhibit various degrees of improvement > depending on the test -- but I've never seen performance get worse. > This is the main thing holding back merging NAPI. > > Does the NAPI patch help mthca on pSeries? 
I wonder if it's not ehca, > but rather that there's some ppc64 quirk that makes NAPI a lot more > expensive. > > - R. Got your point. Sorry, I haven't made any big progress yet. What I have found so far is that in the non-scaling code, if I always set missed_event = 0 without peeking for a rotting packet, then NAPI increases performance and reduces CPU utilization. That's the reason I suggested the above change. I haven't yet found the reason the scaling code drops 2/3 of the performance. The NAPI touch test for mthca on POWER performs well, so I don't think it's a ppc64 issue. Thanks Shirley From krause at cup.hp.com Fri Nov 10 09:26:54 2006 From: krause at cup.hp.com (Michael Krause) Date: Fri, 10 Nov 2006 09:26:54 -0800 Subject: [openib-general] BandWidth doubt In-Reply-To: References: Message-ID: <6.2.0.14.2.20061110092502.02580b10@esmail.cup.hp.com> At 02:02 AM 11/10/2006, john t wrote: >Hi, > >I got following readings in one of my experiments: > >Single 64-bit xeon machine (2 dual-core 3.2 GHz Intel CPUs, linux FC4, >OFED 1.0) with two Mellanox DDR (4x) HCAs (each having two ports and each >connected to a PCI x8 interface) is connected to a switch (all the 4 DDR >(4x) ports are connected to the switch). > >If I send data from mthca0-1 to mthca0-1 meaning from same port to the >same port i.e. same port doing send/recv (also same cable doing send/recv) >I get a BW of around 10 Gb/sec. > >Similarly, from mthca1-1 to mthca1-1 I get same i.e. around 10 Gb/sec. > >So, individual port-to-port gives 10 Gb/sec. > >But when I use them together i.e when I send the data from mthca0-1 to >mthca0-1 AND from mthca1-1 to mthca1-1 at the same time (simultaneously) I >get a BW of 6.7 Gb/sec on each port. This is less than 10 Gb/sec that is >expected. Note that mthca0 and mthca1 are connected to two different >PCI-x8 interfaces, so there is no question of bandwidth splitting.
What >could be causing such a behaviour ?? > >Just to add if the same thing is done between two different hosts i.e. If >I send data from mthca0-0 and mthca1-1 of one host to mthca0-0 and >mthca1-1 of other host, I get expected BW i.e. 10 Gb/sec on each port/link. > You have two links pounding on a shared PCIe Root Complex / memory controller. This sounds like a chipset issue not an IB / software issue when it is placed under load. Mike From thomas.bub at thomson.net Fri Nov 10 09:27:32 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Fri, 10 Nov 2006 18:27:32 +0100 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. Message-ID: Sean, I had a bug in my debug printout, so the values at the end were showing up wrong. Enclosed are the values after correcting the printout. All of the values you questioned now look good. I tried to increase the retry_count, but without success; in fact the maximum value I could set was 7. The only thing that worries me is the timeout of 1 for the gen2 stack, which is 12 for the gen1 stack. Is there a way to increase this? Question about PSNs: In my gen1 application there is no place where I explicitly set the PSNs. They are set either by QP creation or by the CM kernel code (I don't know which, and I don't care as long as it works!). In my gen2 code I copied an example where ib_cm_req_param.starting_psn is explicitly set to the qp_num and qp_attr.rq_psn is set to qp_num in the transition from Rts to Rtr. Without that last setting in Rts to Rtr, even gen2 to gen2 does not work. Is there something that I'm missing?
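For reference on the timeout values mentioned above: the QP timeout attribute is an exponent rather than a duration. Per the InfiniBand specification, the local ACK timeout is 4.096 usec * 2^timeout, with 0 meaning no timeout. A minimal sketch of that arithmetic, assuming the usual 5-bit field; the helper name is made up for illustration and is not part of any verbs API:

```c
#include <assert.h>

/*
 * Hypothetical helper: local ACK timeout in nanoseconds for a given
 * QP timeout exponent. 4.096 usec * 2^timeout == 4096 ns << timeout.
 * A value of 0 means "wait indefinitely" and is returned here as 0.
 */
static unsigned long long ack_timeout_ns(unsigned int timeout)
{
    assert(timeout <= 31); /* the attribute is a 5-bit field */
    if (timeout == 0)
        return 0;
    return 4096ULL << timeout;
}
```

Under this formula a timeout of 1 is about 8.2 usec, while a timeout of 12 is about 16.8 msec.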
Thomas Gen1 Bad Server (0xb75f5a80) sv.c:2168 getQpAttributes qp_state: 3 en_sqd_asyn_notif: 0 sq_draining: 0 qp_num: 1442826 remote_atomic_flags: 7 qkey: 0 path_mtu: 4 path_mig_state: 0 rq_psn: 8693334 sq_psn: 3867654 qp_ous_rd_atom: 4 ous_dst_rd_atom: 4 min_rnr_timer: 27 cap.max_oust_wr_sq: 200 cap.max_oust_wr_rq: 200 cap.max_sg_size_sq: 28 cap.max_sg_size_rq: 28 cap.max_inline_data_sq: 460 dest_qp_num: 3867654 sched_queue: 0 pkey_ix: 0 port: 1 av.subnetId: 0 av.guid: 0 av.sl: 0 av.dlid: 3 av.src_path_bits: 105 av.static_rate: 0 av.grh_flag: 0 av.traffic_class: 0 av.hop_limit: 0 av.flow_label: 0 av.sgid_index: 0 av.port: 0 timeout: 12 retry_count: 6 rnr_retry: 6 alt_pkey_ix: 0 alt_port: 0 av.subnetId: 0 av.guid: 0 av.sl: 0 alt_av.dlid: 0 alt_av.src_path_bits: 0 alt_av.static_rate: 0 alt_av.grh_flag: 0 alt_av.traffic_class: 0 alt_av.hop_limit: 0 alt_av.flow_label: 0 alt_av.sgid_index: 0 alt_av.port: 0 alt_timeout: 0 Gen2 Bad Client (0x41001940) sv2.c:1520 queryQpState qp_num: 3867654 qp_state: 3 cur_qp_state: 3 path_mtu: 4 path_mig_state: 0 qkey: 131079 rq_psn: 3867654 sq_psn: 8693334 dest_qp_num: 1442826 qp_access_flags: 14 cap.max_send_wr: 256 cap.max_recv_wr: 256 cap.max_send_sge: 29 cap.max_recv_sge: 30 cap.max_inline_data: 460 ah_attr.grh.dgid.subnet_prefix: 80fe ah_attr.grh.dgid.interface_id: d17960304f10800 ah_attr.grh.flow_label: 0 ah.attr.grh.sgid_index: 0 ah.attr.grh.hop_limit: 64 ah.attr.grh.traffic_class: 0 ah_attr.dlid: 105 ah_attr.sl: 0 ah_attr.src_path_bits: 3 ah_attr.static_rate: 3 ah_attr.is_global: 1 ah_attr.port_num: 1 alt_ah.attr.grh.dgid.subnet_prefix: 0 alt_ah.attr.grh.dgid.interface_id: 0 alt_ah.attr.grh.sgid_index: 0 alt_ah.attr.grh.hop_limit: 0 alt_ah.attr.grh.traffic_class: 0 alt_ah_attr.dlid: 0 alt_ah_attr.sl: 0 alt_ah_attr.src_path_bits: 0 alt_ah_attr.static_rate: 0 alt_ah_attr.is_global: 0 alt_ah_attr.port_num: 0 pkey_index: 0 alt_pkey_index: 0 en_sqd_async_notify: 0 sq_draining: 0 max_rd_atomic: 0 max_dest_rd_atomic: 4 min_rnr_timer: 
4 port_num: 0 timeout: 1 retry_cnt: 12 rnr_retry: 6 alt_port_num: 7 alt_timeout: 0 From greg.lindahl at qlogic.com Fri Nov 10 09:29:21 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri, 10 Nov 2006 09:29:21 -0800 Subject: [openib-general] BandWidth doubt In-Reply-To: References: Message-ID: <20061110172921.GA1118@greglaptop.internal.keyresearch.com> On Fri, Nov 10, 2006 at 03:32:34PM +0530, john t wrote: > If I send data from mthca0-1 to mthca0-1 meaning from same port to the same > port i.e. same port doing send/recv (also same cable doing send/recv) I get > a BW of around 10 Gb/sec. Note that the IB standard says in this case that the adaptor may not send this traffic to the switch. So what you're seeing is a loopback operation inside the HCA or inside the host. -- greg From mshefty at ichips.intel.com Fri Nov 10 09:32:07 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Nov 2006 09:32:07 -0800 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: <1163173103.30738.14.camel@trinity.ogc.int> References: <20061109040037.7062.26245.sendpatchset@localhost.localdomain> <1163173103.30738.14.camel@trinity.ogc.int> Message-ID: <4554B797.5060800@ichips.intel.com> Tom Tucker wrote: > This patch adds a race that causes an event to be processed twice. > > Each cm_id has a list of pending work stuck on cm_id_priv->work_list. > When new work is created, cm_event_handler needs to know whether it > needs to schedule a call to cm_work_handler by enqueing work on the > iwcm_wq or whether it can just append it to the list of work already > scheduled. > > The race is that the cm_work_handler removes the last element on the > list (leaving it empty) and then releases the lock. The cm_event_handler > checks the list, finds it empty and so schedules another call to > cm_work_handler (which is still running). 
cm_work_handler processes the > work (previous end of list) and then takes the lock again, checks the > list and voila, finds the work just added by cm_event_handler, so he > processes it. But the work is still queued on the iwcm_wq. So when the > cm_work_handler runs again, it processes that same work element a second > time -- causing all kinds of misadventure. There may be a race here, but... Why wouldn't the second call into cm_work_handler simply find the list empty on entry into the call? As an alternative, could you defer the list_del_init() call to the end of the loop, which would avoid scheduling cm_work_handler while it's running? - Sean From ralph.campbell at qlogic.com Fri Nov 10 10:13:14 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 10 Nov 2006 10:13:14 -0800 Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: References: <1163108454.2508.29.camel@brick.pathscale.com> Message-ID: <1163182394.2508.42.camel@brick.pathscale.com> On Fri, 2006-11-10 at 10:34 -0500, James Lentini wrote: > Ralph, > > Thanks for writing this! It will be a big help to ULP authors. You are welcome. > A few suggestions below: > > On Thu, 9 Nov 2006, Ralph Campbell wrote: > > > This patch adds a new file to the kernel infiniband documentation > > directory to briefly describe how to use memory regions. > > > > Note: I will be on vacation from Nov. 11 through Nov. 26. > > > > Signed-off-by: Ralph Campbell > > > > diff -r b9d92097f918 Documentation/infiniband/memory_regions.txt > > --- /dev/null Thu Jan 01 00:00:00 1970 +0000 > > +++ b/Documentation/infiniband/memory_regions.txt Wed Nov 08 18:35:46 2006 -0800 > > @@ -0,0 +1,110 @@ > > > +Kernel Space Memory Regions > > + > > + ib_get_dma_mr() This function returns a pointer to struct ib_mr > ^ > a OK. > > + which contains the 'lkey' and 'rkey' fields similar to user > > + memory regions. 
The memory region represents all of physical > > + memory so no base address or length is needed when creating it. > > + The addresses used for the 'addr' field of struct ib_sge need > > + to be hardware device addresses suitable for DMA. > ^ > access by RDMA devices. OK. > > + Since this mapping may be device specific, there are a set > > + of kernel verbs functions corresponding to the DMA mapping > > + functions described in DMA-API.txt. Another useful reference > > + is the "Linux Device Drivers" book, 3rd edition, by Rubini and Corbet. > > + > > + ib_dma_mapping_error() > > + ib_dma_map_single() > > + ib_dma_unmap_single() > > + ib_dma_map_page() > > + ib_dma_unmap_page() > > + ib_dma_map_sg() > > + ib_dma_unmap_sg() > > + ib_sg_dma_address() > > + ib_sg_dma_len() > > + ib_dma_sync_single_for_cpu() > > + ib_dma_sync_single_for_device() > > + > > + Remote processes should use the same address for 'remote_addr' > > + as the local kernel's address as returned by the mapping functions > > + listed above. The only difference is the local kernel uses the > > + 'lkey' and the remote kernel uses the 'rkey'. > > I found the above paragraph difficult to understand on the first > read. How about structuring the text similar to the user space > explanation above: > > The addresses returned by these mapping functions should be > used for a struct ib_send_wr's 'remote_addr' field with the > appropriate rkey. OK. I guess I was trying to point out that the address used with the lkey or rkey will always be the same when accessing the same byte of memory. It is just that the lkey is used when accessing memory locally and the rkey is used to access the memory remotely. > > + Note that the mapped addresses need to be unmapped after they > > + are no longer needed. This may require the local and remote > > + kernels to pass messages at the middle or upper layers to > > + sychronize. > > + > > + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. 
> > + It takes an array of device DMA addresses and lengths which are used > > + to describe the memory region. These addresses are created by > > + calling the mapping functions listed for ib_get_dma_mr(). > > The iova is an in/out parameter. I recommend including a description > of how to initialize it: > > The 'iova' argument can be used by the caller to request an address > to associate with the first byte of the address region. Upon return, > the 'iova' argument is the ... Iova was at one point in time an in/out parameter but was changed to be just an in parameter some time ago. I modified the text as follows: The 'iova' argument can be used by the caller to set the address of the first byte of the address region. The addresses used with struct ib_sge 'addr' and struct ib_send_wr 'remote_addr' are thus 'iova' plus offset within the length of the memory region. The 'lkey' and 'rkey' fields for the above structs should be set with the values returned in the struct ib_mr 'lkey' and 'rkey' fields. I will update the patch and resend it with these changes. From ralph.campbell at qlogic.com Fri Nov 10 10:17:21 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 10 Nov 2006 10:17:21 -0800 Subject: [openib-general] [PATCH v2] IB/documentation - add new file to Documentation/infiniband Message-ID: <1163182641.2508.46.camel@brick.pathscale.com> Here is the updated documentation with changes suggested by James Lentini. Signed-off-by: Ralph Campbell diff -r b9d92097f918 Documentation/infiniband/memory_regions.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/Documentation/infiniband/memory_regions.txt Fri Nov 10 10:11:35 2006 -0800 @@ -0,0 +1,112 @@ +INFINIBAND MEMORY REGIONS + + This is an overview of memory region usage for the user and kernel + verbs interface. The verbs API to send and receive data does not + specify memory addresses directly. Instead, a memory region + is constructed and a Lkey or Rkey is used to refer to the region.
+ +User Space Memory Regions + + User space memory regions are created by calling ibv_reg_mr(). + It returns a pointer to a struct ibv_mr which contains the + 'lkey' field and 'rkey' field. The lkey should be copied + into the 'lkey' field of struct ibv_sge when posting buffers + with ibv_post_send(), ibv_post_recv(), and ibv_post_srq_recv(). + The 'addr' field of the ibv_sge should be a user address between + the address and address + length passed to ibv_reg_mr(). + + The 'rkey' can be sent to another process and used by the + remote process in RDMA write, read, and atomic operations + to access the local process' memory region. + The 'remote_addr' field in the ibv_send_wr should be the local + process' address within the memory region. At some point in + the future, the interface may be extended to allow zero based + remote addresses which would mean the remote_addr would be + an offset within the local process' memory region. + + A memory region is destroyed by calling ibv_dereg_mr(). + + Note that creating and destroying memory regions results + in kernel system calls which lock the user's virtual memory + to physical memory. This means the system administrator must set + the RLIMIT memory lock limit high enough for processes to + be able to create memory regions of the desired size. + It is therefore best to limit the size of memory regions created. + +Kernel Space Memory Regions + + ib_get_dma_mr() This function returns a pointer to a struct ib_mr + which contains the 'lkey' and 'rkey' fields similar to user + memory regions. The memory region represents all of physical + memory so no base address or length is needed when creating it. + The addresses used for the 'addr' field of struct ib_sge need + to be hardware device addresses suitable for access by RDMA devices. + Since this mapping may be device specific, there are a set + of kernel verbs functions corresponding to the DMA mapping + functions described in DMA-API.txt. 
Another useful reference + is the "Linux Device Drivers" book, 3rd edition, by Rubini and Corbet. + + ib_dma_mapping_error() + ib_dma_map_single() + ib_dma_unmap_single() + ib_dma_map_page() + ib_dma_unmap_page() + ib_dma_map_sg() + ib_dma_unmap_sg() + ib_sg_dma_address() + ib_sg_dma_len() + ib_dma_sync_single_for_cpu() + ib_dma_sync_single_for_device() + + The addresses returned by these mapping functions should be + used for a struct ib_send_wr's 'remote_addr' field with the + appropriate rkey. + + Note that the mapped addresses need to be unmapped after they + are no longer needed. This may require the local and remote + kernels to pass messages at the middle or upper layers to + sychronize. + + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. + It takes an array of device DMA addresses and lengths which are used + to describe the memory region. These addresses are created by + calling the mapping functions listed for ib_get_dma_mr(). + The 'iova' argument can be used by the caller to set the address + of the first byte of the address region. The addresses used + with struct ib_sge 'addr' and struct ib_send_wr 'remote_addr' are + thus 'iova' plus offset within the length of the memory region. + The 'lkey' and 'rkey' fields for the above structs should be set + with the values returned in the struct ib_mr 'lkey' and 'rkey' fields. + + ib_dereg_mr() is used to destroy memory regions created by + either ib_get_dma_mr() or ib_reg_phys_mr(). + + ib_alloc_fmr() This returns a pointer to a struct ib_fmr. + The struct ib_fmr_attr argument specifies the size of each + FMR "page" as a power of two in 'page_shift'. This size + is assumed by ib_map_phys_fmr() described below. + A FMR cannot be used until ib_map_phys_fmr() is called. + The 'lkey' and 'rkey' fields are defined in struct ib_fmr + and used the same way as the other memory regions. + + ib_map_phys_fmr() The function takes an array of u64 and a length + for the number of entries in the array. 
Each u64 value should be + a DMA address created with the mapping functions listed for + ib_get_dma_mr(). The length of each u64 address region is the + FMR page size set when ib_alloc_fmr() was called. + Note that this now defines the memory region to start at address + 'iova' and is the base address used for 'addr' and 'remote_addr'. + The size of the memory region is the array length times the + FMR page size. + + FMR memory regions should be unmapped by calling ib_unmap_fmr() + and then the ib_fmr destroyed by calling ib_dealloc_fmr(). + + See also ib_create_fmr_pool(), ib_fmr_pool_map_phys(), and + ib_fmr_pool_unmap() which are defined in the ib_core module + to assist in caching FMRs. This can help performance when + the same memory is mapped/unmapped frequently. + + Despite the name FMR, the memory region allocation and deallocation + functions perform very differently depending on device, processor, + and platform differences. From tom at opengridcomputing.com Fri Nov 10 10:50:00 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Nov 2006 12:50:00 -0600 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: <4554B797.5060800@ichips.intel.com> Message-ID: On 11/10/06 11:32 AM, "Sean Hefty" wrote: > Tom Tucker wrote: >> This patch adds a race that causes an event to be processed twice. >> >> Each cm_id has a list of pending work stuck on cm_id_priv->work_list. >> When new work is created, cm_event_handler needs to know whether it >> needs to schedule a call to cm_work_handler by enqueing work on the >> iwcm_wq or whether it can just append it to the list of work already >> scheduled. >> >> The race is that the cm_work_handler removes the last element on the >> list (leaving it empty) and then releases the lock. The cm_event_handler >> checks the list, finds it empty and so schedules another call to >> cm_work_handler (which is still running). 
cm_work_handler processes the >> work (previous end of list) and then takes the lock again, checks the >> list and voila, finds the work just added by cm_event_handler, so he >> processes it. But the work is still queued on the iwcm_wq. So when the >> cm_work_handler runs again, it processes that same work element a second >> time -- causing all kinds of misadventure. > > There may be a race here, but... Why wouldn't the second call into > cm_work_handler simply find the list empty on entry into the call? Basically, you've got a free work queue element sitting on the iwcm_wq. What typically would happen is you'd end up corrupting the list because the cm_event_handler would enqueue the element, it would get freed in cm_work_handler with put_work, then cm_event_handler would call get_work (getting the one just freed that's also sitting on the iwcm_wq list) and ... bad things happen. > As an > alternative, could you defer the list_del_init() call to the end of the loop, > which would avoid scheduling cm_work_handler while it's running? Yeah, that's a good idea. > > - Sean From tom at opengridcomputing.com Fri Nov 10 11:33:47 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Nov 2006 13:33:47 -0600 Subject: [openib-general] [PATCH 1/1] IWCM cm_work_handler improvements Message-ID: <1163187227.4896.3.camel@trinity.ogc.int> Move the removal of the work queue element to the end of the processing loop. This avoids the race with the cm_event_handler and obviates the need for the local copy of the work structure. I'll be testing this concurrently, but here's the patch for review... 
Signed-off-by: Tom Tucker --- drivers/infiniband/core/iwcm.c | 12 ++++-------- 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c index c3fb304..3a14355 100644 --- a/drivers/infiniband/core/iwcm.c +++ b/drivers/infiniband/core/iwcm.c @@ -830,21 +830,15 @@ static int process_event(struct iwcm_id_ */ static void cm_work_handler(void *arg) { - struct iwcm_work *work = arg, lwork; + struct iwcm_work *work = arg; struct iwcm_id_private *cm_id_priv = work->cm_id; unsigned long flags; - int empty; int ret = 0; spin_lock_irqsave(&cm_id_priv->lock, flags); - empty = list_empty(&cm_id_priv->work_list); - while (!empty) { + while (!list_empty(&cm_id_priv->work_list)) { work = list_entry(cm_id_priv->work_list.next, struct iwcm_work, list); - list_del_init(&work->list); - empty = list_empty(&cm_id_priv->work_list); - lwork = *work; - put_work(work); spin_unlock_irqrestore(&cm_id_priv->lock, flags); ret = process_event(cm_id_priv, &work->event); @@ -863,6 +857,8 @@ static void cm_work_handler(void *arg) return; } spin_lock_irqsave(&cm_id_priv->lock, flags); + list_del_init(&work->list); + put_work(work); } spin_unlock_irqrestore(&cm_id_priv->lock, flags); } From jlentini at netapp.com Fri Nov 10 11:58:23 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 10 Nov 2006 14:58:23 -0500 (EST) Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: <1163182394.2508.42.camel@brick.pathscale.com> References: <1163108454.2508.29.camel@brick.pathscale.com> <1163182394.2508.42.camel@brick.pathscale.com> Message-ID: Replies below: On Fri, 10 Nov 2006, Ralph Campbell wrote: > > > + Remote processes should use the same address for 'remote_addr' > > > + as the local kernel's address as returned by the mapping functions > > > + listed above. The only difference is the local kernel uses the > > > + 'lkey' and the remote kernel uses the 'rkey'. 
> > > > I found the above paragraph difficult to understand on the first > > read. How about structuring the text similar to the user space > > explanation above: > > > > The addresses returned by these mapping functions should be > > used for a struct ib_send_wr's 'remote_addr' field with the > > appropriate rkey. > > OK. > I guess I was trying to point out that the address used with the > lkey or rkey will always be the same when accessing the same byte > of memory. It is just that the lkey is used when accessing memory > locally and the rkey is used to access the memory remotely. That is a good way of looking at it. > > > + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. > > > + It takes an array of device DMA addresses and lengths which are used > > > + to describe the memory region. These addresses are created by > > > + calling the mapping functions listed for ib_get_dma_mr(). > > > > The iova is an in/out parameter. I recommend including a description > > of how to initialize it: > > > > The 'iova' argument can be used by the caller to request an address > > to associate with the first byte of the address region. Upon return, > > the 'iova' argument is the ... > > Iova was at one point in time an in/out parameter but was changed > to be just an in parameter some time ago. At what point was this changed? In the 1.2 IBTA spec, it is described as an in/out. For mthca, the iova value is an input (see mthca_reg_phys_mr). > I modified the text as follows: > > The 'iova' argument can be used by the caller to set the address > of the first byte of the address region. The addresses used > with struct ib_sge 'addr' and struct ib_send_wr 'remote_addr' are > thus 'iova' plus offset within the length of the memory region. > The 'lkey' and 'rkey' fields for the above structs should be set > with the values returned in the struct ib_mr 'lkey' and 'rkey' fields. 
From mshefty at ichips.intel.com Fri Nov 10 12:05:58 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Fri, 10 Nov 2006 12:05:58 -0800 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. In-Reply-To: References: Message-ID: <4554DBA6.5020107@ichips.intel.com> > I tried to increase the retry_count but without success. In fact the > maximum value I could set was 7. Retry_count is a 3 bit value, so 7 is fine. > The only thing that worries me is the timeout of 1 for the gen2 stack > which is 12 for the gen1 stack. > Is there a way to increase this? The gen2 stack calculates this value based on the packet lifetime (+1) in the path record. A value of 1 is about 8 microseconds. How did you obtain the path record that you passed into ib_send_cm_req()? It looks like the value being passed in is 0. I have no idea how gen1 gets this value, but it should be pulling it from the CM REQ message. There's a disconnect between how it gets 12, while the other side has 1, that we probably need to understand. Is the timeout still 1 when connecting gen2 to gen2? > Question about psn: > In my gen1 application I have no place where I explicitly the psn's. The > psn's are either set by qp creation or the cm kernel code (I don't know > and care it works!) > In my gen2 code I copied an example where the > ib_cm_req_param.starting_psn is explicitly set to the qp_num and the > qp_attr.rq_psn is set to qp_num in the transition from Rts to Rtr. > Whithout that last setting in Rts to Rtr even gen2 to gen2 does not > work. > Is there something that I'm missing? Based on your output, your psn values look fine. > ah_attr.is_global: 1 You're not actually trying to go between subnets are you? I think is_global should be 0 here. The gen2 stack sets this based on information from the path record. What value is in the path record hop_limit? (It may help if you just print out the path record passed into ib_send_cm_req().) 
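For reference, the timeout encoding discussed above can be sketched in C. This is only an illustration, assuming the usual IBTA encoding of 4.096 usec * 2^value and the "packet lifetime + 1" derivation described in this message; the helper names are made up for the example and are not functions from the gen1 or gen2 stacks.

```c
#include <assert.h>
#include <stdint.h>

/* Base unit of the encoded timeout, in nanoseconds (4.096 usec). */
#define IB_TIMEOUT_BASE_NS 4096ULL

/*
 * Hypothetical helper: convert a 5-bit encoded timeout to nanoseconds,
 * i.e. 4.096 usec * 2^encoded. Returns 0 for an encoded value of 0,
 * which for the QP timeout attribute means the timer is disabled.
 */
uint64_t ib_timeout_to_ns(unsigned int encoded)
{
    assert(encoded < 32);          /* the field is 5 bits wide */
    if (encoded == 0)
        return 0;
    return IB_TIMEOUT_BASE_NS << encoded;
}

/*
 * Hypothetical helper: derive the timeout from the path record packet
 * lifetime as described in this thread (packet lifetime + 1).
 */
unsigned int timeout_from_packet_lifetime(unsigned int packet_life_time)
{
    return packet_life_time + 1;
}
```

With a packet lifetime of 0 this gives a timeout of 1, or 8192 ns (about 8 microseconds), while the gen1 value of 12 works out to roughly 16.8 milliseconds.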
> ah_attr.port_num: 1 > min_rnr_timer: 4 > port_num: 0 What is "port_num: 0" above? ah_attr.port_num looks correct. Also, for the gen1 side, you displayed "port: 1" and "av.port: 0". I would expect the port number to be >= 1. - Sean From tom at opengridcomputing.com Fri Nov 10 12:26:43 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Fri, 10 Nov 2006 14:26:43 -0600 Subject: [openib-general] [PATCH 1/1] IWCM cm_work_handler improvements In-Reply-To: <1163187227.4896.3.camel@trinity.ogc.int> References: <1163187227.4896.3.camel@trinity.ogc.int> Message-ID: <1163190403.4896.13.camel@trinity.ogc.int> I remember now why the work was removed from the list early. If the connection is going away, the loop exits early and you won't delete the work item and you'll end up leaking the work queue structure. I know what to do, but I don't have time to do it properly because I've got to get on a plane for SC'06. Please ignore this for now and sorry for the patch noise. Tom On Fri, 2006-11-10 at 13:33 -0600, Tom Tucker wrote: > Move the removal of the work queue element to the > end of the processing loop. This avoids the race with > the cm_event_handler and obviates the need for the local > copy of the work structure. > > I'll be testing this concurrently, but here's the patch for review... 
> > Signed-off-by: Tom Tucker > > --- > > drivers/infiniband/core/iwcm.c | 12 ++++-------- > 1 files changed, 4 insertions(+), 8 deletions(-) > > diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c > index c3fb304..3a14355 100644 > --- a/drivers/infiniband/core/iwcm.c > +++ b/drivers/infiniband/core/iwcm.c > @@ -830,21 +830,15 @@ static int process_event(struct iwcm_id_ > */ > static void cm_work_handler(void *arg) > { > - struct iwcm_work *work = arg, lwork; > + struct iwcm_work *work = arg; > struct iwcm_id_private *cm_id_priv = work->cm_id; > unsigned long flags; > - int empty; > int ret = 0; > > spin_lock_irqsave(&cm_id_priv->lock, flags); > - empty = list_empty(&cm_id_priv->work_list); > - while (!empty) { > + while (!list_empty(&cm_id_priv->work_list)) { > work = list_entry(cm_id_priv->work_list.next, > struct iwcm_work, list); > - list_del_init(&work->list); > - empty = list_empty(&cm_id_priv->work_list); > - lwork = *work; > - put_work(work); > spin_unlock_irqrestore(&cm_id_priv->lock, flags); > > ret = process_event(cm_id_priv, &work->event); > @@ -863,6 +857,8 @@ static void cm_work_handler(void *arg) > return; > } > spin_lock_irqsave(&cm_id_priv->lock, flags); > + list_del_init(&work->list); > + put_work(work); > } > spin_unlock_irqrestore(&cm_id_priv->lock, flags); > } > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From ralph.campbell at qlogic.com Fri Nov 10 12:33:27 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 10 Nov 2006 12:33:27 -0800 Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: References: <1163108454.2508.29.camel@brick.pathscale.com> <1163182394.2508.42.camel@brick.pathscale.com> Message-ID: <1163190808.2508.55.camel@brick.pathscale.com> 
On Fri, 2006-11-10 at 14:58 -0500, James Lentini wrote: > Replies below: > > On Fri, 10 Nov 2006, Ralph Campbell wrote: > > > > > > + Remote processes should use the same address for 'remote_addr' > > > > + as the local kernel's address as returned by the mapping functions > > > > + listed above. The only difference is the local kernel uses the > > > > + 'lkey' and the remote kernel uses the 'rkey'. > > > > > > I found the above paragraph difficult to understand on the first > > > read. How about structuring the text similar to the user space > > > explanation above: > > > > > > The addresses returned by these mapping functions should be > > > used for a struct ib_send_wr's 'remote_addr' field with the > > > appropriate rkey. > > > > OK. > > I guess I was trying to point out that the address used with the > > lkey or rkey will always be the same when accessing the same byte > > of memory. It is just that the lkey is used when accessing memory > > locally and the rkey is used to access the memory remotely. > > That is a good way of looking at it. > > > > > > + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. > > > > + It takes an array of device DMA addresses and lengths which are used > > > > + to describe the memory region. These addresses are created by > > > > + calling the mapping functions listed for ib_get_dma_mr(). > > > > > > The iova is an in/out parameter. I recommend including a description > > > of how to initialize it: > > > > > > The 'iova' argument can be used by the caller to request an address > > > to associate with the first byte of the address region. Upon return, > > > the 'iova' argument is the ... > > > > Iova was at one point in time an in/out parameter but was changed > > to be just an in parameter some time ago. > > At what point was this changed? Maybe a year ago. I tried to find it in the SVN log but didn't find which revision changed it. > In the 1.2 IBTA spec, it is described as an in/out. 
> > For mthca, the iova value is an input (see mthca_reg_phys_mr). The IBTA spec. wasn't changed. The open-fabrics code was changed since none of the HCAs returned a modified iova value. Also, there are currently no callers of ib_reg_phys_mr() that I know of. From chris_youb at yahoo.ca Fri Nov 10 12:33:41 2006 From: chris_youb at yahoo.ca (chris_youb at yahoo.ca) Date: Sat, 11 Nov 2006 04:33:41 +0800 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated In-Reply-To: <4512144.1163124085768.JavaMail.websites@opensubscriber> Message-ID: <6923467.1163190820620.JavaMail.websites@opensubscriber> Backtrace related to previous posting of the same title: Core was generated by `opensm -V'. Program terminated with signal 6, Aborted. #0 0xffffe410 in __kernel_vsyscall () (gdb) bt #0 0xffffe410 in __kernel_vsyscall () #1 0xb7e6e770 in raise () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e6fef3 in abort () from /lib/tls/i686/cmov/libc.so.6 #3 0xb7ea3d0b in __fsetlocking () from /lib/tls/i686/cmov/libc.so.6 #4 0xb7f27e41 in __stack_chk_fail () from /lib/tls/i686/cmov/libc.so.6 #5 0xb7fb7834 in __stack_chk_fail_local () from /usr/local/lib/libopensm.so.1 #6 0xb7fb4a83 in osm_dump_service_record (p_log=0x80bd810, p_sr=0x8109108, log_level=8 '\b') at osm_helper.c:1360 #7 0x0807d7f0 in osm_sr_rcv_process_set_method (p_rcv=0x80bd260, p_madw=0x80d0980) at osm_sa_service_record.c:863 #8 0x0807dd91 in osm_sr_rcv_process (p_rcv=0x80bd260, p_madw=0x80d0980) at osm_sa_service_record.c:1083 #9 0x0807e0aa in __osm_sr_rcv_ctrl_disp_callback (context=0x80bd334, p_data=0x80d0980) at osm_sa_service_record_ctrl.c:66 #10 0xb7fa193e in __cl_disp_worker (context=0x80bd840) at cl_dispatcher.c:108 #11 0xb7fa8c49 in __cl_thread_pool_routine (context=0x80bd880) at cl_threadpool.c:79 #12 0xb7fa8a55 in __cl_thread_wrapper (arg=0x80be478) at cl_thread.c:61 #13 0xb7f87504 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0 #14 0xb7f1251e in clone () from 
/lib/tls/i686/cmov/libc.so.6 (gdb) f 6 #6 0xb7fb4a83 in osm_dump_service_record (p_log=0x80bd810, p_sr=0x8109108, log_level=8 '\b') at osm_helper.c:1360 1360 } (gdb) l 1355 cl_ntoh32(p_sr->service_data32[3]), 1356 cl_ntoh64(p_sr->service_data64[0]), 1357 cl_ntoh64(p_sr->service_data64[1]) 1358 ); 1359 } 1360 } 1361 1362 /********************************************************************** 1363 **********************************************************************/ 1364 void (gdb) p p_sr $1 = (const ib_service_record_t * const) 0x8109108 (gdb) p *$1 $3 = {service_id = 6004495675223179280, service_gid = { raw = "þ\200\000\000\000\000\000\000\000\bñ\004\000A\fä", unicast = { prefix = 33022, interface_id = 16432580608706807808}, multicast = { header = "þ\200", raw_group_id = "\000\000\000\000\000\000\000\bñ\004\000A\fä"}}, service_pkey = 65535, resv = 0, service_lease = 4294967295, service_key = "\000\000\000\000\000\000\bñÿÿ\000\000\000\000\000", service_name = "DAPL Address Translation Service", '\0' , service_data8 = '\0' , "ˬ\002d", service_data16 = {61704, 0, 0, 0, 0, 0, 0, 0}, service_data32 = {961696585, 758395440, 879059760, 1953068800}, service_data64 = {26723, 0}} (gdb) p /x $3 $4 = {service_id = 0x53544100e10c0010, service_gid = {raw = {0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}, unicast = {prefix = 0x80fe, interface_id = 0xe40c410004f10800}, multicast = {header = {0xfe, 0x80}, raw_group_id = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}}}, service_pkey = 0xffff, resv = 0x0, service_lease = 0xffffffff, service_key = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_name = {0x44, 0x41, 0x50, 0x4c, 0x20, 0x41, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x20, 0x54, 0x72, 0x61, 0x6e, 0x73, 0x6c, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x20, 0x53, 0x65, 0x72, 0x76, 0x69, 0x63, 0x65, 0x0 }, service_data8 = { 0x0 , 0xc0, 0xa8, 0x2, 0x64}, service_data16 = {0xf108, 0x0, 
0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_data32 = {0x39525349, 0x2d343230, 0x34656330, 0x74697700}, service_data64 = {0x6863, 0x0}} (gdb) $5 = {service_id = 0x53544100e10c0010, service_gid = {raw = {0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}, unicast = {prefix = 0x80fe, interface_id = 0xe40c410004f10800}, multicast = {header = {0xfe, 0x80}, raw_group_id = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}}}, service_pkey = 0xffff, resv = 0x0, service_lease = 0xffffffff, service_key = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_name = {0x44, 0x41, 0x50, 0x4c, 0x20, 0x41, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x20, 0x54, 0x72, 0x61, 0x6e, 0x73, 0x6c, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x20, 0x53, 0x65, 0x72, 0x76, 0x69, 0x63, 0x65, 0x0 }, service_data8 = { 0x0 , 0xc0, 0xa8, 0x2, 0x64}, service_data16 = {0xf108, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_data32 = {0x39525349, 0x2d343230, 0x34656330, 0x74697700}, service_data64 = {0x6863, 0x0}} (gdb) quit -- This message was sent on behalf of chris_youb at yahoo.ca at openSubscriber.com http://www.opensubscriber.com/message/openib-general at openib.org/5325029.html From halr at voltaire.com Fri Nov 10 13:13:49 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 10 Nov 2006 23:13:49 +0200 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated References: <6923467.1163190820620.JavaMail.websites@opensubscriber> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F501894407@taurus.voltaire.com> I think I see the problem. Give me a little time to give you a patch to try. 
-- Hal
From or.gerlitz at gmail.com Fri Nov 10 13:32:57 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Fri, 10 Nov 2006 23:32:57 +0200 Subject: [openib-general] [openfabrics-ewg] RHEL5 and OFED ... In-Reply-To: <1162488225.2898.346.camel@fc6.xsintricity.com> References: <1161155330.2917.511.camel@fc6.xsintricity.com> <20061018072904.GA26507@mellanox.co.il> <1161177058.2917.513.camel@fc6.xsintricity.com> <20061019050907.GA1547@mellanox.co.il> <1161268837.2917.544.camel@fc6.xsintricity.com> <453793A1.8000000@voltaire.com> <1162488225.2898.346.camel@fc6.xsintricity.com> Message-ID: <15ddcffd0611101332n32ab0bddye3016eca9d254b11@mail.gmail.com> On 11/2/06, Doug Ledford wrote: > On Thu, 2006-10-19 at 17:02 +0200, Or Gerlitz wrote: > > Doug Ledford wrote: > > > ...
and reviewing arpingib > > > (which I'm going to remove from the ipoibtools and fix the native arping > > > in RHEL5 to work properly over IB without needing a new flag, the -A or > > > -U flags should be sufficient assuming those modes worked at all over IB > > > which they don't in either the native arping or the patched arpingib in > > > ipoibtools). I should get to it today though. > > > > Would you mind sending the patch to arping for review? > > OK, this patch to arping actually makes it work for me in all modes > (duplicate address detection, arp response, and unsolicited arp > response). You shouldn't need any new flags to arping with this patch, > you should be able to just use the existing modes of operation as they > were intended to make the ipoibha.pl script work. There's still some > debugging printf's in the patch, so don't consider this a final version. > How's it work? The getsockname() function will return the full hw > address if you give it a buffer large enough to do so. So, instead of > allocating a single struct sockaddr_ll for me and he, which caps the > address size at 8 bytes, allocate two and let the extra 12 bytes run > over into the second struct element. Adjust the send_to and recv_from > calls to accommodate this intentional size overrun. Finally, don't > assume the broadcast address is all 1's, use sysfs to get the actual > device broadcast address and convert it from text to binary (which will > accommodate any possible future interface types that similarly don't > have all 1's for broadcast address without requiring any recoding). > That's all I had to do in order to get it to work for me. Hi Doug, sorry for not responding so far; I am a little bit behind on my email. I probably won't be able to look at the patch before SC06. Or.
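The sizing arithmetic in Doug's description (8 bytes of address room in one struct sockaddr_ll, 12 extra bytes spilling into a second one) can be checked with a small C sketch. This is not taken from the arping patch; it assumes a Linux system, a 20-byte IPoIB hardware address, and helper names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <netpacket/packet.h>   /* struct sockaddr_ll (Linux) */

/* Assumed IPoIB link-layer address length in bytes. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Hypothetical helper: bytes available for the hardware address when
 * `count` consecutive struct sockaddr_ll elements are treated as one
 * buffer, with the address starting at sll_addr of the first element.
 */
size_t ll_addr_capacity(size_t count)
{
    assert(count > 0);
    return count * sizeof(struct sockaddr_ll)
           - offsetof(struct sockaddr_ll, sll_addr);
}

/* Hypothetical helper: non-zero if an address of hw_len bytes fits. */
int ll_addr_fits(size_t count, size_t hw_len)
{
    return ll_addr_capacity(count) >= hw_len;
}
```

A single element leaves room for only the 8-byte sll_addr array, so a 20-byte address does not fit; two elements leave enough room, with 20 - 8 = 12 bytes running over into the second element. The buffer length passed to getsockname() would have to cover both elements for the full address to be returned.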
From jlentini at netapp.com Fri Nov 10 14:14:00 2006 From: jlentini at netapp.com (James Lentini) Date: Fri, 10 Nov 2006 17:14:00 -0500 (EST) Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: <1163190808.2508.55.camel@brick.pathscale.com> References: <1163108454.2508.29.camel@brick.pathscale.com> <1163182394.2508.42.camel@brick.pathscale.com> <1163190808.2508.55.camel@brick.pathscale.com> Message-ID: On Fri, 10 Nov 2006, Ralph Campbell wrote: > > > > > + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. > > > > > + It takes an array of device DMA addresses and lengths which are used > > > > > + to describe the memory region. These addresses are created by > > > > > + calling the mapping functions listed for ib_get_dma_mr(). > > > > > > > > The iova is an in/out parameter. I recommend including a description > > > > of how to initialize it: > > > > > > > > The 'iova' argument can be used by the caller to request an address > > > > to associate with the first byte of the address region. Upon return, > > > > the 'iova' argument is the ... > > > > > > Iova was at one point in time an in/out parameter but was changed > > > to be just an in parameter some time ago. > > > > At what point was this changed? > > Maybe a year ago. I tried to find it in the SVN log but > didn't find which revision changed it. > > > In the 1.2 IBTA spec, it is described as an in/out. > > > > For mthca, the iova value is an input (see mthca_reg_phys_mr). > > The IBTA spec. wasn't changed. > The open-fabrics code was changed since none of the HCAs > returned a modified iova value. Where did this change? All the versions of include/rdma/ib_verbs.h (kernel 2.6.18, svn, Roland's git) define ib_reg_phys_mr() as struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, struct ib_phys_buf *phys_buf_array, int num_phys_buf, int mr_access_flags, u64 *iova_start) indicating to me that iova_start is an in/out parameter. 
> Also, there are currently no callers of ib_reg_phys_mr() that I know > of. The NFS-RDMA client calls ib_reg_phys_mr() in certain configurations. See the code at http://www.sourceforge.net/projects/nfs-rdma. From ralph.campbell at qlogic.com Fri Nov 10 14:51:32 2006 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Fri, 10 Nov 2006 14:51:32 -0800 Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: References: <1163108454.2508.29.camel@brick.pathscale.com> <1163182394.2508.42.camel@brick.pathscale.com> <1163190808.2508.55.camel@brick.pathscale.com> Message-ID: <1163199092.2508.61.camel@brick.pathscale.com> On Fri, 2006-11-10 at 17:14 -0500, James Lentini wrote: > > On Fri, 10 Nov 2006, Ralph Campbell wrote: > > > > > > > + ib_reg_phys_mr() This function returns a pointer to struct ib_mr. > > > > > > + It takes an array of device DMA addresses and lengths which are used > > > > > > + to describe the memory region. These addresses are created by > > > > > > + calling the mapping functions listed for ib_get_dma_mr(). > > > > > > > > > > The iova is an in/out parameter. I recommend including a description > > > > > of how to initialize it: > > > > > > > > > > The 'iova' argument can be used by the caller to request an address > > > > > to associate with the first byte of the address region. Upon return, > > > > > the 'iova' argument is the ... > > > > > > > > Iova was at one point in time an in/out parameter but was changed > > > > to be just an in parameter some time ago. > > > > > > At what point was this changed? > > > > Maybe a year ago. I tried to find it in the SVN log but > > didn't find which revision changed it. > > > > > In the 1.2 IBTA spec, it is described as an in/out. > > > > > > For mthca, the iova value is an input (see mthca_reg_phys_mr). > > > > The IBTA spec. wasn't changed. > > The open-fabrics code was changed since none of the HCAs > > returned a modified iova value. 
> > Where did this change? All the versions of include/rdma/ib_verbs.h > (kernel 2.6.18, svn, Roland's git) define ib_reg_phys_mr() as > > struct ib_mr *ib_reg_phys_mr(struct ib_pd *pd, > struct ib_phys_buf *phys_buf_array, > int num_phys_buf, > int mr_access_flags, > u64 *iova_start) > > indicating to me that iova_start is an in/out parameter. > > > Also, there are currently no callers of ib_reg_phys_mr() that I know > > of. > > The NFS-RDMA client calls ib_reg_phys_mr() in certain configurations. > See the code at http://www.sourceforge.net/projects/nfs-rdma. I must have woken up on the wrong side of the bed today :-) Perhaps I am remembering it being changed from in to in/out since the code obviously is in/out now. How about: The 'iova' argument is an in/out parameter which can be used by the caller to request an address to associate with the first byte of the memory region. Upon return, the addresses used with struct ib_sge 'addr' and struct ib_send_wr 'remote_addr' are thus 'iova' plus an offset within the length of the memory region. From halr at voltaire.com Fri Nov 10 15:31:17 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Sat, 11 Nov 2006 01:31:17 +0200 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated References: <6923467.1163190820620.JavaMail.websites@opensubscriber> <5CE025EE7D88BA4599A2C8FEFCF226F501894407@taurus.voltaire.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com> Can you see if this fixes it ? Thanks. 

-- Hal

Index: opensm/osm_helper.c
===================================================================
--- opensm/osm_helper.c (revision 10089)
+++ opensm/osm_helper.c (working copy)
@@ -1264,7 +1264,7 @@
   IN const ib_service_record_t* const p_sr,
   IN const osm_log_level_t log_level )
 {
-  char buf_service_key[33];
+  char buf_service_key[35];
   char buf_service_name[65];

   if( osm_log_is_active( p_log, log_level ) )

________________________________

From: Hal Rosenstock
Sent: Fri 11/10/2006 4:13 PM
To: chris_youb at yahoo.ca
Cc: openib-general at openib.org
Subject: RE: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated

I think I see the problem. Give me a little time to give you a patch to try.

-- Hal

________________________________

From: openib-general-bounces at openib.org on behalf of chris_youb at yahoo.ca
Sent: Fri 11/10/2006 3:33 PM
To: openib-general at openib.org
Subject: Re: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated

Backtrace related to previous posting of the same title:

Core was generated by `opensm -V'.
Program terminated with signal 6, Aborted.
#0 0xffffe410 in __kernel_vsyscall () (gdb) bt #0 0xffffe410 in __kernel_vsyscall () #1 0xb7e6e770 in raise () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e6fef3 in abort () from /lib/tls/i686/cmov/libc.so.6 #3 0xb7ea3d0b in __fsetlocking () from /lib/tls/i686/cmov/libc.so.6 #4 0xb7f27e41 in __stack_chk_fail () from /lib/tls/i686/cmov/libc.so.6 #5 0xb7fb7834 in __stack_chk_fail_local () from /usr/local/lib/libopensm.so.1 #6 0xb7fb4a83 in osm_dump_service_record (p_log=0x80bd810, p_sr=0x8109108, log_level=8 '\b') at osm_helper.c:1360 #7 0x0807d7f0 in osm_sr_rcv_process_set_method (p_rcv=0x80bd260, p_madw=0x80d0980) at osm_sa_service_record.c:863 #8 0x0807dd91 in osm_sr_rcv_process (p_rcv=0x80bd260, p_madw=0x80d0980) at osm_sa_service_record.c:1083 #9 0x0807e0aa in __osm_sr_rcv_ctrl_disp_callback (context=0x80bd334, p_data=0x80d0980) at osm_sa_service_record_ctrl.c:66 #10 0xb7fa193e in __cl_disp_worker (context=0x80bd840) at cl_dispatcher.c:108 #11 0xb7fa8c49 in __cl_thread_pool_routine (context=0x80bd880) at cl_threadpool.c:79 #12 0xb7fa8a55 in __cl_thread_wrapper (arg=0x80be478) at cl_thread.c:61 #13 0xb7f87504 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0 #14 0xb7f1251e in clone () from /lib/tls/i686/cmov/libc.so.6 (gdb) f 6 #6 0xb7fb4a83 in osm_dump_service_record (p_log=0x80bd810, p_sr=0x8109108, log_level=8 '\b') at osm_helper.c:1360 1360 } (gdb) l 1355 cl_ntoh32(p_sr->service_data32[3]), 1356 cl_ntoh64(p_sr->service_data64[0]), 1357 cl_ntoh64(p_sr->service_data64[1]) 1358 ); 1359 } 1360 } 1361 1362 /********************************************************************** 1363 **********************************************************************/ 1364 void (gdb) p p_sr $1 = (const ib_service_record_t * const) 0x8109108 (gdb) p *$1 $3 = {service_id = 6004495675223179280, service_gid = { raw = "þ\200\000\000\000\000\000\000\000\bñ\004\000A\fä", unicast = { prefix = 33022, interface_id = 16432580608706807808}, multicast = { header = "þ\200", raw_group_id 
= "\000\000\000\000\000\000\000\bñ\004\000A\fä"}}, service_pkey = 65535, resv = 0, service_lease = 4294967295, service_key = "\000\000\000\000\000\000\bñÿÿ\000\000\000\000\000", service_name = "DAPL Address Translation Service", '\0' , service_data8 = '\0' , "ÃEUR¨\002d", service_data16 = {61704, 0, 0, 0, 0, 0, 0, 0}, service_data32 = {961696585, 758395440, 879059760, 1953068800}, service_data64 = {26723, 0}} (gdb) p /x $3 $4 = {service_id = 0x53544100e10c0010, service_gid = {raw = {0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}, unicast = {prefix = 0x80fe, interface_id = 0xe40c410004f10800}, multicast = {header = {0xfe, 0x80}, raw_group_id = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}}}, service_pkey = 0xffff, resv = 0x0, service_lease = 0xffffffff, service_key = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_name = {0x44, 0x41, 0x50, 0x4c, 0x20, 0x41, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x20, 0x54, 0x72, 0x61, 0x6e, 0x73, 0x6c, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x20, 0x53, 0x65, 0x72, 0x76, 0x69, 0x63, 0x65, 0x0 }, service_data8 = { 0x0 , 0xc0, 0xa8, 0x2, 0x64}, service_data16 = {0xf108, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_data32 = {0x39525349, 0x2d343230, 0x34656330, 0x74697700}, service_data64 = {0x6863, 0x0}} (gdb) $5 = {service_id = 0x53544100e10c0010, service_gid = {raw = {0xfe, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}, unicast = {prefix = 0x80fe, interface_id = 0xe40c410004f10800}, multicast = {header = {0xfe, 0x80}, raw_group_id = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0x4, 0x0, 0x41, 0xc, 0xe4}}}, service_pkey = 0xffff, resv = 0x0, service_lease = 0xffffffff, service_key = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8, 0xf1, 0xff, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_name = {0x44, 0x41, 0x50, 0x4c, 0x20, 0x41, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x20, 0x54, 0x72, 0x61, 0x6e, 0x73, 0x6c, 0x61, 
0x74, 0x69, 0x6f, 0x6e, 0x20, 0x53, 0x65, 0x72, 0x76, 0x69, 0x63, 0x65, 0x0 }, service_data8 = { 0x0 , 0xc0, 0xa8, 0x2, 0x64}, service_data16 = {0xf108, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, service_data32 = {0x39525349, 0x2d343230, 0x34656330, 0x74697700}, service_data64 = {0x6863, 0x0}} (gdb) quit -- This message was sent on behalf of chris_youb at yahoo.ca at openSubscriber.com http://www.opensubscriber.com/message/openib-general at openib.org/5325029.html _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Fri Nov 10 16:32:03 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Nov 2006 02:32:03 +0200 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com> References: <6923467.1163190820620.JavaMail.websites@opensubscriber> <5CE025EE7D88BA4599A2C8FEFCF226F501894407@taurus.voltaire.com> <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com> Message-ID: <20061111003203.GA12085@sashak.voltaire.com> On 01:31 Sat 11 Nov , Hal Rosenstock wrote: > Can you see if this fixes it ? Thanks. > > -- Hal > > Index: opensm/osm_helper.c > =================================================================== > --- opensm/osm_helper.c (revision 10089) > +++ opensm/osm_helper.c (working copy) > @@ -1264,7 +1264,7 @@ > IN const ib_service_record_t* const p_sr, > IN const osm_log_level_t log_level ) > { > - char buf_service_key[33]; > + char buf_service_key[35]; > char buf_service_name[65]; Good catch! Other thing here buf_service_name is used for copying service_name buffer which is NULL terminated according to spec. So copying is not needed, instead we can put '\0' at end of buffer when MAD is received (for sure). 
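
[Editorial sketch, not part of the original thread.] The buffer-size change from 33 to 35 can be accounted for with a little arithmetic, assuming the 16-byte service key is printed as a "0x" prefix followed by two hex digits per byte: 2 + 32 + 1 (terminating NUL) = 35. The helper name format_service_key and the macros below are hypothetical and are not the opensm code; they only illustrate sizing the buffer from the key length rather than hard-coding it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SERVICE_KEY_LEN 16  /* bytes in service_key, as seen in the gdb dump */

/* "0x" prefix + two hex digits per byte + terminating NUL = 35 bytes */
#define SERVICE_KEY_STR_LEN (2 + 2 * SERVICE_KEY_LEN + 1)

/*
 * Hypothetical helper (not the opensm implementation): format a 16-byte
 * key as "0x" followed by 32 hex digits. Returns the number of characters
 * written, excluding the NUL, or -1 if the buffer is too small.
 */
static int format_service_key(const uint8_t key[SERVICE_KEY_LEN],
                              char *buf, size_t buf_len)
{
    size_t i;

    assert(key != NULL && buf != NULL);
    if (buf_len < SERVICE_KEY_STR_LEN)
        return -1;  /* a 33-byte buffer is rejected instead of overrun */

    buf[0] = '0';
    buf[1] = 'x';
    for (i = 0; i < SERVICE_KEY_LEN; i++)
        /* each call writes two hex digits plus a NUL that the next call overwrites */
        snprintf(buf + 2 + 2 * i, 3, "%02x", (unsigned)key[i]);
    return 2 + 2 * SERVICE_KEY_LEN;
}
```

Deriving the buffer length from SERVICE_KEY_LEN keeps the prefix and the terminator visible in one place, which is the kind of off-by-two that a bare sprintf() into a fixed array hides.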

Sasha

From xma at us.ibm.com Fri Nov 10 17:21:36 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 10 Nov 2006 17:21:36 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

Roland Dreier wrote on 11/10/2006 07:00:46 AM:

> I think it has to stay the way I wrote it. Your version:
>
> + if (empty)
> + return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP|IB_CQ_REPORT_MISSED_EVENTS) && netif_rx_reschedule(dev, 0);
> +

Thanks
Shirley Ma

From xma at us.ibm.com Fri Nov 10 17:27:11 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 10 Nov 2006 17:27:11 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

Roland,

Sorry, I did not intend to send the previous email; I sent it out
accidentally. My concern is that there would be a problem if
missed_event always returns 1: the NAPI poll would then keep going
forever. How about deferring the processing of rotting packets until
later? Like this:

> > + if (empty)
> + return (ib_req_notify_cq(priv->cq, IB_CQ_NEXT_COMP|IB_CQ_REPORT_MISSED_EVENTS) && netif_rx_reschedule(dev, 0);
> +

With this patch, throughput with NAPI + non-scaling code went from
1XXMb/s to 7XXMb/s, though there are some other problems I am still
investigating.

Thanks
Shirley Ma

From panda at cse.ohio-state.edu Fri Nov 10 21:09:03 2006
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sat, 11 Nov 2006 00:09:03 -0500 (EST)
Subject: [openib-general] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection management and optimized collective support
Message-ID: <200611110509.kAB593H1012931@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of MVAPICH2
0.9.8 with the following NEW features:

 - Checkpoint/Restart support for application transparent systems-level
   fault tolerance. BLCR-based support using native InfiniBand Gen2
   interface is provided. Flexible interface to work with different
   file systems. Tested with ext3 (local disk), NFS and PVFS2.

   Performance of sample applications with checkpoint-restart
   using PVFS2 and Lustre can be found here:

   http://nowlab.cse.ohio-state.edu/projects/mpi-iba/performance/mvapich2/application/MVAPICH2-ckpt.html

 - iWARP support: Incorporates the support for OpenFabrics/Gen2-iWARP.
   Tested with Chelsio T3 (10GigE) and Ammasso iWARP adapters and
   drivers.

 - RDMA CM-based Connection management support

 - Shared memory optimizations for collective communication operations.
   Efficient algorithms and optimizations for barrier, reduce and
   all-reduce operations. Exploits the multi-core optimized shared
   memory point-to-point communication support introduced in
   MVAPICH2 0.9.6.

   Performance of sample collective operations with this new feature
   can be found here:

   http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-coll.html

 - uDAPL support for NetEffect 10GigE adapter. Tested with
   NetEffect NE010 adapter.

More details on all features and supported platforms can be obtained
by visiting the following URL:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich2_features.html

MVAPICH2 0.9.8 release is tested with the latest OFED 1.1 stack. It
continues to deliver excellent performance.
Sample performance numbers include:

 - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR:

   Two-sided operations:
     - 2.81 microsec one-way latency (4 bytes)
     - 1561 MB/sec unidirectional bandwidth
     - 2935 MB/sec bidirectional bandwidth

   One-sided operations:
     - 4.92 microsec Put latency
     - 1569 MB/sec unidirectional Put bandwidth
     - 2935 MB/sec bidirectional Put bandwidth

 - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR
   (Dual-rail):

   Two-sided operations:
     - 2.81 microsec one-way latency (4 bytes)
     - 3127 MB/sec unidirectional bandwidth
     - 5917 MB/sec bidirectional bandwidth

   One-sided operations:
     - 4.37 microsec Put latency
     - 3137 MB/sec unidirectional Put bandwidth
     - 5917 MB/sec bidirectional Put bandwidth

 - OpenFabrics/Gen2 on Opteron single-core with PCI-Ex and IBA-DDR:

   Two-sided operations:
     - 3.01 microsec one-way latency (4 bytes)
     - 1402 MB/sec unidirectional bandwidth
     - 2238 MB/sec bidirectional bandwidth

   One-sided operations:
     - 4.65 microsec Put latency
     - 1402 MB/sec unidirectional Put bandwidth
     - 2238 MB/sec bidirectional Put bandwidth

Performance numbers for all other platforms, system configurations and
operations can be viewed by visiting `Performance' section of the
project's web page.

With the ADI-3-level design, MVAPICH2 0.9.8 delivers similar
performance for two-sided operations compared to MVAPICH 0.9.8.
Organizations and users interested in getting the best performance for
both two-sided and one-sided operations and also want to exploit
advanced features (such as fault tolerance with checkpoint/restart,
iWARP, RDMA CM connection management, multi-threading, integrated
multi-rail, multi-core optimization, memory hook support and optimized
collectives) may migrate from MVAPICH code base to MVAPICH2 code base.

For downloading MVAPICH2 0.9.8 package and accessing the anonymous
SVN, please visit the following URL:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

A stripped down version of this release is also available at the
OpenFabrics SVN.

All feedback, including bug reports and hints for performance tuning,
is welcome. Please post it to the mvapich-discuss mailing list.

Thanks,

MVAPICH Team at OSU/NBCL

======================================================================
MVAPICH/MVAPICH2 project is currently supported with funding from
U.S. National Science Foundation, U.S. DOE Office of Science,
Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx;
and with equipment support from Advanced Clustering, AMD, Apple,
Appro, Dell, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm
and Sun Microsystems. Other technology partners include Etnus.
======================================================================

From dotanb at dev.mellanox.co.il Sat Nov 11 01:28:03 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Sat, 11 Nov 2006 11:28:03 +0200 (IST)
Subject: [openib-general] [PATCH] 1/3 [core] Added support to IB_EVENT_GID_CHANGE async event
In-Reply-To:
References: <1162986489.12259.6.camel@mtls05.yok.mtl.com>
Message-ID: <1364.85.65.224.192.1163237283.squirrel@dev.mellanox.co.il>

> So this GID change event would be an extension to what the verbs spec
> defines. What is the motivation for adding this? A GID change
> doesn't seem particularly important to consumers.
>
> - R.
>

First of all, you are absolutely right: this is an implementation
extension and not an IB spec definition. I believe that the PKEY change
event is also not defined in the IB spec ..

The motivation for this event is that when one uses a GRH, one does not
put the actual GID value in the GRH properties but only supplies the
index in the GID table to be used. I believe that if OpenSM decides to
change the GID table, the user would like to know whether that index
now contains a different value ...

thanks
Dotan

From ogerlitz at voltaire.com Sat Nov 11 23:25:11 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 12 Nov 2006 09:25:11 +0200
Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions
In-Reply-To: <1162506626.29948.568.camel@brick.pathscale.com>
References: <1162506626.29948.568.camel@brick.pathscale.com>
Message-ID: <4556CC57.6020805@voltaire.com>

Ralph Campbell wrote:
> IB/ipath - Implement new verbs DMA mapping functions
>
> This patch implements the interposing DMA mapping functions to allow
> support for IOMMUs and remove the dependence on phys_to_virt().

> diff -r f37bd0e41fec drivers/infiniband/hw/ipath/ipath_dma.c
> --- /dev/null Thu Jan 01 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_dma.c Fri Oct 27 10:40:03 2006 -0800

> +/**
> + * ipath_dma_map_single - Map a kernel virtual address to DMA address
> + * @device: The device for which the dma_addr is to be created
> + * @cpu_addr: The kernel virtual address
> + * @size: The size of the region in bytes
> + * @direction: The direction of the DMA
> + */
> +static dma_addr_t ipath_dma_map_single(struct ib_device *dev,
> +                                       void *cpu_addr, size_t size,
> +                                       enum dma_data_direction direction)
> +{
> +       BUG_ON(direction == DMA_NONE);
> +       return (dma_addr_t) cpu_addr;
> +}

This is a bug, since there are architectures (e.g. PPC64) where the
native address size is u64 but dma_addr_t is u32. You have a problem
here either way, since returning an untruncated cpu_addr to consumers
might cause memory corruption, as they are expecting a 32-bit value.

Or.
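
[Editorial sketch, not part of the original thread.] The concern above can be illustrated in user space, taking the message's premise at face value: a platform with 64-bit CPU addresses and a 32-bit DMA address type. The typedef narrow_dma_addr_t and the function names are hypothetical stand-ins, not kernel APIs; the point is only that a plain cast silently drops the upper 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a platform whose DMA address type is 32 bits wide. */
typedef uint32_t narrow_dma_addr_t;

/* Mimics the cast in the quoted patch: the upper bits of the address are dropped. */
static narrow_dma_addr_t map_by_cast(uint64_t cpu_addr)
{
    return (narrow_dma_addr_t)cpu_addr;
}

/* Returns 1 if the address survives a round trip through the narrow type. */
static int cast_is_lossless(uint64_t cpu_addr)
{
    return (uint64_t)map_by_cast(cpu_addr) == cpu_addr;
}

/* A defensive variant: refuse (in debug builds) to hand out a truncated address. */
static narrow_dma_addr_t map_checked(uint64_t cpu_addr)
{
    assert(cast_is_lossless(cpu_addr));
    return map_by_cast(cpu_addr);
}
```

For an address such as 0xc000000012345678, map_by_cast() yields 0x12345678, so the value handed back can no longer be turned into the original pointer. Whether this combination of widths actually occurs on a given platform depends on its kernel configuration.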

From muli at il.ibm.com Sun Nov 12 01:59:16 2006
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Sun, 12 Nov 2006 11:59:16 +0200
Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions
In-Reply-To: <1162506626.29948.568.camel@brick.pathscale.com>
References: <1162506626.29948.568.camel@brick.pathscale.com>
Message-ID: <20061112095916.GF4988@rhun.ibm.com>

On Thu, Nov 02, 2006 at 02:30:26PM -0800, Ralph Campbell wrote:

> +static dma_addr_t ipath_dma_map_single(struct ib_device *dev,
> +                                       void *cpu_addr, size_t size,
> +                                       enum dma_data_direction direction)
> +{
> +       BUG_ON(direction == DMA_NONE);

Please use BUG_ON(!valid_dma_direction(direction)) here and elsewhere
instead.

Cheers,
Muli

From monil at voltaire.com Sun Nov 12 04:44:33 2006
From: monil at voltaire.com (Moni Levy)
Date: Sun, 12 Nov 2006 14:44:33 +0200
Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails
In-Reply-To: <013601c704a1$b86d11f0$05c8a8c0@DIEGO>
References: <003201c70317$36911f40$05c8a8c0@DIEGO> <4551E7C5.8020700@dev.mellanox.co.il> <00a301c70347$accee4f0$05c8a8c0@DIEGO> <45534DD0.3010803@dev.mellanox.co.il> <013601c704a1$b86d11f0$05c8a8c0@DIEGO>
Message-ID: <6a122cc00611120444h5ec7e401l17241077c30632fc@mail.gmail.com>

On 11/10/06, Diego Guella wrote:
> Hi Vladimir,
> Thanks for your answer.
>
> I have installed:
>
> compat-libstdc++ (version 5.0.7-35)
> libstdc++-32bit (version 4.1.2_20060705-2)
> libstdc++41 (version 4.1.2_20061024-3)
> libstdc++41-devel (version 4.1.2_20061024-3)
> libstdc++-devel (version 4.1.3-22)
>
>
> but remember that in the log file, first it says (line 6393):
> -----
> checking for C compiler default output file name... a.out
> -----
>
> and about 5000 lines below, it says my compiler can't create executables
> (of course this isn't true, because this is the machine on which I compile
> all the programs I make)
> Have you got any other suggestion?

Please try to install a 32 bit glibc-devel package.
-- Moni > > > Thanks, > Diego > > > ----- Original Message ----- > From: "Vladimir Sokolovsky" > To: "Diego Guella" > Cc: "Tziporet Koren" ; > > Sent: Thursday, November 09, 2006 4:48 PM > Subject: Re: [openib-general] Installation on openSUSE 10.2 Beta1 fails > > > > Hello Diego, > > Check that you have libstdc++, libstdc++-devel and compat-libstdc++ RPMs > > installed. > > > > Regards, > > Vladimir > > > > Diego Guella wrote: > >> > >> From: "Tziporet Koren" > >>> The failing is utility is used for IPoIB high availability. If you don't > >>> need to use them you can just change this line in ofed.conf: > >>> ipoibtools=n > >>> > >>> Tziporet > >>> > >> Thanks Tziporet for your answer. > >> > >> > >> Tried just right now, i disabled ipoibtools. I get another, more strange > >> error: > >> (attached OFED.3816.log) > >> ----- > >> /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples > >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs > >> Running: > >> ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > >> --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib > >> CPPFLAGS="-I../libibverbs/include" > >> configure: creating cache > >> /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > >> checking for a BSD-compatible install... /usr/bin/install -c > >> checking whether build environment is sane... yes > >> checking for gawk... gawk > >> checking whether make sets $(MAKE)... yes > >> checking build system type... x86_64-unknown-linux-gnu > >> checking host system type... x86_64-unknown-linux-gnu > >> checking for style of include used by make... GNU > >> checking for gcc... gcc > >> checking for C compiler default output file name... configure: error: C > >> compiler cannot create executables > >> See `config.log' for more details. 
> >> Failed to execute: > >> ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache > >> --disable-libcheck --prefix /usr/local/ofed --libdir /usr/local/ofed/lib > >> CPPFLAGS="-I../libibverbs/include" > >> error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) > >> ----- > >> > >> Am I right? It says my C compiler cannot create executables???? Is it > >> joking me???? > >> In the log file, line 6393, it says: > >> ----- > >> checking for C compiler default output file name... a.out > >> ----- > >> > >> I don't understand....! > >> Is there something I can do to fix this? > >> > >> > >> Thanks, > >> Diego > >> ------------------------------------------------------------------------ > >> > >> _______________________________________________ > >> openib-general mailing list > >> openib-general at openib.org > >> http://openib.org/mailman/listinfo/openib-general > >> > >> To unsubscribe, please visit > >> http://openib.org/mailman/listinfo/openib-general > > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > From kliteyn at dev.mellanox.co.il Sun Nov 12 05:31:04 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Nov 2006 15:31:04 +0200 Subject: [openib-general] [PATCH] osm: comparing InformInfo records In-Reply-To: <1162907368.25771.37532.camel@hal.voltaire.com> References: <454EDD97.5060000@dev.mellanox.co.il> <1162907368.25771.37532.camel@hal.voltaire.com> Message-ID: <45572218.4040801@dev.mellanox.co.il> Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2006-11-06 at 02:00, Yevgeny Kliteynik wrote: >> Hi Hal >> >> [From Vu Pham] >> 1. sending InformInfo set subscribe for trap 64,65,144 - this works; >> however, osm.log outputs wrong value for "subscribe" field > > What code issues these subscriptions ? 
How was this patch tested ? Unfortunately, we don't have this kind of test in the testbase, so this patch hasn't been tested. > >> 2. sending InformInfo set *unsubscribe* for >> trap 64,65,144 - I'm using/formating the same mad as (1) except the >> "subscribe" field is zero; however, opensm response with status 0x200 >> [/From Vu Pham] >> >> 1. The received InformInfo struct was modified before dumping it. >> This was fixed as part of the second issue. >> 2. The function that compares InformInfo structures was just comparing >> the whole memory allocated for it, including reserved fields. >> Fixed to compare more selectively. >> >> Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> >> Index: opensm/osm_sa_informinfo.c >> =================================================================== >> --- opensm/osm_sa_informinfo.c (revision 10064) >> +++ opensm/osm_sa_informinfo.c (working copy) >> @@ -345,7 +345,6 @@ osm_infr_rcv_process_set_method( >> ib_inform_info_t *p_recvd_inform_info; >> osm_infr_t inform_info_rec; /* actual inform record to be stored for reports */ >> osm_infr_t *p_infr; >> - uint8_t subscribe; >> ib_net32_t qpn; >> uint8_t resp_time_val; >> ib_api_status_t res; >> @@ -403,19 +402,11 @@ osm_infr_rcv_process_set_method( >> * >> * QPN: >> * internally we keep the QPN field of the InformInfo updated >> - * so we can simply compare the entire record - when finding such. >> - * IBA spec only requires the QPN field to be filled when an unsubscribe >> - * Set(InformInfo) is done. See table 119 p 740 QPN field >> - * >> - * SUBSCRIBE: >> - * For similar reasons we change the subscribe to 0 on the >> - * inserted/searched data >> + * so we can simply compare it in the record - when finding such. 
>> */ >> >> - subscribe = p_recvd_inform_info->subscribe; >> - if (subscribe) >> + if (p_recvd_inform_info->subscribe) >> { >> - inform_info_rec.inform_record.inform_info.subscribe = 0; >> ib_inform_info_set_qpn( >> &inform_info_rec.inform_record.inform_info, >> inform_info_rec.report_addr.addr_type.gsi.remote_qp ); >> @@ -443,7 +434,7 @@ osm_infr_rcv_process_set_method( >> p_infr = osm_infr_get_by_rec( p_rcv->p_subn, p_rcv->p_log, &inform_info_rec ); >> >> /* check to see if the request was for subscribe = 1 */ >> - if (subscribe) >> + if (p_recvd_inform_info->subscribe) >> { >> /* validate the request for a new or update InformInfo */ >> if (__validate_infr( p_rcv, &inform_info_rec ) != TRUE) >> @@ -480,6 +471,8 @@ osm_infr_rcv_process_set_method( >> goto Exit; >> } >> >> + /* set the subscribe bit to 0 before adding the record */ >> + p_infr->inform_record.inform_info.subscribe = 0; > > It seems odd to me to set subscribe to 0 for a subscription (rather than > when it is an unsibscription). Aren't only subscriptions kept in the > database ? Is this an artifact of the matching code ? If so, why not > change that ? You're right. Previously the zero value was used for comparing the whole record to the unsubscribe request (which carries 0 in the 'subscribe' field). Now that the comparing function has been changed, no need to keep zeroing this field. >> /* Add this new osm_infr_t object to subnet object */ >> osm_infr_insert_to_db( p_rcv->p_subn, p_rcv->p_log, p_infr ); >> >> @@ -488,6 +481,8 @@ osm_infr_rcv_process_set_method( >> { >> /* Update the old instance of the osm_infr_t object */ >> p_infr->inform_record = inform_info_rec.inform_record; >> + /* set the subscribe bit to 0 after updating the record */ >> + p_infr->inform_record.inform_info.subscribe = 0; > > Same as previous comment. 
> >> } >> } >> else >> Index: opensm/osm_inform.c >> =================================================================== >> --- opensm/osm_inform.c (revision 10064) >> +++ opensm/osm_inform.c (working copy) >> @@ -206,30 +206,133 @@ __match_inf_rec( >> osm_infr_t* p_infr_rec = (osm_infr_t *)context; >> osm_infr_t* p_infr = (osm_infr_t*)p_list_item; >> osm_log_t *p_log = p_infr_rec->p_infr_rcv->p_log; >> - cl_status_t status; >> - int32_t count1, count2; >> + cl_status_t status = CL_NOT_FOUND; >> + ib_gid_t all_zero_gid; >> + >> >> OSM_LOG_ENTER( p_log, __match_inf_rec); >> >> - count1 = memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, >> - sizeof(p_infr_rec->report_addr)); >> - if (count1) >> - osm_log( p_log, OSM_LOG_DEBUG, >> - "__match_inf_rec: " >> - "Differ by Address\n" ); >> - count2 = memcmp( >> - &p_infr->inform_record.inform_info, >> - &p_infr_rec->inform_record.inform_info, >> - sizeof(p_infr->inform_record.inform_info) ); >> - if (count2) >> - osm_log( p_log, OSM_LOG_DEBUG, >> - "__match_inf_rec: " >> - "Differ by InformInfo\n" ); >> - if ((count1 == 0) && (count2 == 0)) >> - status = CL_SUCCESS; >> + if ( !memcmp(&p_infr->report_addr, >> + &p_infr_rec->report_addr, >> + sizeof(p_infr_rec->report_addr)) ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by Address\n" ); >> + goto Exit; >> + } >> + >> + memset(&all_zero_gid, 0, sizeof(ib_gid_t)); >> + >> + /* if inform_info.gid is not zero, ignoring lid range */ >> + if ( !memcmp(&p_infr_rec->inform_record.inform_info.gid, >> + &all_zero_gid, >> + sizeof(p_infr_rec->inform_record.inform_info.gid)) ) >> + { >> + if ( !memcmp(&p_infr->inform_record.inform_info.gid, >> + &p_infr_rec->inform_record.inform_info.gid, >> + sizeof(p_infr->inform_record.inform_info.gid)) ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.gid\n" ); >> + goto Exit; >> + } >> + } >> else >> - status = CL_NOT_FOUND; >> + { >> + if ( 
(p_infr->inform_record.inform_info.lid_range_begin != >> + p_infr_rec->inform_record.inform_info.lid_range_begin) || >> + (p_infr->inform_record.inform_info.lid_range_end != >> + p_infr_rec->inform_record.inform_info.lid_range_end) ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.LIDRange\n" ); >> + goto Exit; >> + } >> + } >> + >> + if ( p_infr->inform_record.inform_info.is_generic != >> + p_infr_rec->inform_record.inform_info.is_generic ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.IsGeneric\n" ); >> + goto Exit; >> + } >> >> + if ( p_infr->inform_record.inform_info.trap_type != >> + p_infr_rec->inform_record.inform_info.trap_type ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.TrapType\n" ); >> + goto Exit; >> + } >> + >> + if ( p_infr->inform_record.inform_info.is_generic != >> + p_infr_rec->inform_record.inform_info.is_generic ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.IsGeneric\n" ); >> + } > > This appears to be a duplicate of what was added shortly earlier in this > patch. Right, good catch. I'll send a V2 of this patch shortly. 
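
[Editorial sketch, not part of the original thread.] The idea discussed in this thread, comparing InformInfo records field by field instead of with one memcmp() over the whole structure, can be shown on a simplified record. The struct inform_rec and inform_rec_match() below are hypothetical and do not follow the real ib_inform_info_t layout. One detail worth keeping in mind when reading such code: memcmp() returns 0 when the buffers are equal, so a "differ" branch is the one taken on a non-zero result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record; not the real ib_inform_info_t layout. */
struct inform_rec {
    uint8_t  gid[16];
    uint16_t lid_range_begin;
    uint16_t lid_range_end;
    uint16_t reserved;   /* contents unspecified; must not affect matching */
    uint8_t  is_generic;
    uint8_t  subscribe;  /* differs between subscribe and unsubscribe requests */
    uint16_t trap_type;
};

/* Returns 1 when the two records match on the meaningful fields, else 0. */
static int inform_rec_match(const struct inform_rec *a,
                            const struct inform_rec *b)
{
    static const uint8_t zero_gid[16];

    assert(a != NULL && b != NULL);

    /* memcmp() returns 0 when the buffers are equal, non-zero when they differ. */
    if (memcmp(b->gid, zero_gid, sizeof zero_gid) != 0) {
        /* A GID was given: compare GIDs and ignore the LID range. */
        if (memcmp(a->gid, b->gid, sizeof a->gid) != 0)
            return 0;
    } else if (a->lid_range_begin != b->lid_range_begin ||
               a->lid_range_end != b->lid_range_end) {
        return 0;
    }

    /* Reserved bits and the subscribe flag are deliberately not compared. */
    return a->is_generic == b->is_generic && a->trap_type == b->trap_type;
}
```

With this shape, an unsubscribe request (subscribe = 0) can match a stored subscription without the stored copy having to be rewritten, which is the simplification agreed on above.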
-- Yevgeny > -- Hal > >> + else if (p_infr->inform_record.inform_info.is_generic) >> + { >> + if ( p_infr->inform_record.inform_info.g_or_v.generic.trap_num != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.trap_num ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.TrapNumber\n" ); >> + goto Exit; >> + } >> + else if ( p_infr->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.QPNRespTimeVal\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_msb != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_msb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.NodeTypeMSB\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_lsb != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_lsb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.NodeTypeLSB\n" ); >> + else >> + status = CL_SUCCESS; >> + } >> + else >> + { >> + if ( p_infr->inform_record.inform_info.g_or_v.vend.dev_id != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.dev_id ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Vendor.DeviceID\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Vendor.QPNRespTimeVal\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_msb != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_msb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by 
InformInfo.Vendor.VendorIdMSB\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_lsb != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_lsb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Vendor.VendorIdLSB\n" ); >> + else >> + status = CL_SUCCESS; >> + } >> + >> + Exit: >> OSM_LOG_EXIT( p_log ); >> return status; >> } >> > From kliteyn at dev.mellanox.co.il Sun Nov 12 05:42:28 2006 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Nov 2006 15:42:28 +0200 Subject: [openib-general] [PATCH v2] osm: comparing InformInfo records Message-ID: <455724C4.9080806@dev.mellanox.co.il> Hi Hal Here's the fixed InformInfo patch Yevgeny Signed-off-by: Yevgeny Kliteynik Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 10100) +++ opensm/osm_inform.c (working copy) @@ -206,30 +206,123 @@ __match_inf_rec( osm_infr_t* p_infr_rec = (osm_infr_t *)context; osm_infr_t* p_infr = (osm_infr_t*)p_list_item; osm_log_t *p_log = p_infr_rec->p_infr_rcv->p_log; - cl_status_t status; - int32_t count1, count2; + cl_status_t status = CL_NOT_FOUND; + ib_gid_t all_zero_gid; OSM_LOG_ENTER( p_log, __match_inf_rec); - count1 = memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, - sizeof(p_infr_rec->report_addr)); - if (count1) - osm_log( p_log, OSM_LOG_DEBUG, - "__match_inf_rec: " - "Differ by Address\n" ); - count2 = memcmp( - &p_infr->inform_record.inform_info, - &p_infr_rec->inform_record.inform_info, - sizeof(p_infr->inform_record.inform_info) ); - if (count2) - osm_log( p_log, OSM_LOG_DEBUG, - "__match_inf_rec: " - "Differ by InformInfo\n" ); - if ((count1 == 0) && (count2 == 0)) - status = CL_SUCCESS; + if ( !memcmp(&p_infr->report_addr, + &p_infr_rec->report_addr, + sizeof(p_infr_rec->report_addr)) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by Address\n" ); + goto Exit; + } + + 
memset(&all_zero_gid, 0, sizeof(ib_gid_t)); + + /* if inform_info.gid is not zero, ignoring lid range */ + if ( !memcmp(&p_infr_rec->inform_record.inform_info.gid, + &all_zero_gid, + sizeof(p_infr_rec->inform_record.inform_info.gid)) ) + { + if ( !memcmp(&p_infr->inform_record.inform_info.gid, + &p_infr_rec->inform_record.inform_info.gid, + sizeof(p_infr->inform_record.inform_info.gid)) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.gid\n" ); + goto Exit; + } + } + else + { + if ( (p_infr->inform_record.inform_info.lid_range_begin != + p_infr_rec->inform_record.inform_info.lid_range_begin) || + (p_infr->inform_record.inform_info.lid_range_end != + p_infr_rec->inform_record.inform_info.lid_range_end) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.LIDRange\n" ); + goto Exit; + } + } + + if ( p_infr->inform_record.inform_info.trap_type != + p_infr_rec->inform_record.inform_info.trap_type ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.TrapType\n" ); + goto Exit; + } + + if ( p_infr->inform_record.inform_info.is_generic != + p_infr_rec->inform_record.inform_info.is_generic ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.IsGeneric\n" ); + } + else if (p_infr->inform_record.inform_info.is_generic) + { + if ( p_infr->inform_record.inform_info.g_or_v.generic.trap_num != + p_infr_rec->inform_record.inform_info.g_or_v.generic.trap_num ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.TrapNumber\n" ); + goto Exit; + } + else if ( p_infr->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val != + p_infr_rec->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.QPNRespTimeVal\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_msb != + 
p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_msb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.NodeTypeMSB\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_lsb != + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_lsb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.NodeTypeLSB\n" ); + else + status = CL_SUCCESS; + } else - status = CL_NOT_FOUND; + { + if ( p_infr->inform_record.inform_info.g_or_v.vend.dev_id != + p_infr_rec->inform_record.inform_info.g_or_v.vend.dev_id ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.DeviceID\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val != + p_infr_rec->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.QPNRespTimeVal\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_msb != + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_msb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.VendorIdMSB\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_lsb != + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_lsb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.VendorIdLSB\n" ); + else + status = CL_SUCCESS; + } + Exit: OSM_LOG_EXIT( p_log ); return status; } Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 10100) +++ opensm/osm_sa_informinfo.c (working copy) @@ -345,7 +345,6 @@ osm_infr_rcv_process_set_method( ib_inform_info_t *p_recvd_inform_info; osm_infr_t inform_info_rec; /* actual inform record to be stored for reports */ osm_infr_t *p_infr; - uint8_t subscribe; ib_net32_t qpn; uint8_t 
resp_time_val; ib_api_status_t res; @@ -403,19 +402,11 @@ osm_infr_rcv_process_set_method( * * QPN: * internally we keep the QPN field of the InformInfo updated - * so we can simply compare the entire record - when finding such. - * IBA spec only requires the QPN field to be filled when an unsubscribe - * Set(InformInfo) is done. See table 119 p 740 QPN field - * - * SUBSCRIBE: - * For similar reasons we change the subscribe to 0 on the - * inserted/searched data + * so we can simply compare it in the record - when finding such. */ - subscribe = p_recvd_inform_info->subscribe; - if (subscribe) + if (p_recvd_inform_info->subscribe) { - inform_info_rec.inform_record.inform_info.subscribe = 0; ib_inform_info_set_qpn( &inform_info_rec.inform_record.inform_info, inform_info_rec.report_addr.addr_type.gsi.remote_qp ); @@ -443,7 +434,7 @@ osm_infr_rcv_process_set_method( p_infr = osm_infr_get_by_rec( p_rcv->p_subn, p_rcv->p_log, &inform_info_rec ); /* check to see if the request was for subscribe = 1 */ - if (subscribe) + if (p_recvd_inform_info->subscribe) { /* validate the request for a new or update InformInfo */ if (__validate_infr( p_rcv, &inform_info_rec ) != TRUE) From mst at mellanox.co.il Sun Nov 12 05:57:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 12 Nov 2006 15:57:52 +0200 Subject: [openib-general] Fwd: [ANNOUNCE] GIT 1.4.3.5 Message-ID: <20061112135752.GC31290@mellanox.co.il> FYI I don't see any updates openfabrics server needs. -- MST -------------- next part -------------- An embedded message was scrubbed... From: "Junio C Hamano" Subject: [ANNOUNCE] GIT 1.4.3.5 Date: Sun, 12 Nov 2006 07:23:20 +0200 Size: 4003 URL: From troy at scl.ameslab.gov Sun Nov 12 14:57:37 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Sun, 12 Nov 2006 16:57:37 -0600 Subject: [openib-general] ehca build on 2.6.18.2?? Message-ID: what is up with subversion? 
it does not build with errors like this: CC [M] drivers/infiniband/core/uverbs_main.o drivers/infiniband/core/uverbs_main.c: In function 'uverbs_event_get_sb': drivers/infiniband/core/uverbs_main.c:811: error: too few arguments to function 'get_sb_pseudo' drivers/infiniband/core/uverbs_main.c: At top level: drivers/infiniband/core/uverbs_main.c:817: warning: initialization from incompatible pointer type From troy at scl.ameslab.gov Sun Nov 12 16:15:50 2006 From: troy at scl.ameslab.gov (Troy Benjegerdes) Date: Sun, 12 Nov 2006 18:15:50 -0600 Subject: [openib-general] ehca build on 2.6.18.2?? In-Reply-To: References: Message-ID: <75966154-70FB-470D-98DF-1ED735430893@scl.ameslab.gov> Um. So I built openib-1.1 from the OFED-1.1 tarball, and now I get: * p5l9:/usr/src/openib-1.1/src/userspace/libehca# ibv_devinfo libibverbs: Warning: no userspace device-specific driver found for uverbs0 driver search path: /usr/local/lib/infiniband No IB devices found On Nov 12, 2006, at 4:57 PM, Troy Benjegerdes wrote: > what is up with subversion? it does not build with errors like this: > > CC [M] drivers/infiniband/core/uverbs_main.o > drivers/infiniband/core/uverbs_main.c: In function > 'uverbs_event_get_sb': > drivers/infiniband/core/uverbs_main.c:811: error: too few arguments > to function 'get_sb_pseudo' > drivers/infiniband/core/uverbs_main.c: At top level: > drivers/infiniband/core/uverbs_main.c:817: warning: initialization > from incompatible pointer type > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/ > openib-general > From tziporet at dev.mellanox.co.il Sun Nov 12 18:51:47 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 13 Nov 2006 04:51:47 +0200 Subject: [openib-general] ehca build on 2.6.18.2?? 
In-Reply-To: <75966154-70FB-470D-98DF-1ED735430893@scl.ameslab.gov> References: <75966154-70FB-470D-98DF-1ED735430893@scl.ameslab.gov> Message-ID: <4557DDC3.9060703@dev.mellanox.co.il> Troy Benjegerdes wrote: > Um. So I built openib-1.1 from the OFED-1.1 tarball, and now I get: > > * > p5l9:/usr/src/openib-1.1/src/userspace/libehca# ibv_devinfo > libibverbs: Warning: no userspace device-specific driver found for > uverbs0 > driver search path: /usr/local/lib/infiniband > No IB devices found > > For ehca you need to take OFED 1.1.1 (see https://openib.org/svn/gen2/branches/1.1/ofed/releases/) Tziporet From zhushisongzhu at yahoo.com Sun Nov 12 20:10:34 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Sun, 12 Nov 2006 20:10:34 -0800 (PST) Subject: [openib-general] compile error Message-ID: <20061113041034.17303.qmail@web36906.mail.mud.yahoo.com> openib src: svn the latest from openfabrics kernel : 2.6.18 FC5 error: ipoib_multicast.c: in struct net_device there is no xmit_lock member name how to handle it? tks zhu From krkumar2 at in.ibm.com Sun Nov 12 21:07:49 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 13 Nov 2006 10:37:49 +0530 Subject: [openib-general] [Fwd: [PATCH] RDMA/iwcm: Fix memory leak] In-Reply-To: Message-ID: Hi Tom, > if (len) > kfree(ptr) > > or > > if (ptr) > kfree(ptr) > > is correct is contingent upon how you couple the two variables. But I don't > think this has anything to do with the Roland's point. That is correct. I stated it somewhat differently by saying that the two variables may not be initialized together, eg, if there is no data, the driver can initialize the private_data_len to zero instead of setting that as well as private_data to NULL (redundant).
How it is done at the driver is not known at this time, unless we specify how it should be done at the access layer, so that future drivers have to be written to this standard (amso seems to do both and no check is necessary). Eg a patch like :
out:
+ BUG_ON((private_data_len == 0 && private_data != NULL) ||
+        (private_data_len && private_data == NULL));
kfree(private_data);
can catch wrong driver implementations early on. Maybe that is something that can be added ? Thanks, - KK > However, Roland's point is that in the kernel, it's contingent upon us all > to know and leverage the error checking done by the services we use. If > kfree checks for nul, we don't have to....and shouldn't check it. > > Kittens are cute... really ... who can argue with that? What 'len' allows us > to assume about 'ptr' is a little more ... well... fuzzy. > > > On 11/9/06 11:11 PM, "Krishna Kumar2" wrote: > > > That is valid only if the drivers also comply. Eg if driver has two > > stack variables private_data and private_data_len, and it sets > > only private_data_len to zero. Then when calling the upper layer, > > it sets the event->private_data to its local private_data (uninitialized) > > and event->private_data_len to its local private_data_len (zero). > > Here we have to check the private_data_len before touching > > private_data or risk bug/panic. > > > > thanks, > > > > - KK > > > > Tom Tucker wrote on 11/10/2006 10:20:18 AM: > > > >> > >> If it's truly nul or a ptr, we don't need to (and shouldn't) check, just > >> call kfree. If it's unitialized, we can't tell anyway and it's a bug -- > >> right? > >> > >> Am I missing something? > >> > >> On 11/9/06 10:41 PM, "Krishna Kumar2" wrote: > >> > >>> Though the amso driver (c2_ae_event) is setting the private_data and > >>> private_data_len together for connect request and connect result, so > >>> the check may not be necessary.
But if the semantics prefer checking > >>> to make sure, we should follow that (esp if other future drivers may > >>> also simply set private_data_len to zero without modifying > >>> private_data). > >>> > >>> I did it this way since cm_conn_rep_handler() had the same check :) > >>> > >>> thanks, > >>> > >>> - KK > >>> > >>>> I think the semantics are that the pointer is only used if > >>>> private_data_len > 0. Otherwise, it is undefined. So I think we > > should > >>>> keep the check. Plus I don't like calling kfree() with a NULL > > pointer. > >>>> It just seems wrong... > >>>> > >>>> ;-) > >>>> > >>>> > >>>> On Thu, 2006-11-09 at 14:59 -0800, Roland Dreier wrote: > >>>>>>> if (iw_event->private_data_len) > >>>>>>> kfree(iw_event->private_data); > >>>>>> > >>>>>> Kfree checks for a null value, so is the private_data_len check > >>> necessary? > >>>>> > >>>>> Could private_data be a junk pointer if private_data_len == 0 ? > >>>>> > >>>>> - R. > >>>> > >>>> > >>>> _______________________________________________ > >>>> openib-general mailing list > >>>> openib-general at openib.org > >>>> http://openib.org/mailman/listinfo/openib-general > >>>> > >>>> To unsubscribe, please visit > >>> http://openib.org/mailman/listinfo/openib-general > >>>> > >>> > >>> > >>> _______________________________________________ > >>> openib-general mailing list > >>> openib-general at openib.org > >>> http://openib.org/mailman/listinfo/openib-general > >>> > >>> To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > >>> > >> > >> > > > > From krkumar2 at in.ibm.com Sun Nov 12 21:30:46 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 13 Nov 2006 11:00:46 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: Message-ID: Hi Tom & Sean, > > > > There may be a race here, but... Why wouldn't the second call into > > cm_work_handler simply find the list empty on entry into the call? 
> > Basically, you've got a free work queue element sitting on the iwcm_wq. What > typically would happen is you'd end up corrupting the list because the > cm_event_handler would enqueue the element, it would get freed in > cm_work_handler with put_work, then cm_event_handler would call get_work > (getting the one just freed that's also sitting on the iwcm_wq list) and ... > bad things happen. Actually I had anticipated this possible problem when I submitted the patch, and I have explained (given below) why it is not a problem : "Doing the redundant queue_work() (if cm_work_handler is already running processing the last entry) will not result in another call to cm_work_handler (run_workqueue) where no entry is found, since cm_work_handler will remove all entries from the list, even ones that are added late". Isn't that correct ? So if cm_work_handler() is already running and processing the LAST entry (anything but the last entry will not have an issue as a new queue_work would not be done by iwcm), it will next find this new entry in it's current run iteration, and process it. Meanwhile iwcm had done a redundant "queue_work()" on this queue, which, besides adding the new entry to the workqueue, also does a wakeup of "worker_thread" (which is still running the previous iteration of run_workqueue -> cm_work_handler). When cm_work_handler finishes removing this new entry, it returns to worker_thread, which will do a schedule() which gets woken up again immediately due to the redundant "queue_work" done earlier, but then it checks whether the list is empty and since it is empty, it does another "schedule()" call. So that is what I meant by saying that another call to cm_work_handler() will NOT result (and where that redundant call would find no entry to process). So I feel this patch is correct in it's original form. Comments or did I misunderstand the kernel code completely ? 
Thanks, - KK > > > As an > > alternative, could you defer the list_del_init() call to the end of the loop, > > which would avoid scheduling cm_work_handler while it's running? > > Yeah, that's a good idea. > > > > > - Sean > > From krkumar2 at in.ibm.com Sun Nov 12 21:40:55 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Mon, 13 Nov 2006 11:10:55 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: Message-ID: Sorry, the last mail came disfigured :). Here it is in a (hopefully) more readable form : Hi Tom & Sean, > > There may be a race here, but... Why wouldn't the second call into > > cm_work_handler simply find the list empty on entry into the call? > > Basically, you've got a free work queue element sitting on the iwcm_wq. What > typically would happen is you'd end up corrupting the list because the > cm_event_handler would enqueue the element, it would get freed in > cm_work_handler with put_work, then cm_event_handler would call get_work > (getting the one just freed that's also sitting on the iwcm_wq list) and ... > bad things happen. Actually I had anticipated this possible problem when I submitted the patch, and I have explained (given below) why it is not a problem : "Doing the redundant queue_work() (if cm_work_handler is already running processing the last entry) will not result in another call to cm_work_handler (run_workqueue) where no entry is found, since cm_work_handler will remove all entries from the list, even ones that are added late". Isn't that correct ? So if cm_work_handler() is already running and processing the LAST entry (anything but the last entry will not have an issue as a new queue_work would not be done by iwcm), it will next find this new entry in it's current run iteration, and process it. 
Meanwhile iwcm had done a redundant "queue_work()" on this queue, which, besides adding the new entry to the workqueue, also does a wakeup of "worker_thread" (which is still running the previous iteration of run_workqueue -> cm_work_handler). When cm_work_handler finishes removing this new entry, it returns to worker_thread, which will do a schedule() which gets woken up again immediately due to the redundant "queue_work" done earlier, but then it checks whether the list is empty and since it is empty, it does another "schedule()" call. So that is what I meant by saying that another call to cm_work_handler() will NOT result (and where that redundant call would find no entry to process). So I feel this patch is correct in it's original form. Comments or did I misunderstand the kernel code completely ? Thanks, - KK > > > As an > > alternative, could you defer the list_del_init() call to the end of the loop, > > which would avoid scheduling cm_work_handler while it's running? > > Yeah, that's a good idea. 
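The draining argument in the message above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct work`, `struct work_list`, `enqueue()`, `dequeue()` and `run_handler()` are hypothetical names, not the iwcm or workqueue code, and there is no locking here. It shows just one thing: a handler that loops until its list is empty also picks up an element added while it is running, so a second, redundant invocation finds nothing to do. It says nothing about the element-reuse concern raised in the quoted reply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a work element and its pending list. */
struct work {
    struct work *next;
    int id;
};

struct work_list {
    struct work *head;
    struct work *tail;
};

/* Append one element to the tail of the pending list. */
void enqueue(struct work_list *l, struct work *w)
{
    assert(l != NULL && w != NULL);
    w->next = NULL;
    if (l->tail)
        l->tail->next = w;
    else
        l->head = w;
    l->tail = w;
}

/* Remove and return the head element, or NULL if the list is empty. */
struct work *dequeue(struct work_list *l)
{
    struct work *w = l->head;

    if (w) {
        l->head = w->next;
        if (!l->head)
            l->tail = NULL;
    }
    return w;
}

/*
 * One handler invocation: keep going until the list is empty.  If `late`
 * is non-NULL it is enqueued while the first element is being handled,
 * standing in for an event that arrives while the handler is running.
 * Returns the number of elements handled.
 */
int run_handler(struct work_list *l, struct work *late)
{
    int handled = 0;

    while (dequeue(l) != NULL) {
        handled++;
        if (late) {
            enqueue(l, late);
            late = NULL;
        }
    }
    return handled;
}

/* Elements handled by the run during which the late element arrives. */
int demo_first_run(void)
{
    struct work_list l = { NULL, NULL };
    struct work a = { NULL, 1 }, b = { NULL, 2 };

    enqueue(&l, &a);
    return run_handler(&l, &b);
}

/* Elements handled by a second, redundant run after the first one. */
int demo_redundant_run(void)
{
    struct work_list l = { NULL, NULL };
    struct work a = { NULL, 1 }, b = { NULL, 2 };

    enqueue(&l, &a);
    run_handler(&l, &b);
    return run_handler(&l, NULL);
}
```

In this toy model the first run handles both the queued element and the late one, and the redundant run handles none.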
> > > > > - Sean > > From michael at ellerman.id.au Sun Nov 12 22:05:05 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 13 Nov 2006 17:05:05 +1100 Subject: [openib-general] [PATCH 1/6] Add pci_find_ht_capability() for finding Hypertransport capabilities In-Reply-To: <363CCF5E-2CB8-4AFB-A620-5BB2F57458AB@kernel.crashing.org> References: <20061109064046.5726767C76@ozlabs.org> <363CCF5E-2CB8-4AFB-A620-5BB2F57458AB@kernel.crashing.org> Message-ID: <1163397905.7410.84.camel@localhost.localdomain> On Thu, 2006-11-09 at 09:01 +0100, Segher Boessenkool wrote: > > +int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int > > ht_cap) > > +{ > > + int rc; > > + u8 cap, mask; > > + > > + if (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST) > > + mask = HT_3BIT_CAP_MASK; > > + else > > + mask = HT_5BIT_CAP_MASK; > > + > > + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT); > > or the caller will loop forever if a second same type HT cap is found. Er .. duh. Memo, don't send code just before leaving for the weekend :) Putting that back in is going to break pci_find_ht_capability(), so I'll have to rethink it. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From michael at ellerman.id.au Sun Nov 12 22:45:37 2006 From: michael at ellerman.id.au (Michael Ellerman) Date: Mon, 13 Nov 2006 17:45:37 +1100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> Message-ID: <1163400337.7410.111.camel@localhost.localdomain> On Thu, 2006-11-09 at 15:43 +0100, Segher Boessenkool wrote: > > While yes, we should not in general add new workarounds before we need > > them, for this quirk, you should keep the original functionality, > > unless > > you wrote the quirk, or unless you have the hardware that needs it and > > you can verify that the change works properly. > > > > Are any of these last two options true for you? > > This new code only runs on HyperTransport devices and > none of those _existed_ when the quirk was first written. > I cannot claim I know for sure it is never needed there > of course, but it's quite improbable at least. > > > If not, I suggest that you put the TTL logic back in just to be safe. > > I'm fine with that -- but I'm not writing the code here, > Michael is, and I just hope he has more spine than I do ;-) Nah no spine here. I will quote Rusty though, who has plenty of spine ... http://www.ozlabs.com/~rusty/ols-2003-keynote/img53.html pci_find_next_capability() scores a 14 IMHO. Of three callers, we currently have two that don't use a TTL and so are vulnerable to bodgy PCI cap lists. What if we had:
extern int pci_find_next_capability(struct pci_dev *dev, u8 pos, int cap, int *ttl);
...
int pos, ttl = 0;
pos = pci_find_capability(dev, PCI_CAP_FOO);
while (pos) {
	...
	stuff();
	...
	pos = pci_find_next_capability(dev, pos, PCI_CAP_FOO, &ttl);
}
It's not pretty I admit.
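To make the shared-TTL idea concrete, here is a minimal userspace sketch, assuming a config space modelled as a 256-byte array. Everything in it is hypothetical and for illustration only: `find_next_cap_ttl()`, `count_caps()`, `build_cfg()`, the `demo_*` helpers and the `WALK_TTL` value of 48 are made-up names and numbers, not the kernel API or any posted patch. The point is that one TTL budget is decremented on every list hop across all calls, so a capability list that loops back on itself cannot keep the caller's `while (pos)` loop running forever.

```c
#include <stdint.h>
#include <string.h>

#define CFG_SIZE  256
#define CAP_PTR   0x34  /* config-space byte holding the first capability offset */
#define CAP_ID    0     /* offset of the ID byte within a capability */
#define CAP_NEXT  1     /* offset of the next-pointer byte within a capability */
#define CAP_HT    0x08  /* HyperTransport capability ID */
#define CAP_MSI   0x05  /* some other capability ID */
#define WALK_TTL  48    /* assumed upper bound on hops for one whole walk */

/*
 * ptr_pos is the offset of a byte that holds a pointer to the next
 * capability.  Returns the offset of the next capability with the given
 * ID, or 0 if the list ends or the shared ttl budget is used up.
 */
int find_next_cap_ttl(const uint8_t *cfg, uint8_t ptr_pos, uint8_t cap, int *ttl)
{
    while ((*ttl)-- > 0) {
        uint8_t pos = cfg[ptr_pos] & ~3u;

        if (pos < 0x40)                 /* pointer into the header: end of list */
            return 0;
        if (cfg[pos + CAP_ID] == 0xff)  /* unreadable entry: give up */
            return 0;
        if (cfg[pos + CAP_ID] == cap)
            return pos;
        ptr_pos = (uint8_t)(pos + CAP_NEXT);
    }
    return 0;
}

/* Count capabilities of one type, sharing a single ttl across the walk. */
int count_caps(const uint8_t *cfg, uint8_t cap)
{
    int ttl = WALK_TTL;
    int n = 0;
    int pos = find_next_cap_ttl(cfg, CAP_PTR, cap, &ttl);

    while (pos) {
        n++;
        pos = find_next_cap_ttl(cfg, (uint8_t)(pos + CAP_NEXT), cap, &ttl);
    }
    return n;
}

/* Build a list HT(0x40) -> MSI(0x50) -> HT(0x60), optionally looping back. */
void build_cfg(uint8_t *cfg, int make_loop)
{
    memset(cfg, 0, CFG_SIZE);
    cfg[CAP_PTR] = 0x40;
    cfg[0x40 + CAP_ID] = CAP_HT;
    cfg[0x40 + CAP_NEXT] = 0x50;
    cfg[0x50 + CAP_ID] = CAP_MSI;
    cfg[0x50 + CAP_NEXT] = 0x60;
    cfg[0x60 + CAP_ID] = CAP_HT;
    cfg[0x60 + CAP_NEXT] = make_loop ? 0x40 : 0x00;
}

/* HT capabilities found in a well-formed list. */
int demo_well_formed(void)
{
    uint8_t cfg[CFG_SIZE];

    build_cfg(cfg, 0);
    return count_caps(cfg, CAP_HT);
}

/* HT capabilities "found" in a looping list before the ttl runs out. */
int demo_looping(void)
{
    uint8_t cfg[CFG_SIZE];

    build_cfg(cfg, 1);
    return count_caps(cfg, CAP_HT);
}
```

With the well-formed list the walk finds the two HT entries and stops; with the looping list it still terminates, because the count can never exceed the TTL budget. Note that the budget has to start at a non-zero value for any entry to be found.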
pci_find_next_ht_capability() would also take a ttl, and pass it through to pci_find_next_capability(). AFAICT that would avoid any infinite hang scenarios? cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From benh at kernel.crashing.org Sun Nov 12 22:55:39 2006 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 13 Nov 2006 17:55:39 +1100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <1163400337.7410.111.camel@localhost.localdomain> References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> <1163400337.7410.111.camel@localhost.localdomain> Message-ID: <1163400939.4982.284.camel@localhost.localdomain> > It's not pretty I admit. pci_find_next_ht_capability() would also take a > ttl, and pass it through to pci_find_next_capability(). AFAICT that > would avoid any infinite hang scenarios? What if we didn't try to solve a problem we don't have ? Have we yet encountered an HT device with that sort of bogus capability list ? Ben. From ogerlitz at voltaire.com Sun Nov 12 23:46:11 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 13 Nov 2006 09:46:11 +0200 Subject: [openib-general] ehca build on 2.6.18.2?? In-Reply-To: References: Message-ID: <455822C3.2060607@voltaire.com> Troy Benjegerdes wrote: > what is up with subversion? 
it does not build with errors like this: > > CC [M] drivers/infiniband/core/uverbs_main.o > drivers/infiniband/core/uverbs_main.c: In function > 'uverbs_event_get_sb': > drivers/infiniband/core/uverbs_main.c:811: error: too few arguments > to function 'get_sb_pseudo' Practically, the IB kernel code under the subversion is not maintained (ie some parts are and others not), you may not assume this or that on this code, but rather get a fresh clone/pull of Roland's kernel.org git tree or of OFED 1.x Or. From ogerlitz at voltaire.com Sun Nov 12 23:48:29 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 13 Nov 2006 09:48:29 +0200 Subject: [openib-general] compile error In-Reply-To: <20061113041034.17303.qmail@web36906.mail.mud.yahoo.com> References: <20061113041034.17303.qmail@web36906.mail.mud.yahoo.com> Message-ID: <4558234D.4000005@voltaire.com> zhu shi song wrote: > openib src: svn the latest from openfabrics > kernel : 2.6.18 FC5 > error: ipoib_multicast.c: in struct net_device there > is no xmit_lock member name > > how to handle it? Did you have any problems using the IB code that comes with your OS/kernel ? ===================== snip starts Practically, the IB kernel code under the subversion is not maintained (ie some parts are and others not), you may not assume this or that on this code, but rather get a fresh clone/pull of Roland's kernel.org git tree or of OFED 1.x Or. ===================== snip ends From HNGUYEN at de.ibm.com Mon Nov 13 00:21:36 2006 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Mon, 13 Nov 2006 09:21:36 +0100 Subject: [openib-general] ehca build on 2.6.18.2?? In-Reply-To: <4557DDC3.9060703@dev.mellanox.co.il> Message-ID: > Troy Benjegerdes wrote: > > Um. 
So I built openib-1.1 from the OFED-1.1 tarball, and now I get: > > * > > p5l9:/usr/src/openib-1.1/src/userspace/libehca# ibv_devinfo > > libibverbs: Warning: no userspace device-specific driver found for > > uverbs0 > > driver search path: /usr/local/lib/infiniband > > No IB devices found > For ehca you need to take OFED 1.1.1 (see > https://openib.org/svn/gen2/branches/1.1/ofed/releases/) Tziporet, thanks for helping Troy! Nam From thomas.bub at thomson.net Mon Nov 13 00:37:16 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Mon, 13 Nov 2006 09:37:16 +0100 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. Message-ID: Sean, you got it! Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-) Don't ask me where I got that hop_limit from, it must have been an example I found somewhere. Can you explain why that hop_limit/is_global makes a difference in communication between gen1 and gen2? Does the counterpart need to have the same hop_limit? The path record values I use are queried from the OSM using a SERVICE_RECORD query followed by a path record query. I'm not using any alternate path record values, is this critical? In addition I enclose the values I pass into the ib_cm_send_req call. Can you please have a look and see if you find anything else that looks abnormal?
Thanks Thomas Bub
req_param.qp_type = IBV_QPT_RC;
req_param.qp_num = _dataQpNum;
req_param.starting_psn = _dataQpNum;
req_param.service_id = htonll(SERVICE_ID);
req_param.primary_path = &path_record;
req_param.alternate_path = NULL;
req_param.private_data = NULL;
req_param.private_data_len = 0;
req_param.responder_resources = 4;
req_param.initiator_depth = 4;
req_param.remote_cm_response_timeout = 20;
req_param.local_cm_response_timeout = 20;
req_param.retry_count = 7;
req_param.rnr_retry_count = 7;
req_param.max_cm_retries = 5;
path_record.sgid = _localGid;
path_record.dgid = _remoteGid;
path_record.slid = htons(_localLID);
path_record.dlid = htons(_remoteLID);
path_record.flow_label = 0;
path_record.hop_limit = 0;
path_record.traffic_class = 0;
path_record.pkey = 0xffff;
path_record.sl = 0;
path_record.rate = IBV_RATE_10_GBPS;
path_record.packet_life_time = 0;
path_record.mtu = IBV_MTU_2048;
From diego.guella at sircomtech.com Mon Nov 13 00:44:00 2006 From: diego.guella at sircomtech.com (Diego Guella) Date: Mon, 13 Nov 2006 09:44:00 +0100 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails References: <003201c70317$36911f40$05c8a8c0@DIEGO> <4551E7C5.8020700@dev.mellanox.co.il> <00a301c70347$accee4f0$05c8a8c0@DIEGO> <45534DD0.3010803@dev.mellanox.co.il> <013601c704a1$b86d11f0$05c8a8c0@DIEGO> <6a122cc00611120444h5ec7e401l17241077c30632fc@mail.gmail.com> Message-ID: <01dc01c706ff$df4c3890$05c8a8c0@DIEGO> Hello Moni, Thanks for your answer. I had installed: glibc-devel (2.5-17) glibc-devel-32bit (2.5-17) when I tried to install the OFED package. I saw on the mailing list that OFED 1.1.1 is available, should I try and install it? Thanks, Diego ----- Original Message ----- From: "Moni Levy" To: "Diego Guella" Cc: "Vladimir Sokolovsky" ; "Dotan Barak" ; Sent: Sunday, November 12, 2006 1:44 PM Subject: Re: [openib-general] Installation on openSUSE 10.2 Beta1 fails > On 11/10/06, Diego Guella wrote: >> Hi Vladimir, >> Thanks for your answer.
>> >> I have installed: >> >> compat-libstdc++ (version 5.0.7-35) >> libstdc++-32bit (version 4.1.2_20060705-2) >> libstdc++41 (version 4.1.2_20061024-3) >> libstdc++41-devel (version 4.1.2_20061024-3) >> libstdc++-devel (version 4.1.3-22) >> >> >> but remember that in the log file, first it says (line 6393): >> ----- >> checking for C compiler default output file name... a.out >> ----- >> >> and about 5000 lines below, it says my compiler can't create executables >> (of course this isn't true, because this is the machine on wich I compile >> all the programs I make) >> Have you got any other suggestion? > > Please try to install a 32 bit glibc-devel package. > > -- Moni > >> >> >> Thanks, >> Diego >> >> >> ----- Original Message ----- >> From: "Vladimir Sokolovsky" >> To: "Diego Guella" >> Cc: "Tziporet Koren" ; >> >> Sent: Thursday, November 09, 2006 4:48 PM >> Subject: Re: [openib-general] Installation on openSUSE 10.2 Beta1 fails >> >> >> > Hello Diego, >> > Check that you have libstdc++, libstdc++-devel and compat-libstdc++ >> > RPMs >> > installed. >> > >> > Regards, >> > Vladimir >> > >> > Diego Guella wrote: >> >> >> >> From: "Tziporet Koren" >> >>> The failing is utility is used for IPoIB high availability. If you >> >>> don't >> >>> need to use them you can just change this line in ofed.conf: >> >>> ipoibtools=n >> >>> >> >>> Tziporet >> >>> >> >> Thanks Tziporet for your answer. >> >> >> >> >> >> Tried just right now, i disabled ipoibtools. 
I get another, more >> >> strange >> >> error: >> >> (attached OFED.3816.log) >> >> ----- >> >> /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples >> >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs >> >> Running: >> >> ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> >> --disable-libcheck --prefix /usr/local/ofed --libdir >> >> /usr/local/ofed/lib >> >> CPPFLAGS="-I../libibverbs/include" >> >> configure: creating cache >> >> /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> >> checking for a BSD-compatible install... /usr/bin/install -c >> >> checking whether build environment is sane... yes >> >> checking for gawk... gawk >> >> checking whether make sets $(MAKE)... yes >> >> checking build system type... x86_64-unknown-linux-gnu >> >> checking host system type... x86_64-unknown-linux-gnu >> >> checking for style of include used by make... GNU >> >> checking for gcc... gcc >> >> checking for C compiler default output file name... configure: error: >> >> C >> >> compiler cannot create executables >> >> See `config.log' for more details. >> >> Failed to execute: >> >> ./configure --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >> >> --disable-libcheck --prefix /usr/local/ofed --libdir >> >> /usr/local/ofed/lib >> >> CPPFLAGS="-I../libibverbs/include" >> >> error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) >> >> ----- >> >> >> >> Am I right? It says my C compiler cannot create executables???? Is it >> >> joking me???? >> >> In the log file, line 6393, it says: >> >> ----- >> >> checking for C compiler default output file name... a.out >> >> ----- >> >> >> >> I don't understand....! >> >> Is there something I can do to fix this? 
>> >> >> >> >> >> Thanks, >> >> Diego >> >> ------------------------------------------------------------------------ >> >> >> >> _______________________________________________ >> >> openib-general mailing list >> >> openib-general at openib.org >> >> http://openib.org/mailman/listinfo/openib-general >> >> >> >> To unsubscribe, please visit >> >> http://openib.org/mailman/listinfo/openib-general >> > >> >> >> _______________________________________________ >> openib-general mailing list >> openib-general at openib.org >> http://openib.org/mailman/listinfo/openib-general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> >> From segher at kernel.crashing.org Mon Nov 13 00:55:13 2006 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Mon, 13 Nov 2006 09:55:13 +0100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <1163400939.4982.284.camel@localhost.localdomain> References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> <1163400337.7410.111.camel@localhost.localdomain> <1163400939.4982.284.camel@localhost.localdomain> Message-ID: <774BA04D-395B-4C48-BFA1-1CBE87EE2603@kernel.crashing.org> > What if we didn't try to solve a problem we don't have ? Yes exactly. > Have we yet encountered an HT device with that sort of bogus > capability > list ? Nope. So whatever fancy time-to-live scheme we come up with, we cannot even test it, and it performance no useful function. Nuke it :-) Segher From mst at mellanox.co.il Mon Nov 13 01:55:49 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Nov 2006 11:55:49 +0200 Subject: [openib-general] [PATCH] IB/ipoib: DMA alignment on ppc64 Message-ID: <20061113095549.GA22387@mellanox.co.il> Unaligned DMA is slow ppc64 systems - that's why this architecture overrides NET_IP_ALIGN. 
IPoIB should take this into account and align DMA on this platform. Signed-off-by: Michael S. Tsirkin --- Roland, what do you think? This comes from reading linux/skbuff.h and asm-powerpc/system.h - I don't actually have a ppc64 to test the performance impact. diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 8bf5e9e..cec1cb4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -125,7 +125,7 @@ static int ipoib_alloc_rx_skb(struct net struct sk_buff *skb; dma_addr_t addr; - skb = dev_alloc_skb(IPOIB_BUF_SIZE + 4); + skb = dev_alloc_skb(IPOIB_BUF_SIZE + (NET_IP_ALIGN ? 4 : 0)); if (!skb) return -ENOMEM; @@ -133,8 +133,12 @@ static int ipoib_alloc_rx_skb(struct net * IB will leave a 40 byte gap for a GRH and IPoIB adds a 4 byte * header. So we need 4 more bytes to get to 48 and align the * IP header to a multiple of 16. + * + * But don't do this on architectures that override NET_IP_ALIGN - + * see comment in linux/skbuff.h for reasons why. */ - skb_reserve(skb, 4); + if (NET_IP_ALIGN) + skb_reserve(skb, 4); addr = dma_map_single(priv->ca->dma_device, skb->data, IPOIB_BUF_SIZE, -- MST From ogerlitz at voltaire.com Mon Nov 13 02:48:55 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 13 Nov 2006 12:48:55 +0200 Subject: [openib-general] compile error In-Reply-To: <20061113100905.21019.qmail@web36915.mail.mud.yahoo.com> References: <20061113100905.21019.qmail@web36915.mail.mud.yahoo.com> Message-ID: <45584D97.5050807@voltaire.com> zhu shi song wrote: > how can I git latest tree? git clone git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git Or. From bhartner at us.ibm.com Mon Nov 13 04:57:57 2006 From: bhartner at us.ibm.com (Bill Hartner) Date: Mon, 13 Nov 2006 06:57:57 -0600 Subject: [openib-general] uDAPL - using both ports of HCA Message-ID: I have both ports of the HCA connected to a Voltaire switch for both client and server.  
I have no luck trying to use the 2nd port with uDAPL.  Is it possible to use the 2nd port ? I get this on the server when trying to accept the connection; dapl_cr_accept (0x350f610, 0x350f7e0, 0, (nil))  accept(cr 0x350f610 conn 0x350f4b0, id 0x350f350, p_data (nil), p_sz=0)  accept: ERR dev(0x350c270!=0x350c270) or port mismatch(2!=1)  destroy_conn: conn 0x350f4b0 id 55636816 dat_cr_accept[262144] failed 0 derr 40000 DAT_INTERNAL_ERROR DAPL: Stopped (dapl_fini) -Bill From ogerlitz at voltaire.com Mon Nov 13 05:07:19 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 13 Nov 2006 15:07:19 +0200 Subject: [openib-general] [PATCH] IB/ipoib: DMA alignment on ppc64 In-Reply-To: <20061113095549.GA22387@mellanox.co.il> References: <20061113095549.GA22387@mellanox.co.il> Message-ID: <45586E07.4020108@voltaire.com> Michael S. Tsirkin wrote: > Unaligned DMA is slow ppc64 systems - that's why this architecture > overrides NET_IP_ALIGN. IPoIB should take this into account and align > DMA on this platform. Is this relevant only to IPoIB/SKB's or also to any other IB ULP (eg SDP, iSER, SRP) which allocates its rx/tx buffers directly by calling kmalloc ? Or. From mst at mellanox.co.il Mon Nov 13 05:30:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Mon, 13 Nov 2006 15:30:52 +0200 Subject: [openib-general] [PATCH] IB/ipoib: DMA alignment on ppc64 In-Reply-To: <45586E07.4020108@voltaire.com> References: <45586E07.4020108@voltaire.com> Message-ID: <20061113133052.GC29113@mellanox.co.il> Quoting r. Or Gerlitz : > Subject: Re: [openib-general] [PATCH] IB/ipoib: DMA alignment on ppc64 > > Michael S. Tsirkin wrote: > > Unaligned DMA is slow ppc64 systems - that's why this architecture > > overrides NET_IP_ALIGN. IPoIB should take this into account and align > > DMA on this platform. > > Is this relevant only to IPoIB/SKB's or also to any other IB ULP (eg > SDP, iSER, SRP) which allocates its rx/tx buffers directly by calling > kmalloc ? 
I would expect the unaligned DMA penalty to apply to all ULPs. Look up NET_IP_ALIGN and see for yourself. -- MST From halr at voltaire.com Mon Nov 13 05:05:50 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Nov 2006 15:05:50 +0200 Subject: [openib-general] [PATCH] osm: comparing InformInfo records References: <454EDD97.5060000@dev.mellanox.co.il> <1162907368.25771.37532.camel@hal.voltaire.com> <45572218.4040801@dev.mellanox.co.il> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F50189442E@taurus.voltaire.com> Hi Yevgeny, See embedded comments below. -- Hal ________________________________ From: Yevgeny Kliteynik [mailto:kliteyn at dev.mellanox.co.il] Sent: Sun 11/12/2006 8:31 AM To: Hal Rosenstock Cc: OPENIB Subject: Re: [PATCH] osm: comparing InformInfo records Hal Rosenstock wrote: > Hi Yevgeny, > > On Mon, 2006-11-06 at 02:00, Yevgeny Kliteynik wrote: >> Hi Hal >> >> [From Vu Pham] >> 1. sending InformInfo set subscribe for trap 64,65,144 - this works; >> however, osm.log outputs wrong value for "subscribe" field > > What code issues these subscriptions ? How was this patch tested ? Unfortunately, we don't have this kind of test in the testbase, so this patch hasn't been tested. OK but it has been tested with some ULP, right ? (It's not just code inspection.) >> 2. sending InformInfo set *unsubscribe* for >> trap 64,65,144 - I'm using/formatting the same mad as (1) except the >> "subscribe" field is zero; however, opensm responds with status 0x200 >> [/From Vu Pham] >> >> 1. The received InformInfo struct was modified before dumping it. >> This was fixed as part of the second issue. >> 2. The function that compares InformInfo structures was just comparing >> the whole memory allocated for it, including reserved fields. >> Fixed to compare more selectively.
>> >> Yevgeny >> >> Signed-off-by: Yevgeny Kliteynik >> >> Index: opensm/osm_sa_informinfo.c >> =================================================================== >> --- opensm/osm_sa_informinfo.c (revision 10064) >> +++ opensm/osm_sa_informinfo.c (working copy) >> @@ -345,7 +345,6 @@ osm_infr_rcv_process_set_method( >> ib_inform_info_t *p_recvd_inform_info; >> osm_infr_t inform_info_rec; /* actual inform record to be stored for reports */ >> osm_infr_t *p_infr; >> - uint8_t subscribe; >> ib_net32_t qpn; >> uint8_t resp_time_val; >> ib_api_status_t res; >> @@ -403,19 +402,11 @@ osm_infr_rcv_process_set_method( >> * >> * QPN: >> * internally we keep the QPN field of the InformInfo updated >> - * so we can simply compare the entire record - when finding such. >> - * IBA spec only requires the QPN field to be filled when an unsubscribe >> - * Set(InformInfo) is done. See table 119 p 740 QPN field >> - * >> - * SUBSCRIBE: >> - * For similar reasons we change the subscribe to 0 on the >> - * inserted/searched data >> + * so we can simply compare it in the record - when finding such. 
>> */ >> >> - subscribe = p_recvd_inform_info->subscribe; >> - if (subscribe) >> + if (p_recvd_inform_info->subscribe) >> { >> - inform_info_rec.inform_record.inform_info.subscribe = 0; >> ib_inform_info_set_qpn( >> &inform_info_rec.inform_record.inform_info, >> inform_info_rec.report_addr.addr_type.gsi.remote_qp ); >> @@ -443,7 +434,7 @@ osm_infr_rcv_process_set_method( >> p_infr = osm_infr_get_by_rec( p_rcv->p_subn, p_rcv->p_log, &inform_info_rec ); >> >> /* check to see if the request was for subscribe = 1 */ >> - if (subscribe) >> + if (p_recvd_inform_info->subscribe) >> { >> /* validate the request for a new or update InformInfo */ >> if (__validate_infr( p_rcv, &inform_info_rec ) != TRUE) >> @@ -480,6 +471,8 @@ osm_infr_rcv_process_set_method( >> goto Exit; >> } >> >> + /* set the subscribe bit to 0 before adding the record */ >> + p_infr->inform_record.inform_info.subscribe = 0; > > It seems odd to me to set subscribe to 0 for a subscription (rather than > when it is an unsubscription). Aren't only subscriptions kept in the > database ? Is this an artifact of the matching code ? If so, why not > change that ? You're right. Previously the zero value was used for comparing the whole record to the unsubscribe request (which carries 0 in the 'subscribe' field). Now that the comparing function has been changed, there is no need to keep zeroing this field. >> /* Add this new osm_infr_t object to subnet object */ >> osm_infr_insert_to_db( p_rcv->p_subn, p_rcv->p_log, p_infr ); >> >> @@ -488,6 +481,8 @@ osm_infr_rcv_process_set_method( >> { >> /* Update the old instance of the osm_infr_t object */ >> p_infr->inform_record = inform_info_rec.inform_record; >> + /* set the subscribe bit to 0 after updating the record */ >> + p_infr->inform_record.inform_info.subscribe = 0; > > Same as previous comment.
> >> } >> } >> else >> Index: opensm/osm_inform.c >> =================================================================== >> --- opensm/osm_inform.c (revision 10064) >> +++ opensm/osm_inform.c (working copy) >> @@ -206,30 +206,133 @@ __match_inf_rec( >> osm_infr_t* p_infr_rec = (osm_infr_t *)context; >> osm_infr_t* p_infr = (osm_infr_t*)p_list_item; >> osm_log_t *p_log = p_infr_rec->p_infr_rcv->p_log; >> - cl_status_t status; >> - int32_t count1, count2; >> + cl_status_t status = CL_NOT_FOUND; >> + ib_gid_t all_zero_gid; >> + >> >> OSM_LOG_ENTER( p_log, __match_inf_rec); >> >> - count1 = memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, >> - sizeof(p_infr_rec->report_addr)); >> - if (count1) >> - osm_log( p_log, OSM_LOG_DEBUG, >> - "__match_inf_rec: " >> - "Differ by Address\n" ); >> - count2 = memcmp( >> - &p_infr->inform_record.inform_info, >> - &p_infr_rec->inform_record.inform_info, >> - sizeof(p_infr->inform_record.inform_info) ); >> - if (count2) >> - osm_log( p_log, OSM_LOG_DEBUG, >> - "__match_inf_rec: " >> - "Differ by InformInfo\n" ); >> - if ((count1 == 0) && (count2 == 0)) >> - status = CL_SUCCESS; >> + if ( !memcmp(&p_infr->report_addr, >> + &p_infr_rec->report_addr, >> + sizeof(p_infr_rec->report_addr)) ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by Address\n" ); >> + goto Exit; >> + } >> + >> + memset(&all_zero_gid, 0, sizeof(ib_gid_t)); >> + >> + /* if inform_info.gid is not zero, ignoring lid range */ >> + if ( !memcmp(&p_infr_rec->inform_record.inform_info.gid, >> + &all_zero_gid, >> + sizeof(p_infr_rec->inform_record.inform_info.gid)) ) >> + { >> + if ( !memcmp(&p_infr->inform_record.inform_info.gid, >> + &p_infr_rec->inform_record.inform_info.gid, >> + sizeof(p_infr->inform_record.inform_info.gid)) ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.gid\n" ); >> + goto Exit; >> + } >> + } >> else >> - status = CL_NOT_FOUND; >> + { >> + if ( 
(p_infr->inform_record.inform_info.lid_range_begin != >> + p_infr_rec->inform_record.inform_info.lid_range_begin) || >> + (p_infr->inform_record.inform_info.lid_range_end != >> + p_infr_rec->inform_record.inform_info.lid_range_end) ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.LIDRange\n" ); >> + goto Exit; >> + } >> + } >> + >> + if ( p_infr->inform_record.inform_info.is_generic != >> + p_infr_rec->inform_record.inform_info.is_generic ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.IsGeneric\n" ); >> + goto Exit; >> + } >> >> + if ( p_infr->inform_record.inform_info.trap_type != >> + p_infr_rec->inform_record.inform_info.trap_type ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.TrapType\n" ); >> + goto Exit; >> + } >> + >> + if ( p_infr->inform_record.inform_info.is_generic != >> + p_infr_rec->inform_record.inform_info.is_generic ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.IsGeneric\n" ); >> + } > > This appears to be a duplicate of what was added shortly earlier in this > patch. Right, good catch. I'll send a V2 of this patch shortly. I'll try to review it this week. I won't get to integrating it until the weekend or next week. Thanks. 
-- Yevgeny > -- Hal > >> + else if (p_infr->inform_record.inform_info.is_generic) >> + { >> + if ( p_infr->inform_record.inform_info.g_or_v.generic.trap_num != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.trap_num ) >> + { >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.TrapNumber\n" ); >> + goto Exit; >> + } >> + else if ( p_infr->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.QPNRespTimeVal\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_msb != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_msb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.NodeTypeMSB\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_lsb != >> + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_lsb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Generic.NodeTypeLSB\n" ); >> + else >> + status = CL_SUCCESS; >> + } >> + else >> + { >> + if ( p_infr->inform_record.inform_info.g_or_v.vend.dev_id != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.dev_id ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Vendor.DeviceID\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Vendor.QPNRespTimeVal\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_msb != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_msb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by 
InformInfo.Vendor.VendorIdMSB\n" ); >> + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_lsb != >> + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_lsb ) >> + osm_log( p_log, OSM_LOG_DEBUG, >> + "__match_inf_rec: " >> + "Differ by InformInfo.Vendor.VendorIdLSB\n" ); >> + else >> + status = CL_SUCCESS; >> + } >> + >> + Exit: >> OSM_LOG_EXIT( p_log ); >> return status; >> } >> > From vlad at dev.mellanox.co.il Mon Nov 13 06:03:49 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 13 Nov 2006 16:03:49 +0200 Subject: [openib-general] Installation on openSUSE 10.2 Beta1 fails In-Reply-To: <01dc01c706ff$df4c3890$05c8a8c0@DIEGO> References: <003201c70317$36911f40$05c8a8c0@DIEGO> <4551E7C5.8020700@dev.mellanox.co.il> <00a301c70347$accee4f0$05c8a8c0@DIEGO> <45534DD0.3010803@dev.mellanox.co.il> <013601c704a1$b86d11f0$05c8a8c0@DIEGO> <6a122cc00611120444h5ec7e401l17241077c30632fc@mail.gmail.com> <01dc01c706ff$df4c3890$05c8a8c0@DIEGO> Message-ID: <45587B45.2060709@dev.mellanox.co.il> Hi Diego, OFED-1.1.1 includes fixed libehca package. So, if you are not going to use eHCA driver then you can stay with OFED-1.1. Regards, Vladimir Diego Guella wrote: > Hello Moni, > Thanks for your answer. > > I had installed: > glibc-devel (2.5-17) > glibc-devel-32bit (2.5-17) > > when i tried to install the OFED package. > > I saw in the mailing list that is available OFED 1.1.1, should I try > and install it? > > > Thanks, > Diego > > > ----- Original Message ----- From: "Moni Levy" > To: "Diego Guella" > Cc: "Vladimir Sokolovsky" ; "Dotan Barak" > ; > Sent: Sunday, November 12, 2006 1:44 PM > Subject: Re: [openib-general] Installation on openSUSE 10.2 Beta1 fails > > >> On 11/10/06, Diego Guella wrote: >>> Hi Vladimir, >>> Thanks for your answer. 
>>> >>> I have installed: >>> >>> compat-libstdc++ (version 5.0.7-35) >>> libstdc++-32bit (version 4.1.2_20060705-2) >>> libstdc++41 (version 4.1.2_20061024-3) >>> libstdc++41-devel (version 4.1.2_20061024-3) >>> libstdc++-devel (version 4.1.3-22) >>> >>> >>> but remember that in the log file, first it says (line 6393): >>> ----- >>> checking for C compiler default output file name... a.out >>> ----- >>> >>> and about 5000 lines below, it says my compiler can't create >>> executables >>> (of course this isn't true, because this is the machine on wich I >>> compile >>> all the programs I make) >>> Have you got any other suggestion? >> >> Please try to install a 32 bit glibc-devel package. >> >> -- Moni >> >>> >>> >>> Thanks, >>> Diego >>> >>> >>> ----- Original Message ----- >>> From: "Vladimir Sokolovsky" >>> To: "Diego Guella" >>> Cc: "Tziporet Koren" ; >>> >>> Sent: Thursday, November 09, 2006 4:48 PM >>> Subject: Re: [openib-general] Installation on openSUSE 10.2 Beta1 fails >>> >>> >>> > Hello Diego, >>> > Check that you have libstdc++, libstdc++-devel and >>> compat-libstdc++ > RPMs >>> > installed. >>> > >>> > Regards, >>> > Vladimir >>> > >>> > Diego Guella wrote: >>> >> >>> >> From: "Tziporet Koren" >>> >>> The failing is utility is used for IPoIB high availability. If >>> you >>> don't >>> >>> need to use them you can just change this line in ofed.conf: >>> >>> ipoibtools=n >>> >>> >>> >>> Tziporet >>> >>> >>> >> Thanks Tziporet for your answer. >>> >> >>> >> >>> >> Tried just right now, i disabled ipoibtools. 
I get another, more >>> >> strange >>> >> error: >>> >> (attached OFED.3816.log) >>> >> ----- >>> >> /bin/rm -f /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >>> >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/examples >>> >> cd /var/tmp/OFEDRPM/BUILD/openib-1.1/src/userspace/libibverbs >>> >> Running: >>> >> ./configure >>> --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >>> >> --disable-libcheck --prefix /usr/local/ofed --libdir >> >>> /usr/local/ofed/lib >>> >> CPPFLAGS="-I../libibverbs/include" >>> >> configure: creating cache >>> >> /var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >>> >> checking for a BSD-compatible install... /usr/bin/install -c >>> >> checking whether build environment is sane... yes >>> >> checking for gawk... gawk >>> >> checking whether make sets $(MAKE)... yes >>> >> checking build system type... x86_64-unknown-linux-gnu >>> >> checking host system type... x86_64-unknown-linux-gnu >>> >> checking for style of include used by make... GNU >>> >> checking for gcc... gcc >>> >> checking for C compiler default output file name... configure: >>> error: >> C >>> >> compiler cannot create executables >>> >> See `config.log' for more details. >>> >> Failed to execute: >>> >> ./configure >>> --cache-file=/var/tmp/OFEDRPM/BUILD/openib-1.1/configure.cache >>> >> --disable-libcheck --prefix /usr/local/ofed --libdir >> >>> /usr/local/ofed/lib >>> >> CPPFLAGS="-I../libibverbs/include" >>> >> error: Bad exit status from /var/tmp/rpm-tmp.46102 (%install) >>> >> ----- >>> >> >>> >> Am I right? It says my C compiler cannot create executables???? >>> Is it >>> >> joking me???? >>> >> In the log file, line 6393, it says: >>> >> ----- >>> >> checking for C compiler default output file name... a.out >>> >> ----- >>> >> >>> >> I don't understand....! >>> >> Is there something I can do to fix this? 
>>> >> >>> >> >>> >> Thanks, >>> >> Diego >>> > >>> >>> From halr at voltaire.com Mon Nov 13 06:09:51 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Nov 2006 16:09:51 +0200 Subject: [openib-general] [PATCH v2] osm: comparing InformInfo records References: <455724C4.9080806@dev.mellanox.co.il> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F50189442F@taurus.voltaire.com> See embedded comment below.
________________________________ From: Yevgeny Kliteynik [mailto:kliteyn at dev.mellanox.co.il] Sent: Sun 11/12/2006 8:42 AM To: Hal Rosenstock; OPENIB Subject: [PATCH v2] osm: comparing InformInfo records Hi Hal Here's the fixed InformInfo patch Yevgeny Signed-off-by: Yevgeny Kliteynik Index: opensm/osm_inform.c =================================================================== --- opensm/osm_inform.c (revision 10100) +++ opensm/osm_inform.c (working copy) @@ -206,30 +206,123 @@ __match_inf_rec( osm_infr_t* p_infr_rec = (osm_infr_t *)context; osm_infr_t* p_infr = (osm_infr_t*)p_list_item; osm_log_t *p_log = p_infr_rec->p_infr_rcv->p_log; - cl_status_t status; - int32_t count1, count2; + cl_status_t status = CL_NOT_FOUND; + ib_gid_t all_zero_gid; OSM_LOG_ENTER( p_log, __match_inf_rec); - count1 = memcmp(&p_infr->report_addr, &p_infr_rec->report_addr, - sizeof(p_infr_rec->report_addr)); - if (count1) - osm_log( p_log, OSM_LOG_DEBUG, - "__match_inf_rec: " - "Differ by Address\n" ); - count2 = memcmp( - &p_infr->inform_record.inform_info, - &p_infr_rec->inform_record.inform_info, - sizeof(p_infr->inform_record.inform_info) ); - if (count2) - osm_log( p_log, OSM_LOG_DEBUG, - "__match_inf_rec: " - "Differ by InformInfo\n" ); - if ((count1 == 0) && (count2 == 0)) - status = CL_SUCCESS; + if ( !memcmp(&p_infr->report_addr, + &p_infr_rec->report_addr, + sizeof(p_infr_rec->report_addr)) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by Address\n" ); + goto Exit; + } + + memset(&all_zero_gid, 0, sizeof(ib_gid_t)); + + /* if inform_info.gid is not zero, ignoring lid range */ + if ( !memcmp(&p_infr_rec->inform_record.inform_info.gid, + &all_zero_gid, + sizeof(p_infr_rec->inform_record.inform_info.gid)) ) + { + if ( !memcmp(&p_infr->inform_record.inform_info.gid, + &p_infr_rec->inform_record.inform_info.gid, + sizeof(p_infr->inform_record.inform_info.gid)) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.gid\n" ); + 
goto Exit; + } + } + else + { + if ( (p_infr->inform_record.inform_info.lid_range_begin != + p_infr_rec->inform_record.inform_info.lid_range_begin) || + (p_infr->inform_record.inform_info.lid_range_end != + p_infr_rec->inform_record.inform_info.lid_range_end) ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.LIDRange\n" ); + goto Exit; + } + } + + if ( p_infr->inform_record.inform_info.trap_type != + p_infr_rec->inform_record.inform_info.trap_type ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.TrapType\n" ); + goto Exit; + } + + if ( p_infr->inform_record.inform_info.is_generic != + p_infr_rec->inform_record.inform_info.is_generic ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.IsGeneric\n" ); + } + else if (p_infr->inform_record.inform_info.is_generic) + { + if ( p_infr->inform_record.inform_info.g_or_v.generic.trap_num != + p_infr_rec->inform_record.inform_info.g_or_v.generic.trap_num ) + { + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.TrapNumber\n" ); + goto Exit; + } + else if ( p_infr->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val != + p_infr_rec->inform_record.inform_info.g_or_v.generic.qpn_resp_time_val ) Isn't QPN supposed to be ignored on an unsubscribe request ? 
+ osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.QPNRespTimeVal\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_msb != + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_msb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.NodeTypeMSB\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.generic.node_type_lsb != + p_infr_rec->inform_record.inform_info.g_or_v.generic.node_type_lsb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Generic.NodeTypeLSB\n" ); + else + status = CL_SUCCESS; + } else - status = CL_NOT_FOUND; + { + if ( p_infr->inform_record.inform_info.g_or_v.vend.dev_id != + p_infr_rec->inform_record.inform_info.g_or_v.vend.dev_id ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.DeviceID\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val != + p_infr_rec->inform_record.inform_info.g_or_v.vend.qpn_resp_time_val ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.QPNRespTimeVal\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_msb != + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_msb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.VendorIdMSB\n" ); + else if ( p_infr->inform_record.inform_info.g_or_v.vend.vendor_id_lsb != + p_infr_rec->inform_record.inform_info.g_or_v.vend.vendor_id_lsb ) + osm_log( p_log, OSM_LOG_DEBUG, + "__match_inf_rec: " + "Differ by InformInfo.Vendor.VendorIdLSB\n" ); + else + status = CL_SUCCESS; + } + Exit: OSM_LOG_EXIT( p_log ); return status; } Index: opensm/osm_sa_informinfo.c =================================================================== --- opensm/osm_sa_informinfo.c (revision 10100) +++ opensm/osm_sa_informinfo.c (working copy) @@ -345,7 +345,6 @@ osm_infr_rcv_process_set_method( 
ib_inform_info_t *p_recvd_inform_info; osm_infr_t inform_info_rec; /* actual inform record to be stored for reports */ osm_infr_t *p_infr; - uint8_t subscribe; ib_net32_t qpn; uint8_t resp_time_val; ib_api_status_t res; @@ -403,19 +402,11 @@ osm_infr_rcv_process_set_method( * * QPN: * internally we keep the QPN field of the InformInfo updated - * so we can simply compare the entire record - when finding such. - * IBA spec only requires the QPN field to be filled when an unsubscribe - * Set(InformInfo) is done. See table 119 p 740 QPN field - * - * SUBSCRIBE: - * For similar reasons we change the subscribe to 0 on the - * inserted/searched data + * so we can simply compare it in the record - when finding such. */ - subscribe = p_recvd_inform_info->subscribe; - if (subscribe) + if (p_recvd_inform_info->subscribe) { - inform_info_rec.inform_record.inform_info.subscribe = 0; ib_inform_info_set_qpn( &inform_info_rec.inform_record.inform_info, inform_info_rec.report_addr.addr_type.gsi.remote_qp ); @@ -443,7 +434,7 @@ osm_infr_rcv_process_set_method( p_infr = osm_infr_get_by_rec( p_rcv->p_subn, p_rcv->p_log, &inform_info_rec ); /* check to see if the request was for subscribe = 1 */ - if (subscribe) + if (p_recvd_inform_info->subscribe) { /* validate the request for a new or update InformInfo */ if (__validate_infr( p_rcv, &inform_info_rec ) != TRUE) From tom at opengridcomputing.com Mon Nov 13 06:12:19 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 13 Nov 2006 08:12:19 -0600 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: Message-ID: On 11/12/06 11:30 PM, "Krishna Kumar2" wrote: > Hi Tom & Sean, > >>> >>> There may be a race here, but... Why wouldn't the second call into >>> cm_work_handler simply find the list empty on entry into the call? >> >> Basically, you've got a free work queue element sitting on the iwcm_wq. 
> What >> typically would happen is you'd end up corrupting the list because the >> cm_event_handler would enqueue the element, it would get freed in >> cm_work_handler with put_work, then cm_event_handler would call get_work >> (getting the one just freed that's also sitting on the iwcm_wq list) and > ... >> bad things happen. > > Actually I had anticipated this possible problem when I submitted the > patch, and I > have explained (given below) why it is not a problem : > > "Doing the redundant queue_work() (if cm_work_handler is already running > processing the last entry) will not result in another call to > cm_work_handler > (run_workqueue) where no entry is found, since cm_work_handler will remove > all entries from the list, even ones that are added late". > > Isn't that correct ? No, to understand why go look at the implementation of queue_work. BTW, this is not a hypothetical race, it's a bug that I fixed. Sean's approach is the best, but I need to handle the cleanup of the work_list when the connection is going away. > So if cm_work_handler() is already running and > processing > the LAST entry (anything but the last entry will not have an issue as a > new > queue_work would not be done by iwcm), it will next find this new entry in > it's > current run iteration, and process it. Meanwhile iwcm had done a redundant > "queue_work()" on this queue, which, besides adding the new entry to the > workqueue, also does a wakeup of "worker_thread" (which is still running > the > previous iteration of run_workqueue -> cm_work_handler). When > cm_work_handler finishes removing this new entry, it returns to > worker_thread, > which will do a schedule() which gets woken up again immediately due to > the > redundant "queue_work" done earlier, but then it checks whether the list > is empty > and since it is empty, it does another "schedule()" call. 
So that is what > I meant by > saying that another call to cm_work_handler() will NOT result (and where > that > redundant call would find no entry to process). > > So I feel this patch is correct in its original form. Comments or did I > misunderstand > the kernel code completely ? > > Thanks, > > - KK > >> >>> As an >>> alternative, could you defer the list_del_init() call to the end of > the loop, >>> which would avoid scheduling cm_work_handler while it's running? >> >> Yeah, that's a good idea. >> >>> >>> - Sean >> >> > From jlentini at netapp.com Mon Nov 13 06:55:01 2006 From: jlentini at netapp.com (James Lentini) Date: Mon, 13 Nov 2006 09:55:01 -0500 (EST) Subject: [openib-general] [PATCH] IB/documentation - add new file to Documentation/infiniband In-Reply-To: <1163199092.2508.61.camel@brick.pathscale.com> References: <1163108454.2508.29.camel@brick.pathscale.com> <1163182394.2508.42.camel@brick.pathscale.com> <1163190808.2508.55.camel@brick.pathscale.com> <1163199092.2508.61.camel@brick.pathscale.com> Message-ID: On Fri, 10 Nov 2006, Ralph Campbell wrote: > I must have woken up on the wrong side of the bed today :-) > Perhaps I am remembering it being changed from in to in/out since > the code obviously is in/out now. > How about: > > The 'iova' argument is an in/out parameter which can be used by the > caller to request an address to associate with the first byte of the > memory region. Upon return, the addresses used with struct ib_sge > 'addr' and struct ib_send_wr 'remote_addr' are thus 'iova' plus an > offset within the length of the memory region. That looks great.
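[Editor's sketch] The 'iova plus an offset' wording proposed above can be illustrated with a few lines of plain C. This is only a sketch under stated assumptions: struct mr_window and mr_offset_to_addr() are hypothetical names invented for illustration and are not part of the kernel verbs API; the only thing taken from the thread is that the address of a byte inside a registered region is 'iova' plus an offset smaller than the region length.

```c
#include <stdint.h>

/* Hypothetical description of a registered region (illustration only). */
struct mr_window {
    uint64_t iova;    /* address associated with the first byte of the region */
    uint64_t length;  /* length of the region in bytes */
};

/*
 * Hypothetical helper: compute the address that would go into an SGE 'addr'
 * or a work request 'remote_addr' for the range [offset, offset + len).
 * Returns 0 and stores iova + offset in *addr when the range lies inside
 * the region, or -1 when it does not.
 */
static int mr_offset_to_addr(const struct mr_window *mr, uint64_t offset,
                             uint64_t len, uint64_t *addr)
{
    /* Written to avoid overflow in offset + len. */
    if (offset > mr->length || len > mr->length - offset)
        return -1;

    *addr = mr->iova + offset;
    return 0;
}
```

With a region of 4096 bytes whose iova is 0x10000, the last valid one-byte access is at offset 4095; any range that reaches past the region length is rejected rather than turned into an address.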
From ebiederm at xmission.com Mon Nov 13 07:46:54 2006 From: ebiederm at xmission.com (ebiederm at xmission.com) Date: Mon, 13 Nov 2006 08:46:54 -0700 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: <774BA04D-395B-4C48-BFA1-1CBE87EE2603@kernel.crashing.org> (Segher Boessenkool's message of "Mon, 13 Nov 2006 09:55:13 +0100") References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> <1163400337.7410.111.camel@localhost.localdomain> <1163400939.4982.284.camel@localhost.localdomain> <774BA04D-395B-4C48-BFA1-1CBE87EE2603@kernel.crashing.org> Message-ID: Segher Boessenkool writes: >> What if we didn't try to solve a problem we don't have ? > > Yes exactly. > >> Have we yet encountered an HT device with that sort of bogus capability >> list ? > > Nope. So whatever fancy time-to-live scheme we come > up with, we cannot even test it, and it performance > no useful function. Nuke it :-) If we really need to we can put a loop detector in pci_find_capability by looking ahead a little. So we really should not even need to change the API to do this. So nuke it and if it is important have pci_find_capability bail if there are loops in the capability chain. Eric From mshefty at ichips.intel.com Mon Nov 13 08:26:25 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Mon, 13 Nov 2006 08:26:25 -0800 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. In-Reply-To: References: Message-ID: <45589CB1.7020307@ichips.intel.com> Bub Thomas wrote: > Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-) > Don't ask me where I got that hop_limit from, it must have been an > example I found somewhere. > Can you explain why that hop_limit/is_global makes a difference in > communication between gen1 and gen2? 
Does the counterpart need to have > the same hop_limit? The gen2 stack uses a hop_limit > 0 to indicate that global routing is being used. If the hop_limit is > 0, then the global routing information must be valid. > The path record values I use are queried from the OSM using a > SERVICE_RECORD query followed by a path record query. > I'm not using any alternate path record values, is this critical? Everything is supposed to work with the path records returned from the SM. I was wondering if you were querying for the path record, modifying the returned value, or creating a path record yourself. > path_record.packet_life_time = 0; I would set this higher (maybe between 12-20, with: 18 = 1 second, 19 = 2 seconds, 20 = 4 seconds, etc.) From brice at myri.com Mon Nov 13 09:25:55 2006 From: brice at myri.com (Brice Goglin) Date: Mon, 13 Nov 2006 18:25:55 +0100 Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c In-Reply-To: References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> <1163400337.7410.111.camel@localhost.localdomain> <1163400939.4982.284.camel@localhost.localdomain> <774BA04D-395B-4C48-BFA1-1CBE87EE2603@kernel.crashing.org> Message-ID: <4558AAA3.6060606@myri.com> Eric W. Biederman wrote: > Segher Boessenkool writes: > > >>> What if we didn't try to solve a problem we don't have ? >>> >> Yes exactly. >> >> >>> Have we yet encountered an HT device with that sort of bogus capability >>> list ? >>> >> Nope. So whatever fancy time-to-live scheme we come >> up with, we cannot even test it, and it performance >> no useful function. Nuke it :-) >> > > If we really need to we can put a loop detector in pci_find_capability > by looking ahead a little. So we really should not even need to change > the API to do this. 
> > So nuke it and if it is important have pci_find_capability bail if > there are loops in the capability chain. > > Eric > How do you want to detect the following loop in pci_find_capability() without changing the API? any cap -> any cap -> one HT cap -> any cap -> back to first HT cap When looking for a HT cap, pci_find_capability() will always succeed, it will never loop forever. But, pci_find_ht_capability() will loop forever if it is looking for a different HT cap than the one we have in the chain. When we add a new function looping on top of the existing functions, we can always find a brain-damaged chain which makes the new function loop forever while the existing ones do not... That's why I had a new TTL in msi_ht_cap_enabled() in drivers/pci/quirks.c even if I was using pci_find_next_capability() which has its own TTL. Anyway, I agree that protecting against bugs that we've never seen before is not that important. I was just thinking that protecting is very easy, while debugging might be boring once we find such a broken PCI chain :) Brice From RAISCH at de.ibm.com Mon Nov 13 08:40:52 2006 From: RAISCH at de.ibm.com (Christoph Raisch) Date: Mon, 13 Nov 2006 17:40:52 +0100 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: <17748.15442.906060.210242@cargo.ozlabs.ibm.com> Message-ID: > Christoph Raisch writes: > > > The patch is needed. We've seen it on the real system. We did fix it on the > > real system. > > I disagree that the ioremap change is needed. > > > ...and it conforms to theory... although theory is a bit confusing here. 
> > > > let me try to summarize: > > ioremap checks for 64k boundary (actually page boundary) > > Actually, ioremap itself already does the calculations that your patch > adds - that is, it generates the offset within the page and the > physical address of the start of the page, does the mapping using the > latter, then adds on the offset to the virtual address of the page and > returns that. Paul, you are right. The calculation is done in your code already. We can't reproduce the problem anymore on latest kernel. Was this calculation there in ioremap right from the start with 64k on POWER or added later on? So Roland, feel free to ignore that line where we do the calculation. > > Paul. From rdreier at cisco.com Mon Nov 13 08:39:58 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Nov 2006 08:39:58 -0800 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: <17748.15442.906060.210242@cargo.ozlabs.ibm.com> (Paul Mackerras's message of "Fri, 10 Nov 2006 19:46:10 +1100") References: <17748.15442.906060.210242@cargo.ozlabs.ibm.com> Message-ID: > > The patch is needed. We've seen it on the real system. We did fix it on the > > real system. > > I disagree that the ioremap change is needed. Hmm... Paul, what you say makes sense and is what I would have thought, but Christoph says that the unpatched code really fails on a real system. So I'm still confused. I think I'll merge this with a fat comment, with the hope that we can drop it ASAP once everyone agrees on what's going on. - R. From rdreier at cisco.com Mon Nov 13 08:40:48 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Nov 2006 08:40:48 -0800 Subject: [openib-general] [PATCH 2.6.19 2/4] ehca: hcp_phyp.c: correct page mapping in 64k page mode In-Reply-To: (Christoph Raisch's message of "Mon, 13 Nov 2006 17:40:52 +0100") References: Message-ID: > So Roland, feel free to ignore that line where we do the calculation. 
OK, ignore the email I just sent. I'll drop the patch.

thanks

From rdreier at cisco.com Mon Nov 13 08:42:53 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 13 Nov 2006 08:42:53 -0800
Subject: [openib-general] [PATCH] IB/ipoib: DMA alignment on ppc64
In-Reply-To: <20061113095549.GA22387@mellanox.co.il> (Michael S. Tsirkin's message of "Mon, 13 Nov 2006 11:55:49 +0200")
References: <20061113095549.GA22387@mellanox.co.il>
Message-ID:

I don't think this is what's needed. The GRH leaves a gap of 40, so getting rid of the skb_reserve() just means that DMA will start at an offset of 40 rather than 44. I think you need to reserve enough to get to a full cacheline boundary, but I can't remember if that's 64 or 128 bytes.

- R.

From rdreier at cisco.com Mon Nov 13 08:45:52 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 13 Nov 2006 08:45:52 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To: (Shirley Ma's message of "Fri, 10 Nov 2006 17:27:11 -0800")
References:
Message-ID:

> Sorry, I did not intend to send the previous email. Anyway, I accidentally sent it
> out. What I thought was that there would be a problem if missed_event
> always returned 1. Then this NAPI poll would go on forever.

Well, it's limited by the quota that the net stack gives it, so there's no possibility of looping forever. However....

> How about deferring the rotting-packet processing until later? Like this:

that seems like it is still correct.

> With this patch, I could get NAPI + non-scaling code throughput performance
> from 1XXMb/s to 7XXMb/s; anyway, there are some other problems I am still
> investigating now.

But I wonder why it gives you a factor of 4 in performance?? Why does it make a difference? I would have thought that the rotting packet situation would be rare enough that it doesn't really matter for performance exactly how we handle it.

What are the other problems you're investigating?

- R.
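Roland's point above, that a single poll call is bounded by the quota even when the missed-event hint fires, can be sketched in plain C. This is a user-space model under stated assumptions: struct demo_cq and the demo_* functions are hypothetical stand-ins, not the real ib_cq, ib_req_notify_cq() or NAPI interfaces, and 'late_arrivals' simply models completions that show up while notification is being re-armed.

```c
#include <assert.h>

/* Hypothetical simulated completion queue; not the real struct ib_cq. */
struct demo_cq {
	int pending;       /* completions waiting to be polled */
	int late_arrivals; /* completions that land while we re-arm */
};

/* Consume one completion; returns 1 if one was handled, 0 if the CQ is empty. */
static int demo_poll_one(struct demo_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Re-arm notification; returns 1 as a "maybe missed event" hint. */
static int demo_req_notify(struct demo_cq *cq)
{
	if (cq->late_arrivals > 0) {
		cq->pending += cq->late_arrivals;
		cq->late_arrivals = 0;
		return 1;
	}
	return 0;
}

/*
 * One poll invocation: handles at most 'quota' completions, then either
 * reports that the quota was used up or re-arms the CQ. '*repoll' is set
 * to 1 when the caller should schedule another poll.
 */
static int demo_napi_poll(struct demo_cq *cq, int quota, int *repoll)
{
	int done = 0;

	assert(quota > 0);
	while (done < quota && demo_poll_one(cq))
		done++;

	if (done == quota)
		*repoll = 1;                   /* quota exhausted, more may remain */
	else
		*repoll = demo_req_notify(cq); /* drained; the hint asks for a re-poll */

	return done;
}
```

In this model a call never does more than 'quota' units of work, whatever the hint says; the hint only decides whether the caller polls again.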
From elsen_david at yahoo.com Mon Nov 13 09:30:09 2006
From: elsen_david at yahoo.com (david elsen)
Date: Mon, 13 Nov 2006 09:30:09 -0800 (PST)
Subject: [openib-general] [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme
In-Reply-To:
Message-ID: <20061113173010.30191.qmail@web58006.mail.re3.yahoo.com>

Sundeep,

I am sorry. I did not make myself very clear in my questions.

I have iWARP running on my set-up for the Ammasso and Chelsio cards. To do this, I downloaded the iWARP code from the gen2 branch. Now I am trying to run the OSC MPI tool MVAPICH2 there to run iWARP traffic in my set-up. When I try to build MVAPICH2 0.9.8, it has a mandatory variable, OPEN_IB_HOME = /usr/local/ofed, in the makefile make.mvapich2.iwarp, the build file which is to be run for the iWARP code.

My question is: why should I have this reference to /usr/local/ofed there if I do not need to download the OFED distribution code to run iWARP?

Is it possible to add more information to your MVAPICH2 0.9.8 User Guide describing how to build this and what the dependencies are?

David

Sundeep Narravula wrote:

Hi David,

> iWARP is actually a part of the Open Fabrics SVN. It is available from
> a different branch.
>
> I am cc'ing this note to my group. One of the students (Sundeep) will
> send you the detailed instructions on which branch of OF to download
> and use.

The instructions for setting up iWARP on OpenFabrics are available on the openIB wiki at

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3
https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Ammasso1100

Further, the branch you can download from the svn is

https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable

Please let us know if you have any further questions.

Regards,
--Sundeep.

> > Is there any document describing the build process of the MVAPICH2
> > tool?
I am going through the MVAPICH2 0.9.8 users guide and that > > does not seem to be giving me the detailed information. > > We will add this additional information on our user guide. > > > Can you please provide README files for the iWARP which describes > > TODO steps? > > Sundeep's information will help. If you have any additional questions, > please feel free to ask us. > > Thanks, > > DK > > > Thanks, > > David > > > > Dhabaleswar Panda wrote: > > The MVAPICH team is pleased to announce the availability of MVAPICH2 > > 0.9.8 with the following NEW features: > > > > - Checkpoint/Restart support for application transparent systems-level > > fault tolerance. BLCR-based support using native InfiniBand Gen2 > > interface is provided. Flexible interface to work with different > > file systems. Tested with ext3 (local disk), NFS and PVFS2. > > > > Performance of sample applications with checkpoint-restart using > > PVFS2 and Lustre can be found here: > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/performance/mvapich2/application/MVAPICH2-ckpt.html > > > > - iWARP support: Incorporates the support for OpenFabrics/Gen2-iWARP. > > Tested with Chelsio T3 (10GigE) and Ammasso iWARP adapters and > > drivers. > > > > - RDMA CM-based Connection management support > > > > - Shared memory optimizations for collective communication operations. > > Efficient algorithms and optimizations for barrier, reduce and > > all-reduce operations. Exploits the multi-core optimized shared > > memory point-to-point communication support introduced in MVAPICH2 > > 0.9.6. > > > > Performance of sample collective operations with this new feature > > can be found here: > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-coll.html > > > > - uDAPL support for NetEffect 10GigE adapter. Tested with > > NetEffect NE010 adapter. 
> > > > More details on all features and supported platforms can be obtained > > by visiting the following URL: > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich2_features.html > > > > MVAPICH2 0.9.8 release is tested with the latest OFED 1.1 stack. It > > continues to deliver excellent performance. Sample performance > > numbers include: > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR: > > Two-sided operations: > > - 2.81 microsec one-way latency (4 bytes) > > - 1561 MB/sec unidirectional bandwidth > > - 2935 MB/sec bidirectional bandwidth > > > > One-sided operations: > > - 4.92 microsec Put latency > > - 1569 MB/sec unidirectional Put bandwidth > > - 2935 MB/sec bidirectional Put bandwidth > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR (Dual-rail): > > Two-sided operations: > > - 2.81 microsec one-way latency (4 bytes) > > - 3127 MB/sec unidirectional bandwidth > > - 5917 MB/sec bidirectional bandwidth > > > > One-sided operations: > > - 4.37 microsec Put latency > > - 3137 MB/sec unidirectional Put bandwidth > > - 5917 MB/sec bidirectional Put bandwidth > > > > - OpenFabrics/Gen2 on Opteron single-core with PCI-Ex and IBA-DDR: > > Two-sided operations: > > - 3.01 microsec one-way latency (4 bytes) > > - 1402 MB/sec unidirectional bandwidth > > - 2238 MB/sec bidirectional bandwidth > > > > One-sided operations: > > - 4.65 microsec Put latency > > - 1402 MB/sec unidirectional Put bandwidth > > - 2238 MB/sec bidirectional Put bandwidth > > > > Performance numbers for all other platforms, system configurations and > > operations can be viewed by visiting `Performance' section of the > > project's web page. > > > > With the ADI-3-level design, MVAPICH2 0.9.8 delivers similar > > performance for two-sided operations compared to MVAPICH 0.9.8. 
> > Organizations and users interested in getting the best performance for > > both two-sided and one-sided operations and also want to exploit > > advanced features (such as fault tolerance with checkpoint/restart, > > iWARP, RDMA CM connection management, multi-threading, integrated > > multi-rail, multi-core optimization, memory hook support and optimized > > collectives) may migrate from MVAPICH code base to MVAPICH2 code base. > > > > For downloading MVAPICH2 0.9.8 package and accessing the anonymous > > SVN, please visit the following URL: > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ > > > > A stripped down version of this release is also available at the > > OpenFabrics SVN. > > > > All feedbacks, including bug reports and hints for performance tuning, > > are welcome. Please post it to the mvapich-discuss mailing list. > > > > Thanks, > > > > MVAPICH Team at OSU/NBCL > > > > ====================================================================== > > MVAPICH/MVAPICH2 project is currently supported with funding from > > U.S. National Science Foundation, U.S. DOE Office of Science, > > Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx; > > and with equipment support from Advanced Clustering, AMD, Apple, > > Appro, Dell, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm > > and Sun Microsystems. Other technology partner includes Etnus. > > ====================================================================== > > > > _______________________________________________ > > openfabrics-ewg mailing list > > openfabrics-ewg at openib.org > > http://openib.org/mailman/listinfo/openfabrics-ewg
From ebiederm at xmission.com Mon Nov 13 09:31:24 2006
From: ebiederm at xmission.com (ebiederm at xmission.com)
Date: Mon, 13 Nov 2006 10:31:24 -0700
Subject: [openib-general] [PATCH 5/6] Use pci_find_ht_capability() in drivers/pci/quirks.c
In-Reply-To: <4558AAA3.6060606@myri.com> (Brice Goglin's message of "Mon, 13 Nov 2006 18:25:55 +0100")
References: <20061109064048.32E5E67C7C@ozlabs.org> <4552EF4E.4030408@myri.com> <39132BF2-24E2-4BBF-8D92-A201DBF83D5B@kernel.crashing.org> <20061109141733.GA11499@kroah.com> <1163400337.7410.111.camel@localhost.localdomain> <1163400939.4982.284.camel@localhost.localdomain> <774BA04D-395B-4C48-BFA1-1CBE87EE2603@kernel.crashing.org> <4558AAA3.6060606@myri.com>
Message-ID:

Brice Goglin writes:

>
> How do you want to detect the following loop in pci_find_capability()
> without changing the API?
> any cap -> any cap -> one HT cap -> any cap -> back to first HT cap
> When looking for a HT cap, pci_find_capability() will always succeed, it
> will never loop forever. But, pci_find_ht_capability() will loop forever
> if it is looking for a different HT cap than the one we have in the chain.

Ok. Reading the code now..

pci_find_capability will never loop forever precisely because we have the ttl check in __pci_find_next_cap, which underlies both pci_find_capability and pci_find_next_capability.

While it is a little silly, we can use the exact same logic on the hypertransport side.

What I was actually thinking, and this may almost be better, is to perform a preemptive walk in pci_find_capability, see if there is a loop, and if so issue a warning and refuse to deal with the device.

Or, being more generous, have pci_find_capability or some similar function do a walk and find the end of the chain, as defined by either the point where the chain terminates or the point where the chain loops back in on itself. At which point you only need the ttl check once.
One of the above two solutions is probably preferable so that higher level interfaces don't have to think about infinite loops. > When we add a new function looping on top of the existing functions, we > can always find a brain-damaged chain which makes the new function loop > forever while the existing ones do not... That's why I had a new TTL in > msi_ht_cap_enabled() in drivers/pci/quirks.c even if I was using > pci_find_next_capability() which has its own TTL. > > Anyway, I agree that protecting against bugs that we've never seen > before is not that important. I was just thinking that protecting is > very easy, while debugging might be boring once we find such a broken > PCI chain :) Right, and I agree that it makes sense to be a little paranoid especially in generic code. I just don't think it makes sense to make the user of the interface care that we are being paranoid. Eric From rdreier at cisco.com Mon Nov 13 09:34:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Nov 2006 09:34:36 -0800 Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: <4556CC57.6020805@voltaire.com> (Or Gerlitz's message of "Sun, 12 Nov 2006 09:25:11 +0200") References: <1162506626.29948.568.camel@brick.pathscale.com> <4556CC57.6020805@voltaire.com> Message-ID: > This is a bug since there are architectures eg PPC64 where the native > address size is u64 but dma_addr_t is u32. You are somehow in a > problem here, since returning an unchopped cpu_addr to the consumer > might cause a memory corruption as they are expecting 32 bit value. Yes (although ppc64 is now u64 -- sparc64 is still u32 though). I think this means we need to make these ib_dma_xxx functions return u64 instead of dma_addr_t. - R. 
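To make the truncation concern above concrete, here is a small sketch. demo_dma32_t is a hypothetical 32-bit stand-in for dma_addr_t on an architecture like the one described; the demo_* helpers are invented for illustration and are not the ib_dma_xxx functions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit DMA address type, standing in for a narrow dma_addr_t. */
typedef uint32_t demo_dma32_t;

/* Returning a 64-bit address through the narrow type drops the upper 32 bits. */
static demo_dma32_t demo_map_narrow(uint64_t cpu_addr)
{
	return (demo_dma32_t)cpu_addr;
}

/* Returning u64, as suggested above, carries the address through unchanged. */
static uint64_t demo_map_wide(uint64_t cpu_addr)
{
	return cpu_addr;
}

/* Returns 1 if the address survives a round trip through the narrow type. */
static int demo_fits_narrow(uint64_t cpu_addr)
{
	return (uint64_t)demo_map_narrow(cpu_addr) == cpu_addr;
}
```

Any address with bits set above bit 31 fails demo_fits_narrow(), which is the case where a consumer holding only the narrow value would end up touching the wrong memory.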
From rdreier at cisco.com Mon Nov 13 09:36:25 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Nov 2006 09:36:25 -0800 Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: (Roland Dreier's message of "Mon, 13 Nov 2006 09:34:36 -0800") References: <1162506626.29948.568.camel@brick.pathscale.com> <4556CC57.6020805@voltaire.com> Message-ID: ...although one has to think through the implications for pci_unmap_addr_set() I guess... From rdreier at cisco.com Mon Nov 13 09:42:24 2006 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 13 Nov 2006 09:42:24 -0800 Subject: [openib-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This includes various small fixes for 2.6.19-rc6: Hoang-Nam Nguyen (3): IB/ehca: Assure 4K alignment for firmware control blocks IB/ehca: Use named constant for max mtu IB/ehca: Activate scaling code by default Jean Delvare (1): RDMA/amso1100: Fix && typo Roland Dreier (1): IB/mad: Fix race between cancel and receive completion Tom Tucker (1): RDMA/amso1100: Fix unitialized pseudo_netdev accessed in c2_register_device drivers/infiniband/core/mad.c | 2 +- drivers/infiniband/hw/amso1100/c2.c | 3 +- drivers/infiniband/hw/amso1100/c2_provider.c | 39 +++++++++--------- drivers/infiniband/hw/amso1100/c2_rnic.c | 4 +- drivers/infiniband/hw/ehca/Kconfig | 1 + drivers/infiniband/hw/ehca/ehca_av.c | 5 +- drivers/infiniband/hw/ehca/ehca_hca.c | 17 ++++---- drivers/infiniband/hw/ehca/ehca_irq.c | 17 ++++---- drivers/infiniband/hw/ehca/ehca_iverbs.h | 8 ++++ drivers/infiniband/hw/ehca/ehca_main.c | 56 +++++++++++++++++++++---- drivers/infiniband/hw/ehca/ehca_mrmw.c | 8 ++-- drivers/infiniband/hw/ehca/ehca_qp.c | 10 ++-- drivers/infiniband/hw/ehca/hipz_hw.h | 2 + 13 files 
changed, 111 insertions(+), 61 deletions(-) The full log and patch is below: commit 39798695b4bcc7b145f8910ca56195808d3a7637 Author: Roland Dreier Date: Mon Nov 13 09:38:07 2006 -0800 IB/mad: Fix race between cancel and receive completion When ib_cancel_mad() is called, it puts the canceled send on a list and schedules a "flushed" callback from process context. However, this leaves a window where a receive completion could be processed before the send is fully flushed. This is fine, except that ib_find_send_mad() will find the MAD and return it to the receive processing, which results in the sender getting both a successful receive and a "flushed" send completion for the same request. Understandably, this confuses the sender, which is expecting only one of these two callbacks, and leads to grief such as a use-after-free in IPoIB. Fix this by changing ib_find_send_mad() to return a send struct only if the status is still successful (and not "flushed"). The search of the send_list already had this check, so this patch just adds the same check to the search of the wait_list. Signed-off-by: Roland Dreier commit b26c791e9ca3365616d40836000285931ca033d0 Author: Jean Delvare Date: Thu Nov 9 21:02:26 2006 +0100 RDMA/amso1100: Fix && typo Fix the AMSO1100 firmware version computation, which was broken due to "&&" being used where "&" should have. Signed-off-by: Jean Delvare Signed-off-by: Roland Dreier commit 2ffcab6ae44b02679229ca1852526d0a6e062dd2 Author: Tom Tucker Date: Wed Nov 8 14:23:22 2006 -0600 RDMA/amso1100: Fix unitialized pseudo_netdev accessed in c2_register_device Rework some load-time error handling: c2_register_device() leaked when it failed, and the function that called it didn't check the return code. 
Signed-off-by: Tom Tucker Signed-off-by: Roland Dreier commit f2c238a0c5e155acd49752c5fb93fb8d8534232b Author: Hoang-Nam Nguyen Date: Sun Nov 5 21:42:20 2006 +0100 IB/ehca: Activate scaling code by default Change ehca's Kconfig to activates scaling code as default. After several measurements we saw that this feature prevents dropped packets (UD) in stress situation. Thus, enabling it helps to improve ehca's bandwidth through IPoIB. Signed-off-by: Hoang-Nam Nguyen Signed-off-by: Roland Dreier commit c58121143f87930621c1a6fa9683b6862f2b42c9 Author: Hoang-Nam Nguyen Date: Sun Nov 5 21:42:56 2006 +0100 IB/ehca: Use named constant for max mtu Define and use a constant EHCA_MAX_MTU instead hardcoded value. Signed-off-by: Hoang-Nam Nguyen Signed-off-by: Roland Dreier commit 7e28db5d8ff63b1cabc221c5cb84a5f45752f1c2 Author: Hoang-Nam Nguyen Date: Tue Nov 7 00:56:39 2006 +0100 IB/ehca: Assure 4K alignment for firmware control blocks Assure 4K alignment for firmware control blocks in 64K page mode, because kzalloc()'s result address might not be 4K aligned if 64K pages are enabled. Thus, we introduce wrappers called ehca_{alloc,free}_fw_ctrlblock(), which use a slab cache for objects with 4K length and 4K alignment in order to alloc/free firmware control blocks in 64K page mode. In 4K page mode those wrappers just are defines of get_zeroed_page() and free_page(). Signed-off-by: Hoang-Nam Nguyen Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index 493f4c6..a72bcea 100644 --- a/drivers/infiniband/core/mad.c +++ b/drivers/infiniband/core/mad.c @@ -1750,7 +1750,7 @@ ib_find_send_mad(struct ib_mad_agent_pri */ (is_direct(wc->recv_buf.mad->mad_hdr.mgmt_class) || rcv_has_same_gid(mad_agent_priv, wr, wc))) - return wr; + return (wr->status == IB_WC_SUCCESS) ?
wr : NULL; } /* diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c index 9e7bd94..27fe242 100644 --- a/drivers/infiniband/hw/amso1100/c2.c +++ b/drivers/infiniband/hw/amso1100/c2.c @@ -1155,7 +1155,8 @@ static int __devinit c2_probe(struct pci goto bail10; } - c2_register_device(c2dev); + if (c2_register_device(c2dev)) + goto bail10; return 0; diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index da98d9f..fef9727 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -757,20 +757,17 @@ #endif int c2_register_device(struct c2_dev *dev) { - int ret; + int ret = -ENOMEM; int i; /* Register pseudo network device */ dev->pseudo_netdev = c2_pseudo_netdev_init(dev); - if (dev->pseudo_netdev) { - ret = register_netdev(dev->pseudo_netdev); - if (ret) { - printk(KERN_ERR PFX - "Unable to register netdev, ret = %d\n", ret); - free_netdev(dev->pseudo_netdev); - return ret; - } - } + if (!dev->pseudo_netdev) + goto out3; + + ret = register_netdev(dev->pseudo_netdev); + if (ret) + goto out2; pr_debug("%s:%u\n", __FUNCTION__, __LINE__); strlcpy(dev->ibdev.name, "amso%d", IB_DEVICE_NAME_MAX); @@ -848,21 +845,25 @@ int c2_register_device(struct c2_dev *de ret = ib_register_device(&dev->ibdev); if (ret) - return ret; + goto out1; for (i = 0; i < ARRAY_SIZE(c2_class_attributes); ++i) { ret = class_device_create_file(&dev->ibdev.class_dev, c2_class_attributes[i]); - if (ret) { - unregister_netdev(dev->pseudo_netdev); - free_netdev(dev->pseudo_netdev); - ib_unregister_device(&dev->ibdev); - return ret; - } + if (ret) + goto out0; } + goto out3; - pr_debug("%s:%u\n", __FUNCTION__, __LINE__); - return 0; +out0: + ib_unregister_device(&dev->ibdev); +out1: + unregister_netdev(dev->pseudo_netdev); +out2: + free_netdev(dev->pseudo_netdev); +out3: + pr_debug("%s:%u ret=%d\n", __FUNCTION__, __LINE__, ret); + return ret; } void 
c2_unregister_device(struct c2_dev *dev) diff --git a/drivers/infiniband/hw/amso1100/c2_rnic.c b/drivers/infiniband/hw/amso1100/c2_rnic.c index 21d9612..623dc95 100644 --- a/drivers/infiniband/hw/amso1100/c2_rnic.c +++ b/drivers/infiniband/hw/amso1100/c2_rnic.c @@ -157,8 +157,8 @@ static int c2_rnic_query(struct c2_dev * props->fw_ver = ((u64)be32_to_cpu(reply->fw_ver_major) << 32) | - ((be32_to_cpu(reply->fw_ver_minor) && 0xFFFF) << 16) | - (be32_to_cpu(reply->fw_ver_patch) && 0xFFFF); + ((be32_to_cpu(reply->fw_ver_minor) & 0xFFFF) << 16) | + (be32_to_cpu(reply->fw_ver_patch) & 0xFFFF); memcpy(&props->sys_image_guid, c2dev->netdev->dev_addr, 6); props->max_mr_size = 0xFFFFFFFF; props->page_size_cap = ~(C2_MIN_PAGESIZE-1); diff --git a/drivers/infiniband/hw/ehca/Kconfig b/drivers/infiniband/hw/ehca/Kconfig index 922389b..727b10d 100644 --- a/drivers/infiniband/hw/ehca/Kconfig +++ b/drivers/infiniband/hw/ehca/Kconfig @@ -10,6 +10,7 @@ config INFINIBAND_EHCA config INFINIBAND_EHCA_SCALING bool "Scaling support (EXPERIMENTAL)" depends on IBMEBUS && INFINIBAND_EHCA && HOTPLUG_CPU && EXPERIMENTAL + default y ---help--- eHCA scaling support schedules the CQ callbacks to different CPUs. 
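A side note on the c2_rnic.c hunk above, as a minimal user-space sketch rather than driver code: `&&` is a logical AND, so each version field collapses to 0 or 1 before it is shifted, while `&` masks the low 16 bits as intended. The helper names fw_ver_buggy() and fw_ver_fixed() and the sample version numbers are made up for illustration; only the expression shape follows the patch.

```c
#include <stdint.h>

/* Pre-patch shape: `&&` yields 0 or 1, so minor and patch are lost. */
static uint64_t fw_ver_buggy(uint32_t major, uint32_t minor, uint32_t patch)
{
	return ((uint64_t)major << 32) |
	       ((uint64_t)(minor && 0xFFFF) << 16) |
	       (uint64_t)(patch && 0xFFFF);
}

/* Post-patch shape: `&` keeps the low 16 bits of each field. */
static uint64_t fw_ver_fixed(uint32_t major, uint32_t minor, uint32_t patch)
{
	return ((uint64_t)major << 32) |
	       ((uint64_t)(minor & 0xFFFF) << 16) |
	       (uint64_t)(patch & 0xFFFF);
}
```

With a hypothetical version 1.2.3, fw_ver_buggy(1, 2, 3) gives 0x0000000100010001 and fw_ver_fixed(1, 2, 3) gives 0x0000000100020003, so any non-zero minor or patch level reads as 1 before the fix.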
diff --git a/drivers/infiniband/hw/ehca/ehca_av.c b/drivers/infiniband/hw/ehca/ehca_av.c index 3bac197..214e2fd 100644 --- a/drivers/infiniband/hw/ehca/ehca_av.c +++ b/drivers/infiniband/hw/ehca/ehca_av.c @@ -118,8 +118,7 @@ struct ib_ah *ehca_create_ah(struct ib_p } memcpy(&av->av.grh.word_1, &gid, sizeof(gid)); } - /* for the time being we use a hard coded PMTU of 2048 Bytes */ - av->av.pmtu = 4; + av->av.pmtu = EHCA_MAX_MTU; /* dgid comes in grh.word_3 */ memcpy(&av->av.grh.word_3, &ah_attr->grh.dgid, @@ -193,7 +192,7 @@ int ehca_modify_ah(struct ib_ah *ah, str memcpy(&new_ehca_av.grh.word_1, &gid, sizeof(gid)); } - new_ehca_av.pmtu = 4; /* see also comment in create_ah() */ + new_ehca_av.pmtu = EHCA_MAX_MTU; memcpy(&new_ehca_av.grh.word_3, &ah_attr->grh.dgid, sizeof(ah_attr->grh.dgid)); diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 5eae6ac..e1b618c 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -40,6 +40,7 @@ */ #include "ehca_tools.h" +#include "ehca_iverbs.h" #include "hcp_if.h" int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props) @@ -49,7 +50,7 @@ int ehca_query_device(struct ib_device * ib_device); struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -96,7 +97,7 @@ int ehca_query_device(struct ib_device * = min_t(int, rblock->max_total_mcast_qp_attach, INT_MAX); query_device1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -109,7 +110,7 @@ int ehca_query_port(struct ib_device *ib ib_device); struct hipz_query_port *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -162,7 +163,7 @@ int ehca_query_port(struct ib_device *ib 
props->active_speed = 0x1; query_port1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -178,7 +179,7 @@ int ehca_query_pkey(struct ib_device *ib return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -193,7 +194,7 @@ int ehca_query_pkey(struct ib_device *ib memcpy(pkey, &rblock->pkey_entries + index, sizeof(u16)); query_pkey1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -211,7 +212,7 @@ int ehca_query_gid(struct ib_device *ibd return -EINVAL; } - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -227,7 +228,7 @@ int ehca_query_gid(struct ib_device *ibd memcpy(&gid->raw[8], &rblock->guid_entries[index], sizeof(u64)); query_gid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 048cc44..c3ea746 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -45,6 +45,7 @@ #include "ehca_iverbs.h" #include "ehca_tools.h" #include "hcp_if.h" #include "hipz_fns.h" +#include "ipz_pt_fn.h" #define EQE_COMPLETION_EVENT EHCA_BMASK_IBM(1,1) #define EQE_CQ_QP_NUMBER EHCA_BMASK_IBM(8,31) @@ -137,38 +138,36 @@ int ehca_error_data(struct ehca_shca *sh u64 *rblock; unsigned long block_count; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Cannot allocate rblock memory."); ret = -ENOMEM; goto error_data1; } + /* rblock must be 4K aligned and should be 4K large */ ret = hipz_h_error_data(shca->ipz_hca_handle, resource, rblock, &block_count); - if (ret == H_R_STATE) { + if (ret == H_R_STATE) ehca_err(&shca->ib_device, "No error data is available: 
%lx.", resource); - } else if (ret == H_SUCCESS) { int length; length = EHCA_BMASK_GET(ERROR_DATA_LENGTH, rblock[0]); - if (length > PAGE_SIZE) - length = PAGE_SIZE; + if (length > EHCA_PAGESIZE) + length = EHCA_PAGESIZE; print_error_data(shca, data, rblock, length); - } - else { + } else ehca_err(&shca->ib_device, "Error data could not be fetched: %lx", resource); - } - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); error_data1: return ret; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 319c39d..3720e30 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -179,4 +179,12 @@ int ehca_mmap_register(u64 physical,void int ehca_munmap(unsigned long addr, size_t len); +#ifdef CONFIG_PPC_64K_PAGES +void *ehca_alloc_fw_ctrlblock(void); +void ehca_free_fw_ctrlblock(void *ptr); +#else +#define ehca_alloc_fw_ctrlblock() ((void *) get_zeroed_page(GFP_KERNEL)) +#define ehca_free_fw_ctrlblock(ptr) free_page((unsigned long)(ptr)) +#endif + #endif diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 024d511..01f5aa9 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -40,6 +40,9 @@ * POSSIBILITY OF SUCH DAMAGE. 
*/ +#ifdef CONFIG_PPC_64K_PAGES +#include +#endif #include "ehca_classes.h" #include "ehca_iverbs.h" #include "ehca_mrmw.h" @@ -49,7 +52,7 @@ #include "hcp_if.h" MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); MODULE_DESCRIPTION("IBM eServer HCA InfiniBand Device Driver"); -MODULE_VERSION("SVNEHCA_0017"); +MODULE_VERSION("SVNEHCA_0018"); int ehca_open_aqp1 = 0; int ehca_debug_level = 0; @@ -94,11 +97,31 @@ spinlock_t ehca_cq_idr_lock; DEFINE_IDR(ehca_qp_idr); DEFINE_IDR(ehca_cq_idr); + static struct list_head shca_list; /* list of all registered ehcas */ static spinlock_t shca_list_lock; static struct timer_list poll_eqs_timer; +#ifdef CONFIG_PPC_64K_PAGES +static struct kmem_cache *ctblk_cache = NULL; + +void *ehca_alloc_fw_ctrlblock(void) +{ + void *ret = kmem_cache_zalloc(ctblk_cache, SLAB_KERNEL); + if (!ret) + ehca_gen_err("Out of memory for ctblk"); + return ret; +} + +void ehca_free_fw_ctrlblock(void *ptr) +{ + if (ptr) + kmem_cache_free(ctblk_cache, ptr); + +} +#endif + static int ehca_create_slab_caches(void) { int ret; @@ -133,6 +156,17 @@ static int ehca_create_slab_caches(void) goto create_slab_caches5; } +#ifdef CONFIG_PPC_64K_PAGES + ctblk_cache = kmem_cache_create("ehca_cache_ctblk", + EHCA_PAGESIZE, H_CB_ALIGNMENT, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (!ctblk_cache) { + ehca_gen_err("Cannot create ctblk SLAB cache."); + ehca_cleanup_mrmw_cache(); + goto create_slab_caches5; + } +#endif return 0; create_slab_caches5: @@ -157,6 +191,10 @@ static void ehca_destroy_slab_caches(voi ehca_cleanup_qp_cache(); ehca_cleanup_cq_cache(); ehca_cleanup_pd_cache(); +#ifdef CONFIG_PPC_64K_PAGES + if (ctblk_cache) + kmem_cache_destroy(ctblk_cache); +#endif } #define EHCA_HCAAVER EHCA_BMASK_IBM(32,39) @@ -168,7 +206,7 @@ int ehca_sense_attributes(struct ehca_sh u64 h_ret; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_gen_err("Cannot allocate rblock 
memory."); return -ENOMEM; @@ -211,7 +249,7 @@ int ehca_sense_attributes(struct ehca_sh shca->sport[1].rate = IB_RATE_30_GBPS; num_ports1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -220,7 +258,7 @@ static int init_node_guid(struct ehca_sh int ret = 0; struct hipz_query_hca *rblock; - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + rblock = ehca_alloc_fw_ctrlblock(); if (!rblock) { ehca_err(&shca->ib_device, "Can't allocate rblock memory."); return -ENOMEM; @@ -235,7 +273,7 @@ static int init_node_guid(struct ehca_sh memcpy(&shca->ib_device.node_guid, &rblock->node_guid, sizeof(u64)); init_node_guid1: - kfree(rblock); + ehca_free_fw_ctrlblock(rblock); return ret; } @@ -431,7 +469,7 @@ static ssize_t ehca_show_##name(struct \ shca = dev->driver_data; \ \ - rblock = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); \ + rblock = ehca_alloc_fw_ctrlblock(); \ if (!rblock) { \ dev_err(dev, "Can't allocate rblock memory."); \ return 0; \ @@ -439,12 +477,12 @@ static ssize_t ehca_show_##name(struct \ if (hipz_h_query_hca(shca->ipz_hca_handle, rblock) != H_SUCCESS) { \ dev_err(dev, "Can't query device properties"); \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ return 0; \ } \ \ data = rblock->name; \ - kfree(rblock); \ + ehca_free_fw_ctrlblock(rblock); \ \ if ((strcmp(#name, "num_ports") == 0) && (ehca_nr_ports == 1)) \ return snprintf(buf, 256, "1\n"); \ @@ -752,7 +790,7 @@ int __init ehca_module_init(void) int ret; printk(KERN_INFO "eHCA Infiniband Device Driver " - "(Rel.: SVNEHCA_0017)\n"); + "(Rel.: SVNEHCA_0018)\n"); idr_init(&ehca_qp_idr); idr_init(&ehca_cq_idr); spin_lock_init(&ehca_qp_idr_lock); diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 5ca6544..abce676 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -1013,7 +1013,7 @@ int ehca_reg_mr_rpages(struct ehca_shca u32 i; u64 *kpage; - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = 
ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1092,7 +1092,7 @@ int ehca_reg_mr_rpages(struct ehca_shca ehca_reg_mr_rpages_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%x shca=%p e_mr=%p pginfo=%p " @@ -1124,7 +1124,7 @@ inline int ehca_rereg_mr_rereg1(struct e ehca_mrmw_map_acl(acl, &hipz_acl); ehca_mrmw_set_pgsize_hipz_acl(&hipz_acl); - kpage = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); + kpage = ehca_alloc_fw_ctrlblock(); if (!kpage) { ehca_err(&shca->ib_device, "kpage alloc failed"); ret = -ENOMEM; @@ -1181,7 +1181,7 @@ inline int ehca_rereg_mr_rereg1(struct e } ehca_rereg_mr_rereg1_exit1: - kfree(kpage); + ehca_free_fw_ctrlblock(kpage); ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%x lkey=%x rkey=%x " diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 4394123..cf3e50e 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -811,8 +811,8 @@ static int internal_modify_qp(struct ib_ unsigned long spl_flags = 0; /* do query_qp to obtain current attr values */ - mqpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL); - if (mqpcb == NULL) { + mqpcb = ehca_alloc_fw_ctrlblock(); + if (!mqpcb) { ehca_err(ibqp->device, "Could not get zeroed page for mqpcb " "ehca_qp=%p qp_num=%x ", my_qp, ibqp->qp_num); return -ENOMEM; @@ -1225,7 +1225,7 @@ modify_qp_exit2: } modify_qp_exit1: - kfree(mqpcb); + ehca_free_fw_ctrlblock(mqpcb); return ret; } @@ -1277,7 +1277,7 @@ int ehca_query_qp(struct ib_qp *qp, return -EINVAL; } - qpcb = kzalloc(H_CB_ALIGNMENT, GFP_KERNEL ); + qpcb = ehca_alloc_fw_ctrlblock(); if (!qpcb) { ehca_err(qp->device,"Out of memory for qpcb " "ehca_qp=%p qp_num=%x", my_qp, qp->qp_num); @@ -1401,7 +1401,7 @@ int ehca_query_qp(struct ib_qp *qp, ehca_dmp(qpcb, 4*70, "qp_num=%x", qp->qp_num); query_qp_exit1: - 
kfree(qpcb); + ehca_free_fw_ctrlblock(qpcb); return ret; } diff --git a/drivers/infiniband/hw/ehca/hipz_hw.h b/drivers/infiniband/hw/ehca/hipz_hw.h index 3fc92b0..fad9136 100644 --- a/drivers/infiniband/hw/ehca/hipz_hw.h +++ b/drivers/infiniband/hw/ehca/hipz_hw.h @@ -45,6 +45,8 @@ #define __HIPZ_HW_H__ #include "ehca_tools.h" +#define EHCA_MAX_MTU 4 + /* QP Table Entry Memory Map */ struct hipz_qptemm { u64 qpx_hcr; From halr at voltaire.com Mon Nov 13 09:49:08 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Nov 2006 19:49:08 +0200 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. References: <45589CB1.7020307@ichips.intel.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F501894434@taurus.voltaire.com> There may be an OpenSM bug with setting hop limit in the path record response. I'm looking at it now. -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Sean Hefty Sent: Mon 11/13/2006 11:26 AM To: Bub Thomas Cc: Erez Cohen; openib-general at openib.org Subject: Re: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. Bub Thomas wrote: > Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-) > Don't ask me where I got that hop_limit from, it must have been an > example I found somewhere. > Can you explain why that hop_limit/is_global makes a difference in > communication between gen1 and gen2? Does the counterpart need to have > the same hop_limit? The gen2 stack uses a hop_limit > 0 to indicate that global routing is being used. If the hop_limit is > 0, then the global routing information must be valid. > The path record values I use are queried from the OSM using a > SERVICE_RECORD query followed by a path record query. > I'm not using any alternate path record values, is this critical? Everything is supposed to work with the path records returned from the SM. 
I was wondering if you were querying for the path record, modifying the returned value, or creating a path record yourself. > path_record.packet_life_time = 0; I would set this higher (maybe between 12-20, with: 18 = 1 second, 19 = 2 seconds, 20 = 4 seconds, etc.) _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From narravul at cse.ohio-state.edu Mon Nov 13 10:56:44 2006 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Mon, 13 Nov 2006 13:56:44 -0500 (EST) Subject: [openib-general] [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme In-Reply-To: <20061113173010.30191.qmail@web58006.mail.re3.yahoo.com> Message-ID: Hi David, > My question is why should I have this reference to /usr/local/ofed > there if I do not need to download the OFED distribution code to run > the iWARP. The variable OPEN_IB_HOME in make.mvapich2.iwarp sets the path to the Gen2 installation that you intend to use for iwarp. The default in the script is /usr/local/ofed. Based on your installation, please set this variable appropriately. export OPEN_IB_HOME=/usr/local > Is it possible to add more information in your MVAPICH2 0.9.8 User Guide describing how to build this and what are the dependencies? We have updated our userguide with the installation information. The dependencies that we have are the installation of OF iwarp branch, the setup of rdma-cm module and the setup of the underlying network. Regards, --Sundeep. > > David > Sundeep Narravula wrote: > Hi David, > > > iWARP is actually a part of the Open Fabrics SVN. It is available from > > a different branch. > > > > I am cc'ing this note to my group. One of the students (Sundeep) will > > send you the detailed instructions on which branch of OF to download > > and use. 
> > The instructions for setting up iwarp on OpenFrabics is avalable on the > openIB wiki at > https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 > https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Ammasso1100 > > Further, the branch you can download from the svn is > https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable > > Please let us know if you have any further questions. > > Regards, > --Sundeep. > > > > > Is there any document describing the build process of the MVAPICH2 > > > tool? I am going through the MVAPICH2 0.9.8 users guide and that > > > does not seem to be giving me the detailed information. > > > > We will add this additional information on our user guide. > > > > > Can you please provide README files for the iWARP which describes > > > TODO steps? > > > > Sundeep's information will help. If you have any additional questions, > > please feel free to ask us. > > > > Thanks, > > > > DK > > > > > Thanks, > > > David > > > > > > Dhabaleswar Panda > wrote: > > > The MVAPICH team is pleased to announce the availability of MVAPICH2 > > > 0.9.8 with the following NEW features: > > > > > > - Checkpoint/Restart support for application transparent systems-level > > > fault tolerance. BLCR-based support using native InfiniBand Gen2 > > > interface is provided. Flexible interface to work with different > > > file systems. Tested with ext3 (local disk), NFS and PVFS2. > > > > > > Performance of sample applications with checkpoint-restart using > > > PVFS2 and Lustre can be found here: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/performance/mvapich2/application/MVAPICH2-ckpt.html > > > > > > - iWARP support: Incorporates the support for OpenFabrics/Gen2-iWARP. > > > Tested with Chelsio T3 (10GigE) and Ammasso iWARP adapters and > > > drivers. > > > > > > - RDMA CM-based Connection management support > > > > > > - Shared memory optimizations for collective communication operations. 
> > > Efficient algorithms and optimizations for barrier, reduce and > > > all-reduce operations. Exploits the multi-core optimized shared > > > memory point-to-point communication support introduced in MVAPICH2 > > > 0.9.6. > > > > > > Performance of sample collective operations with this new feature > > > can be found here: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-coll.html > > > > > > - uDAPL support for NetEffect 10GigE adapter. Tested with > > > NetEffect NE010 adapter. > > > > > > More details on all features and supported platforms can be obtained > > > by visiting the following URL: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich2_features.html > > > > > > MVAPICH2 0.9.8 release is tested with the latest OFED 1.1 stack. It > > > continues to deliver excellent performance. Sample performance > > > numbers include: > > > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR: > > > Two-sided operations: > > > - 2.81 microsec one-way latency (4 bytes) > > > - 1561 MB/sec unidirectional bandwidth > > > - 2935 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.92 microsec Put latency > > > - 1569 MB/sec unidirectional Put bandwidth > > > - 2935 MB/sec bidirectional Put bandwidth > > > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR (Dual-rail): > > > Two-sided operations: > > > - 2.81 microsec one-way latency (4 bytes) > > > - 3127 MB/sec unidirectional bandwidth > > > - 5917 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.37 microsec Put latency > > > - 3137 MB/sec unidirectional Put bandwidth > > > - 5917 MB/sec bidirectional Put bandwidth > > > > > > - OpenFabrics/Gen2 on Opteron single-core with PCI-Ex and IBA-DDR: > > > Two-sided operations: > > > - 3.01 microsec one-way latency (4 bytes) > > > - 1402 MB/sec unidirectional bandwidth > > > - 2238 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.65 
microsec Put latency > > > - 1402 MB/sec unidirectional Put bandwidth > > > - 2238 MB/sec bidirectional Put bandwidth > > > > > > Performance numbers for all other platforms, system configurations and > > > operations can be viewed by visiting `Performance' section of the > > > project's web page. > > > > > > With the ADI-3-level design, MVAPICH2 0.9.8 delivers similar > > > performance for two-sided operations compared to MVAPICH 0.9.8. > > > Organizations and users interested in getting the best performance for > > > both two-sided and one-sided operations and also want to exploit > > > advanced features (such as fault tolerance with checkpoint/restart, > > > iWARP, RDMA CM connection management, multi-threading, integrated > > > multi-rail, multi-core optimization, memory hook support and optimized > > > collectives) may migrate from MVAPICH code base to MVAPICH2 code base. > > > > > > For downloading MVAPICH2 0.9.8 package and accessing the anonymous > > > SVN, please visit the following URL: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ > > > > > > A stripped down version of this release is also available at the > > > OpenFabrics SVN. > > > > > > All feedbacks, including bug reports and hints for performance tuning, > > > are welcome. Please post it to the mvapich-discuss mailing list. > > > > > > Thanks, > > > > > > MVAPICH Team at OSU/NBCL > > > > > > ====================================================================== > > > MVAPICH/MVAPICH2 project is currently supported with funding from > > > U.S. National Science Foundation, U.S. DOE Office of Science, > > > Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx; > > > and with equipment support from Advanced Clustering, AMD, Apple, > > > Appro, Dell, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm > > > and Sun Microsystems. Other technology partner includes Etnus. 
> > > ====================================================================== > > > > > > _______________________________________________ > > > openfabrics-ewg mailing list > > > openfabrics-ewg at openib.org > > > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > > > I am trying to use the OSC MPI tool for the iWARP and quite new to the open fabrics tools. > > > If I know it correctly, iWARP is not yet part of the OFED release. But the iWARP makefile has reference to ofed code. Is it really required.
From sean.hefty at intel.com Mon Nov 13 11:05:47 2006 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 13 Nov 2006 11:05:47 -0800 Subject: [openib-general] IPoIB new multicast API patches oops In-Reply-To: <20061109100229.GF14960@mellanox.co.il> Message-ID: <000101c70756$ba41e110$1bcd180a@amr.corp.intel.com> I have not been able to reproduce this crash on my systems, and even instrumenting the code isn't helping me to locate the issue. Can you apply the following patch on top of the previous patches, and let me know if you get any additional output? 
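For anyone unfamiliar with this kind of instrumentation, here is a rough user-space sketch of the idea the patch that follows relies on: keep a small debug state next to the object and assert on every transition, so an unexpected second request or a stray completion trips immediately. This is only an illustration under assumptions: assert() stands in for the kernel's BUG_ON(), and the names dbg_state, dbg_group, dbg_start_join() and so on are hypothetical, not taken from multicast.c.

```c
#include <assert.h>

/* Hypothetical debug states, mirroring the idle/joining/leaving idea. */
enum dbg_state { DBG_IDLE, DBG_JOINING, DBG_LEAVING };

struct dbg_group {
	enum dbg_state debug_state;
};

/* A join may only be issued while no other request is outstanding. */
static void dbg_start_join(struct dbg_group *g)
{
	assert(g->debug_state == DBG_IDLE);
	g->debug_state = DBG_JOINING;
}

/* A join completion is only valid if a join was actually issued. */
static void dbg_join_done(struct dbg_group *g)
{
	assert(g->debug_state == DBG_JOINING);
	g->debug_state = DBG_IDLE;
}

/* Same pairing for leave requests and their completions. */
static void dbg_start_leave(struct dbg_group *g)
{
	assert(g->debug_state == DBG_IDLE);
	g->debug_state = DBG_LEAVING;
}

static void dbg_leave_done(struct dbg_group *g)
{
	assert(g->debug_state == DBG_LEAVING);
	g->debug_state = DBG_IDLE;
}

/* Freeing is only legal once nothing is in flight. */
static int dbg_can_release(const struct dbg_group *g)
{
	return g->debug_state == DBG_IDLE;
}
```

The point of the design is that a double completion, or a release while a query is still pending, fails at the exact call site instead of showing up later as memory corruption.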
- Sean

---
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index 88a9edf..b3bc4c6 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -81,6 +81,12 @@ enum mcast_state {
 	MCAST_ERROR
 };
 
+enum mcast_debug {
+	MCAST_DEBUG_IDLE,
+	MCAST_DEBUG_JOINING,
+	MCAST_DEBUG_LEAVING,
+};
+
 struct mcast_member;
 
 struct mcast_group {
@@ -97,6 +103,7 @@ struct mcast_group {
 	enum mcast_state	state;
 	struct ib_sa_query	*query;
 	int			query_id;
+	enum mcast_debug	debug_state;
 };
 
 struct mcast_member {
@@ -179,6 +186,7 @@ static void release_group(struct mcast_g
 	if (atomic_dec_and_test(&group->refcount)) {
 		rb_erase(&group->node, &port->table);
 		spin_unlock_irqrestore(&port->lock, flags);
+		BUG_ON(group->debug_state != MCAST_DEBUG_IDLE);
 		kfree(group);
 		deref_port(port);
 	} else
@@ -319,6 +327,8 @@ static int send_join(struct mcast_group
 	struct mcast_port *port = group->port;
 	int ret;
 
+	BUG_ON(group->debug_state != MCAST_DEBUG_IDLE);
+	group->debug_state = MCAST_DEBUG_JOINING;
 	ret = ib_sa_mcmember_rec_query(&sa_client, port->dev->device,
 				       port->port_num, IB_MGMT_METHOD_SET,
 				       &member->multicast.rec,
@@ -341,6 +351,8 @@ static int send_leave(struct mcast_group
 	rec = group->rec;
 	rec.join_state = leave_state;
 
+	BUG_ON(group->debug_state != MCAST_DEBUG_IDLE);
+	group->debug_state = MCAST_DEBUG_LEAVING;
 	ret = ib_sa_mcmember_rec_query(&sa_client, port->dev->device,
 				       port->port_num, IB_SA_METHOD_DELETE, &rec,
 				       IB_SA_MCMEMBER_REC_MGID |
@@ -493,6 +505,8 @@ static void join_handler(int status, str
 {
 	struct mcast_group *group = context;
 
+	BUG_ON(group->debug_state != MCAST_DEBUG_JOINING);
+	group->debug_state = MCAST_DEBUG_IDLE;
 	if (status)
 		process_join_error(group, status);
 	else {
@@ -510,6 +524,10 @@ static void join_handler(int status, str
 static void leave_handler(int status, struct ib_sa_mcmember_rec *rec,
 			  void *context)
 {
+	struct mcast_group *group = context;
+
+	BUG_ON(group->debug_state != MCAST_DEBUG_LEAVING);
+	group->debug_state = MCAST_DEBUG_IDLE;
 	mcast_work_handler(context);
 }
 

From tziporet at dev.mellanox.co.il Mon Nov 13 11:32:53 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 13 Nov 2006 14:32:53 -0500
Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated
In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com>
References: <6923467.1163190820620.JavaMail.websites@opensubscriber> <5CE025EE7D88BA4599A2C8FEFCF226F501894407@taurus.voltaire.com> <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com>
Message-ID: <4558C865.6050604@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Can you see if this fixes it ? Thanks.
>
> -- Hal
>
> Index: opensm/osm_helper.c
> ===================================================================
> --- opensm/osm_helper.c (revision 10089)
> +++ opensm/osm_helper.c (working copy)
> @@ -1264,7 +1264,7 @@
>  IN const ib_service_record_t* const p_sr,
>  IN const osm_log_level_t log_level )
>  {
> - char buf_service_key[33];
> + char buf_service_key[35];
>  char buf_service_name[65];
>  if( osm_log_is_active( p_log, log_level ) )
>

Hal,

If this patch does solve the problem, please add it to the support page of OFED 1.1 on the Wiki (https://openib.org/tiki/tiki-index.php?page=OFED+Support)

Thanks,
Tziporet

From elsen_david at yahoo.com Mon Nov 13 11:35:29 2006
From: elsen_david at yahoo.com (david elsen)
Date: Mon, 13 Nov 2006 11:35:29 -0800 (PST)
Subject: [openib-general] [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme
In-Reply-To:
Message-ID: <20061113193529.45124.qmail@web58011.mail.re3.yahoo.com>

I modified OPEN_IB_HOME to point to the iWARP directory. Now it is looking for the OPEN_IB_HOME/lib64 and OPEN_IB_HOME/lib directories.

Sundeep Narravula wrote:
Hi David,

> My question is why should I have this reference to /usr/local/ofed
> there if I do not need to download the OFED distribution code to run
> the iWARP.

The variable OPEN_IB_HOME in make.mvapich2.iwarp sets the path to the Gen2 installation that you intend to use for iWARP. The default in the script is /usr/local/ofed. Based on your installation, please set this variable appropriately.

export OPEN_IB_HOME=/usr/local

> Is it possible to add more information in your MVAPICH2 0.9.8 User Guide describing how to build this and what are the dependencies?

We have updated our user guide with the installation information. The dependencies are the installation of the OF iWARP branch, the setup of the rdma-cm module and the setup of the underlying network.

Regards,
--Sundeep.

>
> David
> Sundeep Narravula wrote:
> Hi David,
>
> > iWARP is actually a part of the Open Fabrics SVN. It is available from
> > a different branch.
> >
> > I am cc'ing this note to my group. One of the students (Sundeep) will
> > send you the detailed instructions on which branch of OF to download
> > and use.
>
> The instructions for setting up iWARP on OpenFabrics are available on the
> openIB wiki at
> https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3
> https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Ammasso1100
>
> Further, the branch you can download from the svn is
> https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable
>
> Please let us know if you have any further questions.
>
> Regards,
> --Sundeep.
>
> > > Is there any document describing the build process of the MVAPICH2
> > > tool? I am going through the MVAPICH2 0.9.8 users guide and that
> > > does not seem to be giving me the detailed information.
> >
> > We will add this additional information on our user guide.
> >
> > > Can you please provide README files for the iWARP which describes
> > > TODO steps?
> >
> > Sundeep's information will help.

From elsen_david at yahoo.com Mon Nov 13 11:39:50 2006
From: elsen_david at yahoo.com (david elsen)
Date: Mon, 13 Nov 2006 11:39:50 -0800 (PST)
Subject: [openib-general] [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme
In-Reply-To: <20061113193529.45124.qmail@web58011.mail.re3.yahoo.com>
Message-ID: <20061113193950.76369.qmail@web58005.mail.re3.yahoo.com>

The iWARP code does not have these directories. Would it be better if the iWARP makefile did not depend on the OFED code?

david elsen wrote:
I modified OPEN_IB_HOME to point to the iWARP directory. Now it is looking for the OPEN_IB_HOME/lib64 and OPEN_IB_HOME/lib directories.

Sundeep Narravula wrote:
Hi David,

> My question is why should I have this reference to /usr/local/ofed
> there if I do not need to download the OFED distribution code to run
> the iWARP.

The variable OPEN_IB_HOME in make.mvapich2.iwarp sets the path to the Gen2 installation that you intend to use for iWARP. The default in the script is /usr/local/ofed. Based on your installation, please set this variable appropriately.

export OPEN_IB_HOME=/usr/local

> Is it possible to add more information in your MVAPICH2 0.9.8 User Guide describing how to build this and what are the dependencies?

We have updated our user guide with the installation information. The dependencies are the installation of the OF iWARP branch, the setup of the rdma-cm module and the setup of the underlying network.

Regards,
--Sundeep.

From elsen_david at yahoo.com Mon Nov 13 11:56:43 2006
From: elsen_david at yahoo.com (david elsen)
Date: Mon, 13 Nov 2006 11:56:43 -0800 (PST)
Subject: [openib-general] [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme
In-Reply-To: <20061113193950.76369.qmail@web58005.mail.re3.yahoo.com>
Message-ID: <20061113195643.17110.qmail@web58012.mail.re3.yahoo.com>

Sundeep,

I do not see any subdirectory for lib64 or lib in either the iWARP branch or the OFED branch. Can you please tell me what OPEN_IB_LIB refers to in the iWARP MVAPICH2 makefile?

Regards,
David

david elsen wrote:
The iWARP code does not have these directories. Would it be better if the iWARP makefile did not depend on the OFED code?

david elsen wrote:
I modified OPEN_IB_HOME to point to the iWARP directory. Now it is looking for the OPEN_IB_HOME/lib64 and OPEN_IB_HOME/lib directories.

Sundeep Narravula wrote:
Hi David,

> My question is why should I have this reference to /usr/local/ofed
> there if I do not need to download the OFED distribution code to run
> the iWARP.

The variable OPEN_IB_HOME in make.mvapich2.iwarp sets the path to the Gen2 installation that you intend to use for iWARP. The default in the script is /usr/local/ofed. Based on your installation, please set this variable appropriately.

export OPEN_IB_HOME=/usr/local

> Is it possible to add more information in your MVAPICH2 0.9.8 User Guide describing how to build this and what are the dependencies?

We have updated our user guide with the installation information. The dependencies are the installation of the OF iWARP branch, the setup of the rdma-cm module and the setup of the underlying network.

Regards,
--Sundeep.
> > > > Thanks, > > > > DK > > > > > Thanks, > > > David > > > > > > Dhabaleswar Panda > wrote: > > > The MVAPICH team is pleased to announce the availability of MVAPICH2 > > > 0.9.8 with the following NEW features: > > > > > > - Checkpoint/Restart support for application transparent systems-level > > > fault tolerance. BLCR-based support using native InfiniBand Gen2 > > > interface is provided. Flexible interface to work with different > > > file systems. Tested with ext3 (local disk), NFS and PVFS2. > > > > > > Performance of sample applications with checkpoint-restart using > > > PVFS2 and Lustre can be found here: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/performance/mvapich2/application/MVAPICH2-ckpt.html > > > > > > - iWARP support: Incorporates the support for OpenFabrics/Gen2-iWARP. > > > Tested with Chelsio T3 (10GigE) and Ammasso iWARP adapters and > > > drivers. > > > > > > - RDMA CM-based Connection management support > > > > > > - Shared memory optimizations for collective communication operations. > > > Efficient algorithms and optimizations for barrier, reduce and > > > all-reduce operations. Exploits the multi-core optimized shared > > > memory point-to-point communication support introduced in MVAPICH2 > > > 0.9.6. > > > > > > Performance of sample collective operations with this new feature > > > can be found here: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-coll.html > > > > > > - uDAPL support for NetEffect 10GigE adapter. Tested with > > > NetEffect NE010 adapter. > > > > > > More details on all features and supported platforms can be obtained > > > by visiting the following URL: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich2_features.html > > > > > > MVAPICH2 0.9.8 release is tested with the latest OFED 1.1 stack. It > > > continues to deliver excellent performance. 
Sample performance > > > numbers include: > > > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR: > > > Two-sided operations: > > > - 2.81 microsec one-way latency (4 bytes) > > > - 1561 MB/sec unidirectional bandwidth > > > - 2935 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.92 microsec Put latency > > > - 1569 MB/sec unidirectional Put bandwidth > > > - 2935 MB/sec bidirectional Put bandwidth > > > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR (Dual-rail): > > > Two-sided operations: > > > - 2.81 microsec one-way latency (4 bytes) > > > - 3127 MB/sec unidirectional bandwidth > > > - 5917 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.37 microsec Put latency > > > - 3137 MB/sec unidirectional Put bandwidth > > > - 5917 MB/sec bidirectional Put bandwidth > > > > > > - OpenFabrics/Gen2 on Opteron single-core with PCI-Ex and IBA-DDR: > > > Two-sided operations: > > > - 3.01 microsec one-way latency (4 bytes) > > > - 1402 MB/sec unidirectional bandwidth > > > - 2238 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.65 microsec Put latency > > > - 1402 MB/sec unidirectional Put bandwidth > > > - 2238 MB/sec bidirectional Put bandwidth > > > > > > Performance numbers for all other platforms, system configurations and > > > operations can be viewed by visiting `Performance' section of the > > > project's web page. > > > > > > With the ADI-3-level design, MVAPICH2 0.9.8 delivers similar > > > performance for two-sided operations compared to MVAPICH 0.9.8. 
> > > Organizations and users who are interested in getting the best performance for > > > both two-sided and one-sided operations and who also want to exploit > > > advanced features (such as fault tolerance with checkpoint/restart, > > > iWARP, RDMA CM connection management, multi-threading, integrated > > > multi-rail, multi-core optimization, memory hook support and optimized > > > collectives) may migrate from MVAPICH code base to MVAPICH2 code base. > > > > > > For downloading MVAPICH2 0.9.8 package and accessing the anonymous > > > SVN, please visit the following URL: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ > > > > > > A stripped down version of this release is also available at the > > > OpenFabrics SVN. > > > > > > All feedback, including bug reports and hints for performance tuning, > > > is welcome. Please post it to the mvapich-discuss mailing list. > > > > > > Thanks, > > > > > > MVAPICH Team at OSU/NBCL > > > > > > ====================================================================== > > > MVAPICH/MVAPICH2 project is currently supported with funding from > > > U.S. National Science Foundation, U.S. DOE Office of Science, > > > Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx; > > > and with equipment support from Advanced Clustering, AMD, Apple, > > > Appro, Dell, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm > > > and Sun Microsystems. Other technology partners include Etnus. > > > ======================================================================
> > > I am trying to use the OSC MPI tool for the iWARP and quite new to the open fabrics tools. > > If I know it correctly, iWARP is not yet part of the OFED release. > But the iWARP makefile has reference to ofed code. Is it really required. > > Is there any document describing the build process of the MVAPICH2 tool? I am going through the MVAPICH2 0.9.8 users guide and that does not seem to be giving me the detailed information. > > Can you please provide README files for the iWARP which describes TODO steps? > > Thanks, > David
From or.gerlitz at gmail.com Mon Nov 13 12:02:45 2006 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Mon, 13 Nov 2006 22:02:45 +0200 Subject: [openib-general] IB/ipath - Implement new verbs DMA mapping functions In-Reply-To: References: <1162506626.29948.568.camel@brick.pathscale.com> <4556CC57.6020805@voltaire.com> Message-ID: <15ddcffd0611131202r6ab49d22r96c8c31cfaacb411@mail.gmail.com> On 11/13/06, Roland Dreier wrote: > > This is a bug since there are architectures eg PPC64 where the native > > address size is u64 but dma_addr_t is u32. You are somehow in a > > problem here, since returning an unchopped cpu_addr to the consumer > > might cause a memory corruption as they are expecting 32 bit value. > > Yes (although ppc64 is now u64 -- sparc64 is still u32 though). I > think this means we need to make these ib_dma_xxx functions return u64 > instead of dma_addr_t.
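A note on the truncation being discussed: when a 64-bit address is stored in a 32-bit type, the upper 32 bits are silently dropped. The sketch below is plain user-space C, not kernel code; narrow_dma_addr_t, through_narrow, fits_narrow and checked_narrow are hypothetical names standing in for a platform whose dma_addr_t is u32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a platform where dma_addr_t is 32 bits wide. */
typedef uint32_t narrow_dma_addr_t;

/* Store a 64-bit address in the narrow type and read it back. */
static uint64_t through_narrow(uint64_t addr)
{
    narrow_dma_addr_t d = (narrow_dma_addr_t)addr; /* upper 32 bits are discarded */
    return (uint64_t)d;
}

/* Returns 1 if the address survives the round trip unchanged, 0 otherwise. */
static int fits_narrow(uint64_t addr)
{
    return through_narrow(addr) == addr;
}

/* Defensive variant: in debug builds, refuse to narrow an address that does not fit. */
static narrow_dma_addr_t checked_narrow(uint64_t addr)
{
    assert(addr <= UINT32_MAX);
    return (narrow_dma_addr_t)addr;
}
```

With these helpers, through_narrow(0x100000000) yields 0, which is the kind of silent corruption that returning u64 from the ib_dma_xxx functions would presumably avoid.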
This would require changing all the places where an IB driver uses dma_addr_t to hold the result of dma_map_xxx (single, page). Also note that dma_map_sg writes to a field of the sg elements whose type is dma_addr_t, so Ralph would have to keep a ghost of each sg element, since the dma_addr_t field (eg under sparc) can not hold a u64 address... Ralph/Roland - my hope is that there might be some other way to get out of the ipath driver problem, how about presenting the problem to LKML and seeing if someone comes up with a solution we did not think about? Or.
From sashak at voltaire.com Mon Nov 13 12:11:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 13 Nov 2006 22:11:32 +0200 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F501894434@taurus.voltaire.com> References: <45589CB1.7020307@ichips.intel.com> <5CE025EE7D88BA4599A2C8FEFCF226F501894434@taurus.voltaire.com> Message-ID: <20061113201132.GF21026@sashak.voltaire.com> On 19:49 Mon 13 Nov , Hal Rosenstock wrote: > There may be an OpenSM bug with setting hop limit in the path record response. I'm looking at it now. Looks like it is - OpenSM returns the same hop_limit value as was in the request. Bub, could you try the patch below?
Thanks, Sasha
diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c
index 560e385..72a89d0 100644
--- a/osm/opensm/osm_sa_path_record.c
+++ b/osm/opensm/osm_sa_path_record.c
@@ -709,6 +710,8 @@ __osm_pr_rcv_build_pr(
   p_pr->dlid = cl_hton16( dest_lid_ho );
   p_pr->slid = cl_hton16( src_lid_ho );
 
+  p_pr->hop_flow_raw &= cl_hton32(1<<31);
+
   p_pr->pkey = p_parms->pkey;
   p_pr->sl = cl_hton16(p_parms->sl);
   p_pr->mtu = (uint8_t)(p_parms->mtu | 0x80);
> > -- Hal > > ________________________________ > > From: openib-general-bounces at openib.org on behalf of Sean Hefty > Sent: Mon 11/13/2006 11:26 AM > To: Bub Thomas > Cc: Erez Cohen; openib-general at openib.org > Subject: Re: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. > > > > Bub Thomas wrote: > > Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-) > > Don't ask me where I got that hop_limit from, it must have been an > > example I found somewhere. > > Can you explain why that hop_limit/is_global makes a difference in > > communication between gen1 and gen2? Does the counterpart need to have > > the same hop_limit? > > The gen2 stack uses a hop_limit > 0 to indicate that global routing is being > used. If the hop_limit is > 0, then the global routing information must be valid. > > > The path record values I use are queried from the OSM using a > > SERVICE_RECORD query followed by a path record query. > > I'm not using any alternate path record values, is this critical? > > Everything is supposed to work with the path records returned from the SM. I > was wondering if you were querying for the path record, modifying the returned > value, or creating a path record yourself. > > > path_record.packet_life_time = 0; > > I would set this higher (maybe between 12-20, with: 18 = 1 second, 19 = 2 > seconds, 20 = 4 seconds, etc.)
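The values quoted above are consistent with the InfiniBand encoding of packet lifetime as 4.096 microseconds times 2 raised to the field value, so 18 gives roughly 1.07 seconds. A minimal sketch in C, assuming that encoding; packet_life_time_ns is a hypothetical helper and not part of the gen1 or gen2 stacks:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a packet lifetime exponent to nanoseconds, assuming the
 * encoding 4.096 us * 2^exponent. This sketch only accepts exponents
 * up to 51 so that the nanosecond count fits in 64 bits.
 */
static uint64_t packet_life_time_ns(uint8_t exponent)
{
    assert(exponent <= 51);
    return 4096ULL << exponent; /* 4.096 us is 4096 ns */
}
```

By this formula packet_life_time = 0 corresponds to about 4 microseconds, which may be why a higher value is suggested.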
From halr at voltaire.com Mon Nov 13 12:18:54 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Mon, 13 Nov 2006 22:18:54 +0200 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated References: <6923467.1163190820620.JavaMail.websites@opensubscriber> <5CE025EE7D88BA4599A2C8FEFCF226F501894407@taurus.voltaire.com> <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com> <4558C865.6050604@dev.mellanox.co.il> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F50189443C@taurus.voltaire.com> OK but I won't do this until I get back from SC. Is that soon enough ? There's 1 other patch that should be there as well. -- Hal ________________________________ From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] Sent: Mon 11/13/2006 2:32 PM To: Hal Rosenstock Cc: chris_youb at yahoo.ca; openib-general at openib.org Subject: Re: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated Hal Rosenstock wrote: > Can you see if this fixes it ? Thanks.
> > -- Hal > > Index: opensm/osm_helper.c > =================================================================== > --- opensm/osm_helper.c (revision 10089) > +++ opensm/osm_helper.c (working copy) > @@ -1264,7 +1264,7 @@ > IN const ib_service_record_t* const p_sr, > IN const osm_log_level_t log_level ) > { > - char buf_service_key[33]; > + char buf_service_key[35]; > char buf_service_name[65]; > if( osm_log_is_active( p_log, log_level ) ) > > Hal, If this patch does solve the problem please add it to the support page of OFED 1.1 on the Wiki (https://openib.org/tiki/tiki-index.php?page=OFED+Support) Thanks, Tziporet From tziporet at dev.mellanox.co.il Mon Nov 13 13:29:21 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 13 Nov 2006 16:29:21 -0500 Subject: [openib-general] OFED-1.1: *** stack smashing detected ***: opensm terminated In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F50189443C@taurus.voltaire.com> References: <6923467.1163190820620.JavaMail.websites@opensubscriber> <5CE025EE7D88BA4599A2C8FEFCF226F501894407@taurus.voltaire.com> <5CE025EE7D88BA4599A2C8FEFCF226F501894409@taurus.voltaire.com> <4558C865.6050604@dev.mellanox.co.il> <5CE025EE7D88BA4599A2C8FEFCF226F50189443C@taurus.voltaire.com> Message-ID: <4558E3B1.8070001@dev.mellanox.co.il> Hal Rosenstock wrote: > OK but I won't do this until I get back from SC. Is that soon enough ? > sure > > There's 1 other patch that should be there as well. > > please add it too Tziporet From elsen_david at yahoo.com Mon Nov 13 15:32:04 2006 From: elsen_david at yahoo.com (david elsen) Date: Mon, 13 Nov 2006 15:32:04 -0800 (PST) Subject: [openib-general] [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme In-Reply-To: <20061113193529.45124.qmail@web58011.mail.re3.yahoo.com> Message-ID: <20061113233205.66318.qmail@web58005.mail.re3.yahoo.com> I am trying to build the OSC MPI and see the error message with the congratulations message. 
Please see the attached file for details. Can someone help me find out the problem I am having with this? Thanks in advance, David david elsen wrote: I modified OPEN_IB_HOME to point to the iWARP directory. Now it is looking in the OPEN_IB_HOME/lib64 and OPEN_IB_HOME/lib directories. Sundeep Narravula wrote: Hi David, > My question is why should I have this reference to /usr/local/ofed > there if I do not need to download the OFED distribution code to run > the iWARP. The variable OPEN_IB_HOME in make.mvapich2.iwarp sets the path to the Gen2 installation that you intend to use for iwarp. The default in the script is /usr/local/ofed. Based on your installation, please set this variable appropriately. export OPEN_IB_HOME=/usr/local > Is it possible to add more information in your MVAPICH2 0.9.8 User Guide describing how to build this and what are the dependencies? We have updated our user guide with the installation information. The dependencies that we have are the installation of the OF iwarp branch, the setup of the rdma-cm module and the setup of the underlying network. Regards, --Sundeep. > > David > Sundeep Narravula wrote: > Hi David, > > > iWARP is actually a part of the Open Fabrics SVN. It is available from > > a different branch. > > > > I am cc'ing this note to my group. One of the students (Sundeep) will > > send you the detailed instructions on which branch of OF to download > > and use. > > The instructions for setting up iwarp on OpenFabrics are available on the > openIB wiki at > https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3 > https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Ammasso1100 > > Further, the branch you can download from the svn is > https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable > > Please let us know if you have any further questions. > > Regards, > --Sundeep. > > > > > Is there any document describing the build process of the MVAPICH2 > > > tool?
I am going through the MVAPICH2 0.9.8 users guide and that > > > does not seem to be giving me the detailed information. > > > > We will add this additional information on our user guide. > > > > > Can you please provide README files for the iWARP which describes > > > TODO steps? > > > > Sundeep's information will help. If you have any additional questions, > > please feel free to ask us. > > > > Thanks, > > > > DK > > > > > Thanks, > > > David > > > > > > Dhabaleswar Panda > wrote: > > > The MVAPICH team is pleased to announce the availability of MVAPICH2 > > > 0.9.8 with the following NEW features: > > > > > > - Checkpoint/Restart support for application transparent systems-level > > > fault tolerance. BLCR-based support using native InfiniBand Gen2 > > > interface is provided. Flexible interface to work with different > > > file systems. Tested with ext3 (local disk), NFS and PVFS2. > > > > > > Performance of sample applications with checkpoint-restart using > > > PVFS2 and Lustre can be found here: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/performance/mvapich2/application/MVAPICH2-ckpt.html > > > > > > - iWARP support: Incorporates the support for OpenFabrics/Gen2-iWARP. > > > Tested with Chelsio T3 (10GigE) and Ammasso iWARP adapters and > > > drivers. > > > > > > - RDMA CM-based Connection management support > > > > > > - Shared memory optimizations for collective communication operations. > > > Efficient algorithms and optimizations for barrier, reduce and > > > all-reduce operations. Exploits the multi-core optimized shared > > > memory point-to-point communication support introduced in MVAPICH2 > > > 0.9.6. > > > > > > Performance of sample collective operations with this new feature > > > can be found here: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/perf-coll.html > > > > > > - uDAPL support for NetEffect 10GigE adapter. Tested with > > > NetEffect NE010 adapter. 
> > > > > > More details on all features and supported platforms can be obtained > > > by visiting the following URL: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/mvapich2_features.html > > > > > > MVAPICH2 0.9.8 release is tested with the latest OFED 1.1 stack. It > > > continues to deliver excellent performance. Sample performance > > > numbers include: > > > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR: > > > Two-sided operations: > > > - 2.81 microsec one-way latency (4 bytes) > > > - 1561 MB/sec unidirectional bandwidth > > > - 2935 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.92 microsec Put latency > > > - 1569 MB/sec unidirectional Put bandwidth > > > - 2935 MB/sec bidirectional Put bandwidth > > > > > > - OpenFabrics/Gen2 on EM64T dual-core with PCI-Ex and IBA-DDR (Dual-rail): > > > Two-sided operations: > > > - 2.81 microsec one-way latency (4 bytes) > > > - 3127 MB/sec unidirectional bandwidth > > > - 5917 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.37 microsec Put latency > > > - 3137 MB/sec unidirectional Put bandwidth > > > - 5917 MB/sec bidirectional Put bandwidth > > > > > > - OpenFabrics/Gen2 on Opteron single-core with PCI-Ex and IBA-DDR: > > > Two-sided operations: > > > - 3.01 microsec one-way latency (4 bytes) > > > - 1402 MB/sec unidirectional bandwidth > > > - 2238 MB/sec bidirectional bandwidth > > > > > > One-sided operations: > > > - 4.65 microsec Put latency > > > - 1402 MB/sec unidirectional Put bandwidth > > > - 2238 MB/sec bidirectional Put bandwidth > > > > > > Performance numbers for all other platforms, system configurations and > > > operations can be viewed by visiting `Performance' section of the > > > project's web page. > > > > > > With the ADI-3-level design, MVAPICH2 0.9.8 delivers similar > > > performance for two-sided operations compared to MVAPICH 0.9.8. 
> > > Organizations and users interested in getting the best performance for > > > both two-sided and one-sided operations and also want to exploit > > > advanced features (such as fault tolerance with checkpoint/restart, > > > iWARP, RDMA CM connection management, multi-threading, integrated > > > multi-rail, multi-core optimization, memory hook support and optimized > > > collectives) may migrate from MVAPICH code base to MVAPICH2 code base. > > > > > > For downloading MVAPICH2 0.9.8 package and accessing the anonymous > > > SVN, please visit the following URL: > > > > > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ > > > > > > A stripped down version of this release is also available at the > > > OpenFabrics SVN. > > > > > > All feedbacks, including bug reports and hints for performance tuning, > > > are welcome. Please post it to the mvapich-discuss mailing list. > > > > > > Thanks, > > > > > > MVAPICH Team at OSU/NBCL > > > > > > ====================================================================== > > > MVAPICH/MVAPICH2 project is currently supported with funding from > > > U.S. National Science Foundation, U.S. DOE Office of Science, > > > Mellanox, Intel, Cisco Systems, Sun Microsystems and Linux Networx; > > > and with equipment support from Advanced Clustering, AMD, Apple, > > > Appro, Dell, IBM, Intel, Mellanox, Microway, PathScale, SilverStorm > > > and Sun Microsystems. Other technology partner includes Etnus. > > > ====================================================================== > > > > > > _______________________________________________ > > > openfabrics-ewg mailing list > > > openfabrics-ewg at openib.org > > > http://openib.org/mailman/listinfo/openfabrics-ewg > > > > > > > > > > > > > > > --------------------------------- > > > Sponsored Link > > > > > > Try Netflix today! With plans starting at only $5.99 a month what are you waiting for? 
> > > --0-119921422-1163376356=:19890 > > > Content-Type: text/html; charset=iso-8859-1 > > > Content-Transfer-Encoding: 8bit > > > > > > I am trying to use the OSC MPI tool for the iWARP and quite new to the open fabrics tools. > > If I know it correctly, iWARP is not yet part of the OFED release. > But the iWARP makefile has reference to ofed code. Is it really required. > > Is there any document describing the build process of the MVAPICH2 tool? I am going through the MVAPICH2 0.9.8 users guide and that does not seem to be giving me the detailed information. > > Can you please provide README files for the iWARP which describes TODO steps? > > Thanks, > David
> > > --0-119921422-1163376356=:19890-- -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: feedback Type: application/octet-stream Size: 3128 bytes Desc: 1027603996-feedback URL: From krkumar2 at in.ibm.com Mon Nov 13 20:44:37 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Tue, 14 Nov 2006 10:14:37 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: Message-ID: Hi Tom, > No, to understand why go look at the implementation of queue_work. BTW, this I was describing the implementation of queue_work() in my previous mail. So sorry to be dense, but I do not understand why this patch introduces a race. Can you explain the race that you found? What I understood of queue_work() is: if cm_work_handler() is already running and processing the last entry at the same time this new entry is added, it is guaranteed to find this new entry in its current run iteration and process it. The only issue is with the extra queue_work() done by iwcm in parallel on a different CPU for the same case.
Suppose iwcm had done a redundant queue_work() on this queue. Besides adding the new entry to the workqueue, that call also wakes up worker_thread (which is still running the previous iteration of run_workqueue -> cm_work_handler). I am assuming that the wake up function is default_wake_function(), since I couldn't locate in the wait* code where this is initialized. When cm_work_handler finishes removing this new entry, it returns to worker_thread, which will do a schedule() and sleep till it is woken up again (since default_wake_function found that the thread was already running and did nothing). Are you referring to a race where the queue_work() is done between the time cm_work_handler finished running and before it gets back to schedule()? I feel that should not matter, as run_workqueue() will find this entry in its cwq->worklist and continue processing instead of exiting to worker_thread() and schedule(). Still confused about the race :) Thanks, - KK From bugzilla-daemon at openib.org Mon Nov 13 21:59:03 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Mon, 13 Nov 2006 21:59:03 -0800 (PST) Subject: [openib-general] [Bug 263] OFED 1.1 rc6: IPoIB Oops during IPoIB failover loop Message-ID: <20061114055903.E4FBD2283D8@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=263 rolandd at cisco.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #13 from rolandd at cisco.com 2006-11-13 21:59 ------- Fix is merged into Linus's tree as commit 39798695 (ie sometime after 2.6.19-rc5) ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
From ogerlitz at voltaire.com Mon Nov 13 23:49:42 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 14 Nov 2006 09:49:42 +0200 Subject: [openib-general] [PATCH 1/7] IB/core - Add DMA mapping functions to allow device drivers to interpose In-Reply-To: <1162506570.29948.567.camel@brick.pathscale.com> References: <1162506570.29948.567.camel@brick.pathscale.com> Message-ID: <45597516.5010000@voltaire.com> Ralph Campbell wrote: > IB/core - Add DMA mapping functions to allow device drivers to interpose > > The QLogic InfiniPath HCAs use programmed I/O instead of HW DMA. > This patch allows a verbs device driver to interpose on DMA mapping > function calls in order to avoid relying on bus_to_virt() and > phys_to_virt() to undo the mappings created by dma_map_single(), > dma_map_sg(), etc. Yet another issue I see here: I understand that the core you are using in your HCA is capable of doing DMA but is able to do so from one RX and one TX cyclic descriptor buffer, correct? (*) Hence on the RX flow, you first get the IB packet into a buffer allocated by the driver and then copy it to the buffer posted by the user to the QP. But what happens on the TX flow? Per my understanding, there you should be able to do it zero-copy, but you do need to provide a **dma** (and not kernel virtual nor physical) address to the HCA. This means that for DMA_TO_DEVICE mappings you do want to call the original dma_map_xxx calls, and Roland's suggestion to have ib_dma_map_xxx calls return u64 is problematic from another aspect than the one I pointed out. Going to SC06 and then vacationing, I will probably not be able to jump on this thread again before Nov 26th. Anyway, my personal feeling is that this needs to be baked a little more; specifically, pushing it for 2.6.20-rc1 might be moving too fast... but if you guys feel the open issues are resolved, feel free to go ahead. cheers, Or. (*) kernel bypass for rx/tx postings/completions by libipath seems to me like black magic...
you have to copy the rx data from the driver cyclic buffer to the buffer pointed to by the user space QP WQE without passing through the kernel. Do you do it at poll_cq time? How do you post tx? From mst at mellanox.co.il Tue Nov 14 00:00:52 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Nov 2006 10:00:52 +0200 Subject: [openib-general] IPoIB new multicast API patches oops In-Reply-To: <000101c70756$ba41e110$1bcd180a@amr.corp.intel.com> References: <000101c70756$ba41e110$1bcd180a@amr.corp.intel.com> Message-ID: <20061114080052.GA26806@mellanox.co.il> Quoting Sean Hefty : > Subject: RE: IPoIB new multicast API patches oops > > I have not been able to reproduce this crash on my systems, and even > instrumenting the code isn't helping me to locate the issue. Can you > apply the following patch on top of the previous patches, and let me > know if you get any additional output? OK, I hope to get back to testing this next-week-ish. -- MST From zhushisongzhu at yahoo.com Tue Nov 14 00:57:23 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Tue, 14 Nov 2006 00:57:23 -0800 (PST) Subject: [openib-general] compile error In-Reply-To: <45584D97.5050807@voltaire.com> Message-ID: <20061114085723.18161.qmail@web36905.mail.mud.yahoo.com> I've got the whole kernel 2.6.19-rc5 with infiniband. But there is no sdp module there. Can OFED1.x run under FC5? zhu --- Or Gerlitz wrote: > zhu shi song wrote: > > how can I git latest tree? > > git clone > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git > > Or. From mst at mellanox.co.il Tue Nov 14 01:58:45 2006 From: mst at mellanox.co.il (Michael S.
Tsirkin) Date: Tue, 14 Nov 2006 11:58:45 +0200 Subject: [openib-general] [PATCH] IB/ipoib: DMA alignment on ppc64 In-Reply-To: References: Message-ID: <20061114095845.GA27446@mellanox.co.il> Quoting r. Roland Dreier : > Subject: Re: [PATCH] IB/ipoib: DMA alignment on ppc64 > > I don't think this is what's needed. The GRH leaves a gap of 40, so > getting rid of the skb_reserve() just means that DMA will start at an > offset of 40 rather than 44. Ugh. Correct - unless the GRH is present in the incoming packet - which we probably need not optimize for for now. > I think you need to reserve enough to get to a full cacheline > boundary, but I can't remember if that's 64 or 128 bytes. L1_CACHE_BYTES? Hmm. How about we add a S/G entry to put both the GRH *and* the IPoIB encap in a separate chunk (which we can then ignore)? -- MST From mst at mellanox.co.il Tue Nov 14 02:24:02 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Nov 2006 12:24:02 +0200 Subject: [openib-general] make ipoib_ib_dev_stop void? Message-ID: <20061114102402.GB27446@mellanox.co.il> Shouldn't ipoib_ib_dev_stop be void? -- MST From ogerlitz at voltaire.com Tue Nov 14 03:05:07 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 14 Nov 2006 13:05:07 +0200 Subject: [openib-general] compile error In-Reply-To: <20061114085723.18161.qmail@web36905.mail.mud.yahoo.com> References: <20061114085723.18161.qmail@web36905.mail.mud.yahoo.com> Message-ID: <4559A2E3.7020309@voltaire.com> zhu shi song wrote: > I've got the whole kernel 2.6.19-rc5 with infiniband. > But there is no sdp module there. Can OFED1.x run > under FC5? Basically, SDP is not merged into the kernel, it is provided by OFED 1.1, however as far as i know you would not be able to build it within the kernel IB drivers framework as it uses some API call(s) (eg rdma_establish) which are not provided by the kernel IB api. Or. 
From zhushisongzhu at yahoo.com Tue Nov 14 03:29:11 2006 From: zhushisongzhu at yahoo.com (zhu shi song) Date: Tue, 14 Nov 2006 03:29:11 -0800 (PST) Subject: [openib-general] compile error In-Reply-To: <45584D97.5050807@voltaire.com> Message-ID: <20061114112911.89088.qmail@web36912.mail.mud.yahoo.com> I download OFED-1.1.1.tar.gz and try to compile it under FC5 with kernel 2.6.18. But there are errors when compile ipoib_fs.c. see below: /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_fs.c: In function 'ipoib_mcg_open': /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_fs.c:144: error: 'struct inode' has no member named 'u' /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_fs.c: In function 'ipoib_path_open': /var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_fs.c:250: error: 'struct inode' has no member named 'u' make[3]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib/ipoib_fs.o] Error 1 make[2]: *** [/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband/ulp/ipoib] Error 2 make[1]: *** [_module_/var/tmp/OFEDRPM/BUILD/openib-1.1/drivers/infiniband] Error 2 zhu --- Or Gerlitz wrote: > zhu shi song wrote: > > how can I git latest tree? > > git clone > git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git > > Or. > > ____________________________________________________________________________________ Do you Yahoo!? Everyone is raving about the all-new Yahoo! Mail beta. http://new.mail.yahoo.com From thomas.bub at thomson.net Tue Nov 14 03:48:00 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Tue, 14 Nov 2006 12:48:00 +0100 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. Message-ID: I don't have a svn repository. I'm living on OFED release will this work on the version in OFED-1.1? 
Thomas > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Monday, November 13, 2006 9:12 PM > To: Hal Rosenstock; Bub Thomas > Cc: Sean Hefty; Erez Cohen; openib-general at openib.org > Subject: Re: [openib-general] Unusable QP's on CM established connections > from gen2 client to gen1 server. > > On 19:49 Mon 13 Nov , Hal Rosenstock wrote: > > There may be an OpenSM bug with setting hop limit in the path record > response. I'm looking at it now. > > Looks like it is - OpenSM returns the same hop_limit value as was in the > request. > > Bub, could you try the patch below? > > Thanks, > Sasha > > > diff --git a/osm/opensm/osm_sa_path_record.c > b/osm/opensm/osm_sa_path_record.c > index 560e385..72a89d0 100644 > --- a/osm/opensm/osm_sa_path_record.c > +++ b/osm/opensm/osm_sa_path_record.c > @@ -709,6 +710,8 @@ __osm_pr_rcv_build_pr( > p_pr->dlid = cl_hton16( dest_lid_ho ); > p_pr->slid = cl_hton16( src_lid_ho ); > > + p_pr->hop_flow_raw &= cl_hton32(1<<31); > + > p_pr->pkey = p_parms->pkey; > p_pr->sl = cl_hton16(p_parms->sl); > p_pr->mtu = (uint8_t)(p_parms->mtu | 0x80); > > > > > > -- Hal > > > > ________________________________ > > > > From: openib-general-bounces at openib.org on behalf of Sean Hefty > > Sent: Mon 11/13/2006 11:26 AM > > To: Bub Thomas > > Cc: Erez Cohen; openib-general at openib.org > > Subject: Re: [openib-general] Unusable QP's on CM established > connections from gen2 client to gen1 server. > > > > > > > > Bub Thomas wrote: > > > Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-) > > > Don't ask me where I got that hop_limit from, it must have been an > > > example I found somewhere. > > > Can you explain why that hop_limit/is_global makes a difference in > > > communication between gen1 and gen2? Does the counterpart need to have > > > the same hop_limit? > > > > The gen2 stack uses a hop_limit > 0 to indicate that global routing is > being > > used. 
If the hop_limit is > 0, then the global routing information must > be valid. > > > > > The path record values I use are queried from the OSM using a > > > SERVICE_RECORD query followed by a path record query. > > > I'm not using any alternate path record values, is this critical? > > > > Everything is supposed to work with the path records returned from the > SM. I > > was wondering if you were querying for the path record, modifying the > returned > > value, or creating a path record yourself. > > > > > path_record.packet_life_time = 0; > > > > I would set this higher (maybe between 12-20, with: 18 = 1 second, 19 = > 2 > > seconds, 20 = 4 seconds, etc.) > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general > > > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib- > general > > From ogerlitz at voltaire.com Tue Nov 14 03:50:47 2006 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Tue, 14 Nov 2006 13:50:47 +0200 Subject: [openib-general] compile error In-Reply-To: <20061114112911.89088.qmail@web36912.mail.mud.yahoo.com> References: <20061114112911.89088.qmail@web36912.mail.mud.yahoo.com> Message-ID: <4559AD97.2040607@voltaire.com> zhu shi song wrote: > I download OFED-1.1.1.tar.gz and try to compile it > under FC5 with kernel 2.6.18. But there are errors > when compile ipoib_fs.c. see below: Is it kernel.org 2.6.18 or some -stable version of it or the distro kernel? I am not sure OFED 1.1 supports the FC5 kernel. Or. From mst at mellanox.co.il Tue Nov 14 04:25:26 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Tue, 14 Nov 2006 14:25:26 +0200 Subject: [openib-general] compile error In-Reply-To: <20061114112911.89088.qmail@web36912.mail.mud.yahoo.com> References: <45584D97.5050807@voltaire.com> <20061114112911.89088.qmail@web36912.mail.mud.yahoo.com> Message-ID: <20061114122526.GA11383@mellanox.co.il> > Quoting r. zhu shi song : > Subject: Re: compile error > > I download OFED-1.1.1.tar.gz and try to compile it > under FC5 with kernel 2.6.18. But there are errors > when compile ipoib_fs.c. see below: Right. FC5 has a patched kernel, not a mainline 2.6.18 that OFED 1.1 supports. The following might help though. -- MST -------------- next part -------------- Index: gen2_devel/drivers/infiniband/ulp/ipoib/ipoib_fs.c =================================================================== --- gen2_devel.orig/drivers/infiniband/ulp/ipoib/ipoib_fs.c +++ gen2_devel/drivers/infiniband/ulp/ipoib/ipoib_fs.c @@ -141,7 +141,7 @@ static int ipoib_mcg_open(struct inode * return ret; seq = file->private_data; - seq->private = inode->i_private; + seq->private = inode->u.generic_ip; return 0; } @@ -247,7 +247,7 @@ static int ipoib_path_open(struct inode return ret; seq = file->private_data; - seq->private = inode->i_private; + seq->private = inode->u.generic_ip; return 0; } From halr at voltaire.com Tue Nov 14 04:40:33 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Tue, 14 Nov 2006 14:40:33 +0200 Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server. References: Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F501894448@taurus.voltaire.com> Yes, if you build it from source. The same change should apply. If it doesn't automatically apply, it's simple to hand edit this one line change and rebuild. 

-- Hal

________________________________

From: Bub Thomas [mailto:thomas.bub at thomson.net]
Sent: Tue 11/14/2006 6:48 AM
To: Sasha Khapyorsky; Hal Rosenstock; Bub Thomas
Cc: Sean Hefty; Erez Cohen; openib-general at openib.org
Subject: RE: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server.

I don't have a svn repository.
I'm living on the OFED release. Will this work on the version in OFED-1.1?

Thomas

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Monday, November 13, 2006 9:12 PM
> To: Hal Rosenstock; Bub Thomas
> Cc: Sean Hefty; Erez Cohen; openib-general at openib.org
> Subject: Re: [openib-general] Unusable QP's on CM established connections
> from gen2 client to gen1 server.
>
> On 19:49 Mon 13 Nov , Hal Rosenstock wrote:
> > There may be an OpenSM bug with setting hop limit in the path record
> > response. I'm looking at it now.
>
> Looks like it is - OpenSM returns the same hop_limit value as was in the
> request.
>
> Bub, could you try the patch below?
>
> Thanks,
> Sasha
>
>
> diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c
> index 560e385..72a89d0 100644
> --- a/osm/opensm/osm_sa_path_record.c
> +++ b/osm/opensm/osm_sa_path_record.c
> @@ -709,6 +710,8 @@ __osm_pr_rcv_build_pr(
>  	p_pr->dlid = cl_hton16( dest_lid_ho );
>  	p_pr->slid = cl_hton16( src_lid_ho );
>
> +	p_pr->hop_flow_raw &= cl_hton32(1<<31);
> +
>  	p_pr->pkey = p_parms->pkey;
>  	p_pr->sl = cl_hton16(p_parms->sl);
>  	p_pr->mtu = (uint8_t)(p_parms->mtu | 0x80);
>
>
> >
> > -- Hal
> >
> > ________________________________
> >
> > From: openib-general-bounces at openib.org on behalf of Sean Hefty
> > Sent: Mon 11/13/2006 11:26 AM
> > To: Bub Thomas
> > Cc: Erez Cohen; openib-general at openib.org
> > Subject: Re: [openib-general] Unusable QP's on CM established
> > connections from gen2 client to gen1 server.
> >
> >
> >
> > Bub Thomas wrote:
> > > Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-)
> > > Don't ask me where I got that hop_limit from, it must have been an
> > > example I found somewhere.
> > > Can you explain why that hop_limit/is_global makes a difference in
> > > communication between gen1 and gen2? Does the counterpart need to have
> > > the same hop_limit?
> >
> > The gen2 stack uses a hop_limit > 0 to indicate that global routing is being
> > used. If the hop_limit is > 0, then the global routing information must be valid.
> >
> > > The path record values I use are queried from the OSM using a
> > > SERVICE_RECORD query followed by a path record query.
> > > I'm not using any alternate path record values, is this critical?
> >
> > Everything is supposed to work with the path records returned from the SM. I
> > was wondering if you were querying for the path record, modifying the returned
> > value, or creating a path record yourself.
> >
> > > path_record.packet_life_time = 0;
> >
> > I would set this higher (maybe between 12-20, with: 18 = 1 second, 19 = 2
> > seconds, 20 = 4 seconds, etc.)

From todd.rimmer at qlogic.com Tue Nov 14 04:58:40 2006
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Tue, 14 Nov 2006 06:58:40 -0600
Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server.
Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061190BDC03@EPEXCH2.qlogic.org>

> From: Bub Thomas
> Sent: Monday, November 13, 2006 3:37 AM
> To: Sean Hefty; Bub Thomas
> Cc: Erez Cohen; openib-general at openib.org
> Subject: Re: [openib-general] Unusable QP's on CM established connections
> from gen2 client to gen1 server.
>
> Sean,
> you got it!
> Setting the hop_limit from 64 down to 0 or 1 solved the problem. :-)
> Don't ask me where I got that hop_limit from, it must have been an
> example I found somewhere.
> Can you explain why that hop_limit/is_global makes a difference in
> communication between gen1 and gen2? Does the counterpart need to have
> the same hop_limit?

Hop limit is often used to identify local vs global routes. A hop limit can
result in a global route being assumed and hence the unexpected use of GRH
network layer headers.

> The path record values I use are queried from the OSM using a
> SERVICE_RECORD query followed by a path record query.
> I'm not using any alternate path record values, is this critical?
> In addition I enclose the values I pass into the ib_cm_send_req call.
> Can you please have a look if you find something else looking abnormal.
> Thanks
> Thomas Bub
>
> req_param.qp_type = IBV_QPT_RC;
> req_param.qp_num = _dataQpNum;
> req_param.starting_psn = _dataQpNum;
> req_param.service_id = htonll(SERVICE_ID);
>
> req_param.primary_path = &path_record;
> req_param.alternate_path = NULL;
> req_param.private_data = NULL;
> req_param.private_data_len = 0;
>
> req_param.responder_resources = 4;
> req_param.initiator_depth = 4;

These should not be hardcoded, but should come from a query of the CA
capabilities.

> req_param.remote_cm_response_timeout = 20;
> req_param.local_cm_response_timeout = 20;

These should be computed based on path record pkt lifetime and local CA Ack
turnaround time. Check the archives, about a month ago I posted some
computations for these values.
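
As a rough illustration of what "computed" could look like: InfiniBand encodes these timeouts as an exponent n meaning 4.096 usec * 2^n, which is where "18 = 1 second" elsewhere in this thread comes from. The sketch below converts the encoding to nanoseconds and derives a response timeout as one round trip (twice the packet lifetime) plus the CA ack delay. The function names and the combination rule are assumptions for illustration only; this is not the computation from the archived post, nor what any particular CM implementation does.

```c
#include <stdint.h>

/* Encoded IB time n means 4.096 usec * 2^n. Returns nanoseconds.
 * Exponents above 31 are clamped to keep the shift well defined. */
static uint64_t ib_time_ns(unsigned n)
{
	if (n > 31)
		n = 31;
	return UINT64_C(4096) << n;
}

/* Smallest exponent whose duration is >= ns, capped at 31
 * (the CM response timeout fields are 5 bits wide). */
static unsigned ib_time_encode(uint64_t ns)
{
	unsigned n = 0;

	while (n < 31 && ib_time_ns(n) < ns)
		n++;
	return n;
}

/* Hypothetical rule: round trip (2 * packet lifetime) + CA ack delay,
 * added in linear time and then re-encoded. */
static unsigned cm_response_timeout(unsigned pkt_life, unsigned ca_ack_delay)
{
	return ib_time_encode(2 * ib_time_ns(pkt_life) +
			      ib_time_ns(ca_ack_delay));
}
```

For example, a path with packet_life_time 18 (about 1.07 seconds) and a CA ack delay of 15 gives an encoded timeout of 20 under this rule, while a packet_life_time of 0 gives a timeout of only a few microseconds.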

> req_param.retry_count = 7;
> req_param.rnr_retry_count = 7;

FYI for RNR retry, 7=infinite.

> req_param.max_cm_retries = 5;
>
> path_record.sgid = _localGid;
> path_record.dgid = _remoteGid;
> path_record.slid = htons(_localLID);
> path_record.dlid = htons(_remoteLID);
> path_record.flow_label = 0;
> path_record.hop_limit = 0;
> path_record.traffic_class = 0;
> path_record.pkey = 0xffff;
> path_record.sl = 0;
> path_record.rate = IBV_RATE_10_GBPS;
> path_record.packet_life_time = 0;
> path_record.mtu = IBV_MTU_2048;

All the path record values should come from the SA. While hardcoding might
work in some cases, it will not work on all fabrics. For example, in a DDR
fabric setting the rate to 10 GBPS will run at 1/2 the potential bandwidth.

Todd Rimmer

From sven-hendrik.voss at hhi.fraunhofer.de Tue Nov 14 05:07:13 2006
From: sven-hendrik.voss at hhi.fraunhofer.de (Voss)
Date: Tue, 14 Nov 2006 14:07:13 +0100
Subject: [openib-general] VHDL / C co-implementation
Message-ID: <20061114130714.46D8D1D88F48@mail.hhi.fraunhofer.de>

Hi all,

we're currently trying to implement an Infiniband IBx4 interface on a
Xilinx Virtex-II Pro FPGA. So first of all we checked out the SVN
repository openfabrics.org/svn but unfortunately it now seems unclear to
us which pieces of the code there are suitable for our goal:

What we'd like to do is run the transport layer (in C) on the FPGA's
integrated PowerPC while the link layer is already implemented in VHDL on
the FPGA.

To what extent can the C-source be used for an implementation on a PowerPC?
Is it necessary to run Linux on the PowerPC or is it possible to run the
link layer directly on the PowerPC without any operating system?
How can the data be accessed from the link layer, is it done via memory or
a dedicated bus?

Thanks in advance and many regards,
Sven

From mst at mellanox.co.il Tue Nov 14 05:19:15 2006
From: mst at mellanox.co.il (Michael S.
Tsirkin)
Date: Tue, 14 Nov 2006 15:19:15 +0200
Subject: [openib-general] Unusable QP's on CM established connections from gen2 client to gen1 server.
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE061190BDC03@EPEXCH2.qlogic.org>
References: <4FB1BCCAE6CAED44A1DC005B1DE061190BDC03@EPEXCH2.qlogic.org>
Message-ID: <20061114131914.GC27446@mellanox.co.il>

Quoting Todd Rimmer :
> > req_param.responder_resources = 4;
> > req_param.initiator_depth = 4;
>
> These should not be hardcoded, but should come from a query of the CA
> capabilities.

Whether it makes sense to query the adapter or hard-code these fields
depends on the protocol - the HCA should validate these fields and report
an error if they are too high.

-- 
MST

From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:50:33 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Tue, 14 Nov 2006 20:20:33 +0530
Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx)
Message-ID: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com>

This patch set adds support for the QLogic Virtual Ethernet I/O controller
(VEx), which presents a true Ethernet NIC to the host. This driver provides
a standard Ethernet NIC interface to the system and treats IB as an I/O bus
to allow a host CPU to use the VEx card as its NIC.

This patch set has been tested with the for-2.6.20 branch and incorporates
suggestions from the last round of review (details of changes at the end of
the mail). The one item still pending is the style of debug tracing and
replacing printks with dev_info etc. I will be working on this soon.

Roland,

If you think these patches are good enough, could you please create a
branch in your git tree based on for-2.6.20 for this code? I thought it
would be a good idea if these patches went through another round of review
here before posting to lkml and netdev.

Changes from last round of review:

* Removed trivial pass through functions
* Moved all #ifdef code to .h files
* Introduced sparse annotations. The driver now passes
  make C=2 CF=-D__CHECK_ENDIAN__ cleanly
* Fixed return types to use the standard convention of 0 for success
  and < 0 for failure
* Split large functions into smaller functions
* Replaced non-standard macros with standard kernel macros
* Using sysfs_create_group() for creating multiple sysfs files
* Lots of cleanups to enhance readability of the code

Signed-off-by:

Regards,
Ram

From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:51:30 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Tue, 14 Nov 2006 20:21:30 +0530
Subject: [openib-general] [PATCH v2 1/11] Driver Main files - netdev functions and corresponding state maintenance
Message-ID: <455A254A.32264.60F0E7E@ramachandra.kuchimanchi.qlogic.com>

Adds the driver main files. These files implement netdev registration,
netdev functions and state maintenance of the virtual NIC corresponding to
the various events associated with the Virtual Ethernet I/O Controller
(VEx) connection.

Signed-off-by: Ramachandra K
---

 drivers/infiniband/ulp/vnic/vnic_main.c | 1029 +++++++++++++++++++++++++++++++
 drivers/infiniband/ulp/vnic/vnic_main.h |  130 ++++
 2 files changed, 1159 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/vnic/vnic_main.c b/drivers/infiniband/ulp/vnic/vnic_main.c
new file mode 100644
index 0000000..61f9394
--- /dev/null
+++ b/drivers/infiniband/ulp/vnic/vnic_main.c
@@ -0,0 +1,1029 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_netpath.h" +#include "vnic_viport.h" +#include "vnic_ib.h" +#include "vnic_stats.h" + +#define MODULEVERSION "0.1" +#define MODULEDETAILS "Virtual NIC driver version " MODULEVERSION + +MODULE_AUTHOR("Ramachandra K"); +MODULE_DESCRIPTION(MODULEDETAILS); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_SUPPORTED_DEVICE("QLogic Ethernet Virtual I/O Controller"); + +u32 vnic_debug = 0; + +module_param(vnic_debug, uint, 0444); +MODULE_PARM_DESC(vnic_debug, "Enable debug tracing if > 0"); + +LIST_HEAD(vnic_list); + +const char driver[] = "vnic"; + +DECLARE_WAIT_QUEUE_HEAD(vnic_npevent_queue); +LIST_HEAD(vnic_npevent_list); +DECLARE_COMPLETION(vnic_npevent_thread_exit); +spinlock_t vnic_npevent_list_lock = SPIN_LOCK_UNLOCKED; +int vnic_npevent_thread = -1; +int vnic_npevent_thread_end = 0; + + +void vnic_connected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_connected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); + + vnic_connected_stats(vnic); +} + +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_disconnected()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_DISCONNECTED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_DISCONNECTED); +} + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_up()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKUP); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKUP); +} + +void vnic_link_down(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_link_down()\n"); + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_LINKDOWN); + else + 
vnic_npevent_queue_evt(netpath, VNIC_PRINP_LINKDOWN); +} + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_stop_xmit()\n"); + if (netpath == vnic->current_path) { + if (vnic->xmit_started) { + netif_stop_queue(&vnic->netdevice); + vnic->xmit_started = 0; + } + + vnic_stop_xmit_stats(vnic); + } +} + +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath) +{ + VNIC_FUNCTION("vnic_restart_xmit()\n"); + if (netpath == vnic->current_path) { + if (!vnic->xmit_started) { + netif_wake_queue(&vnic->netdevice); + vnic->xmit_started = 1; + } + + vnic_restart_xmit_stats(vnic); + } +} + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb) +{ + VNIC_FUNCTION("vnic_recv_packet()\n"); + if ((netpath != vnic->current_path) || !vnic->open) { + VNIC_INFO("tossing packet\n"); + dev_kfree_skb(skb); + return; + } + + vnic->netdevice.last_rx = jiffies; + skb->dev = &vnic->netdevice; + skb->protocol = eth_type_trans(skb, skb->dev); + if (!vnic->config->use_rx_csum) + skb->ip_summed = CHECKSUM_NONE; + netif_rx(skb); + vnic_recv_pkt_stats(vnic); +} + +static struct net_device_stats *vnic_get_stats(struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + + VNIC_FUNCTION("vnic_get_stats()\n"); + vnic = (struct vnic *)device->priv; + + np = vnic->current_path; + if (np && np->viport) + viport_get_stats(np->viport, &vnic->stats); + return &vnic->stats; +} + +static int vnic_open(struct net_device *device) +{ + struct vnic *vnic; + + VNIC_FUNCTION("vnic_open()\n"); + vnic = (struct vnic *)device->priv; + + vnic->open++; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_SETLINK); + vnic->xmit_started = 1; + netif_start_queue(&vnic->netdevice); + + return 0; +} + +static int vnic_stop(struct net_device *device) +{ + struct vnic *vnic; + int ret = 0; + + VNIC_FUNCTION("vnic_stop()\n"); + vnic = (struct vnic *)device->priv; + netif_stop_queue(device); + vnic->xmit_started = 0; + 
vnic->open--; + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_SETLINK); + + return ret; +} + +static int vnic_hard_start_xmit(struct sk_buff *skb, + struct net_device *device) +{ + struct vnic *vnic; + struct netpath *np; + cycles_t xmit_time; + int ret = -1; + + VNIC_FUNCTION("vnic_hard_start_xmit()\n"); + vnic = (struct vnic *)device->priv; + np = vnic->current_path; + + vnic_pre_pkt_xmit_stats(&xmit_time); + + if (np && np->viport) + ret = viport_xmit_packet(np->viport, skb); + + if (ret) { + vnic_xmit_fail_stats(vnic); + dev_kfree_skb_any(skb); + vnic->stats.tx_dropped++; + goto out; + } + + device->trans_start = jiffies; + vnic_post_pkt_xmit_stats(vnic, xmit_time); +out: + return 0; +} + +static void vnic_tx_timeout(struct net_device *device) +{ + struct vnic *vnic; + + VNIC_FUNCTION("vnic_tx_timeout()\n"); + vnic = (struct vnic *)device->priv; + device->trans_start = jiffies; + + if (vnic->current_path->viport) + viport_failure(vnic->current_path->viport); + + VNIC_ERROR("vnic_tx_timeout\n"); +} + +static void vnic_set_multicast_list(struct net_device *device) +{ + struct vnic *vnic; + unsigned long flags; + + VNIC_FUNCTION("vnic_set_multicast_list()\n"); + vnic = (struct vnic *)device->priv; + + spin_lock_irqsave(&vnic->lock, flags); + /* the vnic_link_evt thread also needs to be able to access + * mc_list. it is only safe to access the mc_list + * in the netdevice from this call, so make a local + * copy of it in the vnic. the mc_list is a linked + * list, but my copy is an array where each element's + * next pointer points to the next element. when I + * reallocate the list, I always size it with 10 + * extra elements so I don't have to resize it as + * often. I only downsize the list when it goes empty. 
+ */ + if (device->mc_count == 0) { + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + } else { + struct dev_mc_list *mc_list = device->mc_list; + int i; + + if (device->mc_count > vnic->mc_list_len) { + if (vnic->mc_list_len) + kfree(vnic->mc_list); + vnic->mc_list_len = device->mc_count + 10; + vnic->mc_list = kmalloc(vnic->mc_list_len * + sizeof *mc_list, GFP_ATOMIC); + if (!vnic->mc_list) { + vnic->mc_list_len = vnic->mc_count = 0; + VNIC_ERROR("failed allocating mc_list\n"); + goto failure; + } + } + vnic->mc_count = device->mc_count; + for (i = 0; i < device->mc_count; i++) { + vnic->mc_list[i] = *mc_list; + vnic->mc_list[i].next = &vnic->mc_list[i + 1]; + mc_list = mc_list->next; + } + } + spin_unlock_irqrestore(&vnic->lock, flags); + + if (vnic->primary_path.viport) + viport_set_multicast(vnic->primary_path.viport, + vnic->mc_list, vnic->mc_count); + + if (vnic->secondary_path.viport) + viport_set_multicast(vnic->secondary_path.viport, + vnic->mc_list, vnic->mc_count); + + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_SETLINK); + return; +failure: + spin_unlock_irqrestore(&vnic->lock, flags); +} + +static int vnic_set_mac_address(struct net_device *device, void *addr) +{ + struct vnic *vnic; + struct sockaddr *sockaddr = addr; + u8 *address; + int ret = -1; + + VNIC_FUNCTION("vnic_set_mac_address()\n"); + vnic = (struct vnic *)device->priv; + + if (!is_valid_ether_addr(sockaddr->sa_data)) + return -EADDRNOTAVAIL; + + if (netif_running(device)) + return -EBUSY; + + memcpy(device->dev_addr, sockaddr->sa_data, ETH_ALEN); + address = sockaddr->sa_data; + + if (vnic->primary_path.viport) + ret = viport_set_unicast(vnic->primary_path.viport, + address); + + if (ret) + return ret; + + /* Ignore result of set unicast for secondary path viport. 
+ * Consider the operation a success if we are able to atleast + * set the primary path viport address + */ + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, address); + + vnic->mac_set = 1; + /* I'm assuming that this should work even if nothing is connected + * at the moment. note that this might return before the address has + * actually been changed. + */ + return 0; +} + +static int vnic_change_mtu(struct net_device *device, int mtu) +{ + struct vnic *vnic; + int ret = 0; + int pri_max_mtu; + int sec_max_mtu; + + VNIC_FUNCTION("vnic_change_mtu()\n"); + vnic = (struct vnic *)device->priv; + + if (vnic->primary_path.viport) + pri_max_mtu = viport_max_mtu(vnic->primary_path.viport); + else + pri_max_mtu = MAX_PARAM_VALUE; + + if (vnic->secondary_path.viport) + sec_max_mtu = viport_max_mtu(vnic->secondary_path.viport); + else + sec_max_mtu = MAX_PARAM_VALUE; + + if ((mtu < pri_max_mtu) && (mtu < sec_max_mtu)) { + device->mtu = mtu; + vnic_npevent_queue_evt(&vnic->primary_path, + VNIC_NP_SETLINK); + } + + return ret; +} + +static int vnic_npevent_register(struct vnic *vnic, struct netpath *netpath) +{ + u8 *address; + int ret; + + if (!vnic->mac_set) { + /* if netpath == secondary_path, then the primary path isn't + * connected. MAC address will be set when the primary + * connects. 
+ */ + netpath_get_hw_addr(netpath, vnic->netdevice.dev_addr); + address = vnic->netdevice.dev_addr; + + if (vnic->secondary_path.viport) + viport_set_unicast(vnic->secondary_path.viport, + address); + + vnic->mac_set = 1; + } + + ret = register_netdev(&vnic->netdevice); + if (ret) { + printk(KERN_WARNING PFX "failed registering netdev " + "error %d\n", ret); + return ret; + } + + vnic->state = VNIC_REGISTERED; + vnic->carrier = 2; /*special value to force netif_carrier_(on|off)*/ + return 0; +} + +static void vnic_npevent_dequeue_all(struct vnic *vnic) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + + +static const char *const vnic_npevent_str[] = { + "PRIMARY CONNECTED", + "PRIMARY DISCONNECTED", + "PRIMARY CARRIER", + "PRIMARY NO CARRIER", + "PRIMARY TIMER EXPIRED", + "SECONDARY CONNECTED", + "SECONDARY DISCONNECTED", + "SECONDARY CARRIER", + "SECONDARY NO CARRIER", + "SECONDARY TIMER EXPIRED", + "SETLINK", + "FREE VNIC", +}; + +static void update_path_and_reconnect(struct netpath *netpath, + struct vnic *vnic) +{ + struct viport_config *config = netpath->viport->config; + int delay = 1; + + if (vnic_ib_get_path(netpath, vnic)) + return; + /* + * tell viport_connect to wait for default_no_path_timeout + * before connecting if we are retrying the same path index + * within default_no_path_timeout. + * This prevents flooding connect requests to a path (or set + * of paths) that aren't successfully connecting for some reason. 
+ */ + if (jiffies > netpath->connect_time + + vnic->config->no_path_timeout) { + netpath->path_idx = config->path_idx; + netpath->connect_time = jiffies; + delay = 0; + } else if (config->path_idx != netpath->path_idx) + delay = 0; + + viport_connect(netpath->viport, delay); +} + +static void vnic_set_uni_multicast(struct vnic * vnic, + struct netpath * netpath) +{ + unsigned long flags; + u8 *address; + + if (vnic->mac_set) { + address = vnic->netdevice.dev_addr; + + if (netpath->viport) + viport_set_unicast(netpath->viport, address); + } + spin_lock_irqsave(&vnic->lock, flags); + + if (vnic->mc_list && netpath->viport) + viport_set_multicast(netpath->viport, vnic->mc_list, + vnic->mc_count); + + spin_unlock_irqrestore(&vnic->lock, flags); + if (vnic->state == VNIC_REGISTERED) { + if (!netpath->viport) + return; + viport_set_link(netpath->viport, + vnic->netdevice.flags & ~IFF_UP, + vnic->netdevice.mtu); + } +} + +static void vnic_set_netpath_timers(struct vnic *vnic, + struct netpath *netpath) +{ + switch (netpath->timer_state) { + case NETPATH_TS_IDLE: + netpath->timer_state = NETPATH_TS_ACTIVE; + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer(netpath, + vnic->config-> + primary_connect_timeout); + else + netpath_timer(netpath, + vnic->config-> + primary_reconnect_timeout); + break; + case NETPATH_TS_ACTIVE: + /*nothing to do*/ + break; + case NETPATH_TS_EXPIRED: + if (vnic->state == VNIC_UNINITIALIZED) { + vnic_npevent_register(vnic, netpath); + } + break; + } +} + +static void vnic_check_primary_path_timer(struct vnic * vnic) +{ + switch (vnic->primary_path.timer_state) { + case NETPATH_TS_ACTIVE: + /* nothing to do. 
just wait */ + break; + case NETPATH_TS_IDLE: + netpath_timer(&vnic->primary_path, + vnic->config-> + primary_switch_timeout); + break; + case NETPATH_TS_EXPIRED: + printk(KERN_INFO PFX + "%s: switching to primary path\n", + vnic->config->name); + + vnic->current_path = &vnic->primary_path; + if (vnic->config->use_tx_csum + && netpath_can_tx_csum(vnic-> + current_path)) { + vnic->netdevice.features |= + NETIF_F_IP_CSUM; + } + break; + } +} + +static void vnic_carrier_loss(struct vnic * vnic, + struct netpath *last_path) +{ + if (vnic->primary_path.carrier) { + vnic->carrier = 1; + vnic->current_path = &vnic->primary_path; + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to primary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using primary path\n", + vnic->config->name); + + if (vnic->config->use_tx_csum && + netpath_can_tx_csum(vnic->current_path)) + vnic->netdevice.features |= NETIF_F_IP_CSUM; + + } else if ((vnic->secondary_path.carrier) && + (vnic->secondary_path.timer_state != NETPATH_TS_ACTIVE)) { + vnic->carrier = 1; + vnic->current_path = &vnic->secondary_path; + + if (last_path && last_path != vnic->current_path) + printk(KERN_INFO PFX + "%s: failing over to secondary path\n", + vnic->config->name); + else if (!last_path) + printk(KERN_INFO PFX "%s: using secondary path\n", + vnic->config->name); + + if (vnic->config->use_tx_csum && + netpath_can_tx_csum(vnic->current_path)) + vnic->netdevice.features |= NETIF_F_IP_CSUM; + + } + +} + +static void vnic_handle_path_change(struct vnic * vnic, + struct netpath **path) +{ + struct netpath * last_path = *path; + + if (!last_path) { + if (vnic->current_path == &vnic->primary_path) + last_path = &vnic->secondary_path; + else + last_path = &vnic->primary_path; + + } + + if (vnic->current_path && vnic->current_path->viport) + viport_set_link(vnic->current_path->viport, + vnic->netdevice.flags, + vnic->netdevice.mtu); + + if 
(last_path->viport) + viport_set_link(last_path->viport, + vnic->netdevice.flags & + ~IFF_UP, vnic->netdevice.mtu); + + vnic_restart_xmit(vnic, vnic->current_path); +} + +static void vnic_report_path_change(struct vnic * vnic, + struct netpath *last_path, + int other_path_ok) +{ + if (!vnic->current_path) { + if (last_path == &vnic->primary_path) + printk(KERN_INFO PFX "%s: primary path lost, " + "no failover path available\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path lost, " + "no failover path available\n", + vnic->config->name); + return; + } + + if (last_path != vnic->current_path) + return; + + if (vnic->current_path == &vnic->secondary_path) { + if (other_path_ok != vnic->primary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: primary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: primary path now" + " available for failover\n", + vnic->config->name); + } + } else { + if (other_path_ok != vnic->secondary_path.carrier) { + if (other_path_ok) + printk(KERN_INFO PFX "%s: secondary path no" + " longer available for failover\n", + vnic->config->name); + else + printk(KERN_INFO PFX "%s: secondary path now" + " available for failover\n", + vnic->config->name); + } + } +} + +static void vnic_handle_free_vnic_evt(struct vnic * vnic) +{ + netpath_timer_stop(&vnic->primary_path); + netpath_timer_stop(&vnic->secondary_path); + vnic->current_path = NULL; + netpath_free(&vnic->primary_path); + netpath_free(&vnic->secondary_path); + if (vnic->state == VNIC_REGISTERED) + unregister_netdev(&vnic->netdevice); + vnic_npevent_dequeue_all(vnic); + kfree(vnic->config); + if (vnic->mc_list_len) { + vnic->mc_list_len = vnic->mc_count = 0; + kfree(vnic->mc_list); + } + + sysfs_remove_group(&vnic->class_dev_info.class_dev.kobj, + &vnic_dev_attr_group); + vnic_cleanup_stats_files(vnic); + class_device_unregister(&vnic->class_dev_info.class_dev); + 
wait_for_completion(&vnic->class_dev_info.released); +} + +static struct vnic * vnic_handle_npevent(struct vnic *vnic, + enum vnic_npevent_type npevt_type) +{ + struct netpath *netpath; + + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", + vnic->config->name, vnic_npevent_str[npevt_type], + netpath_to_string(vnic, vnic->current_path), + vnic->carrier); + + switch (npevt_type) { + case VNIC_PRINP_CONNECTED: + netpath = &vnic->primary_path; + if (vnic->state == VNIC_UNINITIALIZED) { + if (vnic_npevent_register(vnic, netpath)) + break; + } + vnic_set_uni_multicast(vnic, netpath); + break; + case VNIC_SECNP_CONNECTED: + vnic_set_uni_multicast(vnic, &vnic->secondary_path); + break; + case VNIC_PRINP_TIMEREXPIRED: + netpath = &vnic->primary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (netpath->carrier) + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_TIMEREXPIRED: + netpath = &vnic->secondary_path; + netpath->timer_state = NETPATH_TS_EXPIRED; + if (netpath->carrier) { + if (vnic->state == VNIC_UNINITIALIZED) + vnic_npevent_register(vnic, netpath); + } else + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_PRINP_LINKUP: + vnic->primary_path.carrier = 1; + break; + case VNIC_SECNP_LINKUP: + netpath = &vnic->secondary_path; + netpath->carrier = 1; + if (!vnic->carrier) + vnic_set_netpath_timers(vnic, netpath); + break; + case VNIC_PRINP_LINKDOWN: + vnic->primary_path.carrier = 0; + break; + case VNIC_SECNP_LINKDOWN: + if (vnic->state == VNIC_UNINITIALIZED) + netpath_timer_stop(&vnic->secondary_path); + vnic->secondary_path.carrier = 0; + break; + case VNIC_PRINP_DISCONNECTED: + netpath = &vnic->primary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_SECNP_DISCONNECTED: + netpath = &vnic->secondary_path; + netpath_timer_stop(netpath); + netpath->carrier = 0; + update_path_and_reconnect(netpath, vnic); + break; + case VNIC_NP_FREEVNIC: + 
vnic_handle_free_vnic_evt(vnic); + kfree(vnic); + vnic = NULL; + break; + case VNIC_NP_SETLINK: + netpath = vnic->current_path; + if (!netpath || !netpath->viport) + break; + viport_set_link(netpath->viport, + vnic->netdevice.flags, + vnic->netdevice.mtu); + break; + } + return vnic; +} + +static int vnic_npevent_statemachine(void *context) +{ + struct vnic_npevent *vnic_link_evt; + enum vnic_npevent_type npevt_type; + struct vnic *vnic; + int last_carrier; + int other_path_ok = 0; + struct netpath *last_path; + + daemonize("vnic_link_evt"); + + while (!vnic_npevent_thread_end || + !list_empty(&vnic_npevent_list)) { + unsigned long flags; + + wait_event_interruptible(vnic_npevent_queue, + !list_empty(&vnic_npevent_list) + || vnic_npevent_thread_end); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) { + spin_unlock_irqrestore(&vnic_npevent_list_lock, + flags); + VNIC_INFO("netpath statemachine wake" + " on empty list\n"); + continue; + } + + vnic_link_evt = list_entry(vnic_npevent_list.next, + struct vnic_npevent, + list_ptrs); + list_del(&vnic_link_evt->list_ptrs); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + vnic = vnic_link_evt->vnic; + npevt_type = vnic_link_evt->event_type; + kfree(vnic_link_evt); + + if (vnic->current_path == &vnic->secondary_path) + other_path_ok = vnic->primary_path.carrier; + else if (vnic->current_path == &vnic->primary_path) + other_path_ok = vnic->secondary_path.carrier; + + vnic = vnic_handle_npevent(vnic, npevt_type); + + if (!vnic) + continue; + + last_carrier = vnic->carrier; + last_path = vnic->current_path; + + if (!vnic->current_path || + !vnic->current_path->carrier) { + vnic->carrier = 0; + vnic->current_path = NULL; + vnic->netdevice.features &= ~NETIF_F_IP_CSUM; + } + + if (!vnic->carrier) + vnic_carrier_loss(vnic, last_path); + else if ((vnic->current_path != &vnic->primary_path) && + (vnic->config->prefer_primary) && + (vnic->primary_path.carrier)) + 
vnic_check_primary_path_timer(vnic); + + if (last_path) + vnic_report_path_change(vnic, last_path, + other_path_ok); + + VNIC_INFO("new netpath=%s, carrier=%d\n", + netpath_to_string(vnic, vnic->current_path), + vnic->carrier); + + if (vnic->current_path != last_path) + vnic_handle_path_change(vnic, &last_path); + + if (vnic->carrier != last_carrier) { + if (vnic->carrier) { + VNIC_INFO("netif_carrier_on\n"); + netif_carrier_on(&vnic->netdevice); + vnic_carrier_loss_stats(vnic); + } else { + VNIC_INFO("netif_carrier_off\n"); + netif_carrier_off(&vnic->netdevice); + vnic_disconn_stats(vnic); + } + + } + } + complete_and_exit(&vnic_npevent_thread_exit, 0); + return 0; +} + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + struct vnic_npevent *npevent; + unsigned long flags; + + npevent = kmalloc(sizeof *npevent, GFP_ATOMIC); + if (!npevent) { + VNIC_ERROR("Could not allocate memory for vnic event\n"); + return; + } + npevent->vnic = netpath->parent; + npevent->event_type = evt; + INIT_LIST_HEAD(&npevent->list_ptrs); + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + list_add_tail(&npevent->list_ptrs, &vnic_npevent_list); + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); + wake_up(&vnic_npevent_queue); +} + +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt) +{ + unsigned long flags; + struct vnic_npevent *npevt, *tmp; + struct vnic * vnic = netpath->parent; + + spin_lock_irqsave(&vnic_npevent_list_lock, flags); + if (list_empty(&vnic_npevent_list)) + goto out; + list_for_each_entry_safe(npevt, tmp, &vnic_npevent_list, + list_ptrs) { + if ((npevt->vnic == vnic) && + (npevt->event_type == evt)) { + list_del(&npevt->list_ptrs); + kfree(npevt); + break; + } + } +out: + spin_unlock_irqrestore(&vnic_npevent_list_lock, flags); +} + +static int vnic_npevent_start(void) +{ + VNIC_FUNCTION("vnic_npevent_start()\n"); + + if ((vnic_npevent_thread = + kernel_thread(vnic_npevent_statemachine, 
NULL, 0)) < 0) { + printk(KERN_WARNING PFX "failed to create vnic npevent" + " thread; error %d\n", vnic_npevent_thread); + return vnic_npevent_thread; + } + + return 0; +} + +static void vnic_npevent_cleanup(void) +{ + if (vnic_npevent_thread >= 0) { + vnic_npevent_thread_end = 1; + wake_up(&vnic_npevent_queue); + wait_for_completion(&vnic_npevent_thread_exit); + } +} + +struct vnic *vnic_allocate(struct vnic_config *config) +{ + struct vnic *vnic = NULL; + struct net_device *device; + + VNIC_FUNCTION("vnic_allocate()\n"); + vnic = kzalloc(sizeof *vnic, GFP_KERNEL); + if (!vnic) { + VNIC_ERROR("failed allocating vnic structure\n"); + return NULL; + } + + vnic->lock = SPIN_LOCK_UNLOCKED; + vnic_alloc_stats(vnic); + vnic->state = VNIC_UNINITIALIZED; + vnic->config = config; + device = &vnic->netdevice; + + strcpy(device->name, config->name); + + ether_setup(device); + + device->priv = (void *)vnic; + device->get_stats = vnic_get_stats; + device->open = vnic_open; + device->stop = vnic_stop; + device->hard_start_xmit = vnic_hard_start_xmit; + device->tx_timeout = vnic_tx_timeout; + device->set_multicast_list = vnic_set_multicast_list; + device->set_mac_address = vnic_set_mac_address; + device->change_mtu = vnic_change_mtu; + device->watchdog_timeo = HZ; + device->features = 0; + + netpath_init(&vnic->primary_path, vnic, 0); + netpath_init(&vnic->secondary_path, vnic, 1); + + vnic->current_path = NULL; + + list_add_tail(&vnic->list_ptrs, &vnic_list); + + return vnic; +} + +void vnic_free(struct vnic *vnic) +{ + VNIC_FUNCTION("vnic_free()\n"); + list_del(&vnic->list_ptrs); + vnic_npevent_queue_evt(&vnic->primary_path, VNIC_NP_FREEVNIC); +} + +static void __exit vnic_cleanup(void) +{ + VNIC_FUNCTION("vnic_cleanup()\n"); + + VNIC_INIT("unloading %s\n", MODULEDETAILS); + + while (!list_empty(&vnic_list)) { + struct vnic *vnic = + list_entry(vnic_list.next, struct vnic, list_ptrs); + vnic_free(vnic); + } + + vnic_npevent_cleanup(); + viport_cleanup(); + vnic_ib_cleanup(); 
+} + +static int __init vnic_init(void) +{ + int ret; + VNIC_FUNCTION("vnic_init()\n"); + VNIC_INIT("Initializing %s\n", MODULEDETAILS); + + if ((ret=config_start())) { + VNIC_ERROR("config_start failed\n"); + goto failure; + } + + if ((ret=vnic_ib_init())) { + VNIC_ERROR("ib_start failed\n"); + goto failure; + } + + if ((ret=viport_start())) { + VNIC_ERROR("viport_start failed\n"); + goto failure; + } + + if ((ret=vnic_npevent_start())) { + VNIC_ERROR("vnic_npevent_start failed\n"); + goto failure; + } + + return 0; +failure: + vnic_cleanup(); + return ret; +} + +module_init(vnic_init); +module_exit(vnic_cleanup); diff --git a/drivers/infiniband/ulp/vnic/vnic_main.h b/drivers/infiniband/ulp/vnic/vnic_main.h new file mode 100644 index 0000000..2f4ecb3 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_main.h @@ -0,0 +1,130 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_MAIN_H_INCLUDED +#define VNIC_MAIN_H_INCLUDED + +#include +#include + +#include "vnic_config.h" +#include "vnic_netpath.h" + +enum vnic_npevent_type { + VNIC_PRINP_CONNECTED = 0, + VNIC_PRINP_DISCONNECTED = 1, + VNIC_PRINP_LINKUP = 2, + VNIC_PRINP_LINKDOWN = 3, + VNIC_PRINP_TIMEREXPIRED = 4, + VNIC_SECNP_CONNECTED = 5, + VNIC_SECNP_DISCONNECTED = 6, + VNIC_SECNP_LINKUP = 7, + VNIC_SECNP_LINKDOWN = 8, + VNIC_SECNP_TIMEREXPIRED = 9, + VNIC_NP_SETLINK = 10, + VNIC_NP_FREEVNIC = 11 +}; + +struct vnic_npevent { + struct list_head list_ptrs; + struct vnic *vnic; + enum vnic_npevent_type event_type; +}; + +void vnic_npevent_queue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); +void vnic_npevent_dequeue_evt(struct netpath *netpath, + enum vnic_npevent_type evt); + +enum vnic_state { + VNIC_UNINITIALIZED = 0, + VNIC_REGISTERED = 1 +}; + +struct vnic { + struct list_head list_ptrs; + enum vnic_state state; + struct vnic_config *config; + struct netpath *current_path; + struct netpath primary_path; + struct netpath secondary_path; + int open; + int carrier; + int xmit_started; + int mac_set; + struct net_device_stats stats; + struct net_device netdevice; + struct class_dev_info class_dev_info; + struct dev_mc_list *mc_list; + int mc_list_len; + int mc_count; + spinlock_t lock; +#ifdef CONFIG_INFINIBAND_VNIC_STATS + struct { + cycles_t start_time; + cycles_t conn_time; + cycles_t disconn_ref; /* intermediate time */ + cycles_t disconn_time; + u32 disconn_num; + cycles_t xmit_time; + u32 xmit_num; + u32 xmit_fail; + cycles_t recv_time; + u32 recv_num; + cycles_t xmit_ref; /* intermediate time */ + cycles_t xmit_off_time; + u32 xmit_off_num; + cycles_t carrier_ref; /* 
intermediate time */ + cycles_t carrier_off_time; + u32 carrier_off_num; + } statistics; + struct class_dev_info stat_info; +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ +}; + +struct vnic *vnic_allocate(struct vnic_config *config); + +void vnic_free(struct vnic *vnic); + +void vnic_connected(struct vnic *vnic, struct netpath *netpath); +void vnic_disconnected(struct vnic *vnic, struct netpath *netpath); + +void vnic_link_up(struct vnic *vnic, struct netpath *netpath); +void vnic_link_down(struct vnic *vnic, struct netpath *netpath); + +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath); +void vnic_restart_xmit(struct vnic *vnic, struct netpath *netpath); + +void vnic_recv_packet(struct vnic *vnic, struct netpath *netpath, + struct sk_buff *skb); + +#endif /* VNIC_MAIN_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:53:51 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:23:51 +0530 Subject: [openib-general] [PATCH v2 2/11] Netpath files - abstraction of connection to VEx Message-ID: <455A25D7.11711.61133EF@ramachandra.kuchimanchi.qlogic.com> Adds the driver netpath files. These files implement the netpath layer. Netpath is an abstraction of a connection to the VEx. Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_netpath.c | 112 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/vnic_netpath.h | 77 +++++++++++++++++++ 2 files changed, 189 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_netpath.c b/drivers/infiniband/ulp/vnic/vnic_netpath.c new file mode 100644 index 0000000..8b1bc90 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_netpath.c @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" + +void vnic_npevent_timeout(unsigned long data) +{ + struct netpath *netpath = (struct netpath *)data; + + if (netpath->second_bias) + vnic_npevent_queue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_queue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); +} + +void netpath_timer(struct netpath *netpath, int timeout) +{ + if (netpath->timer_state == NETPATH_TS_ACTIVE) + del_timer_sync(&netpath->timer); + if (timeout) { + init_timer(&netpath->timer); + netpath->timer_state = NETPATH_TS_ACTIVE; + netpath->timer.expires = jiffies + timeout; + netpath->timer.data = (unsigned long)netpath; + netpath->timer.function = vnic_npevent_timeout; + add_timer(&netpath->timer); + } else + vnic_npevent_timeout((unsigned long)netpath); +} + +void netpath_timer_stop(struct netpath *netpath) +{ + if (netpath->timer_state != NETPATH_TS_ACTIVE) + return; + del_timer_sync(&netpath->timer); + if (netpath->second_bias) + vnic_npevent_dequeue_evt(netpath, VNIC_SECNP_TIMEREXPIRED); + else + vnic_npevent_dequeue_evt(netpath, VNIC_PRINP_TIMEREXPIRED); + + netpath->timer_state = NETPATH_TS_IDLE; +} + +void netpath_free(struct netpath *netpath) +{ + if (!netpath->viport) + return; + viport_free(netpath->viport); + netpath->viport = NULL; + sysfs_remove_group(&netpath->class_dev_info.class_dev.kobj, + &vnic_path_attr_group); + class_device_unregister(&netpath->class_dev_info.class_dev); + wait_for_completion(&netpath->class_dev_info.released); +} + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias) +{ + netpath->parent = vnic; + netpath->carrier = 0; + netpath->viport = NULL; + netpath->second_bias = second_bias; + netpath->timer_state = NETPATH_TS_IDLE; + init_timer(&netpath->timer); +} + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath) +{ + if (!netpath) + return "NULL"; + else if (netpath == 
&vnic->primary_path) + return "PRIMARY"; + else if (netpath == &vnic->secondary_path) + return "SECONDARY"; + else + return "UNKNOWN"; +} diff --git a/drivers/infiniband/ulp/vnic/vnic_netpath.h b/drivers/infiniband/ulp/vnic/vnic_netpath.h new file mode 100644 index 0000000..a5ee45c --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_netpath.h @@ -0,0 +1,77 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_NETPATH_H_INCLUDED +#define VNIC_NETPATH_H_INCLUDED + +#include + +#include "vnic_sys.h" + +struct viport; +struct vnic; + +enum netpath_ts { + NETPATH_TS_IDLE = 0, + NETPATH_TS_ACTIVE = 1, + NETPATH_TS_EXPIRED = 2 +}; + +struct netpath { + int carrier; + struct vnic *parent; + struct viport *viport; + size_t path_idx; + u32 connect_time; + int second_bias; + struct timer_list timer; + enum netpath_ts timer_state; + struct class_dev_info class_dev_info; +}; + +void netpath_init(struct netpath *netpath, struct vnic *vnic, + int second_bias); +void netpath_free(struct netpath *netpath); + +void netpath_timer(struct netpath *netpath, int timeout); +void netpath_timer_stop(struct netpath *netpath); + +const char *netpath_to_string(struct vnic *vnic, struct netpath *netpath); + +#define netpath_get_hw_addr(netpath, address) \ + viport_get_hw_addr((netpath)->viport, address) +#define netpath_is_connected(netpath) \ + (netpath->state == NETPATH_CONNECTED) +#define netpath_can_tx_csum(netpath) \ + viport_can_tx_csum(netpath->viport) + +#endif /* VNIC_NETPATH_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:54:21 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:24:21 +0530 Subject: [openib-general] [PATCH v2 3/11] Implementation of communication protocol with VEx Message-ID: <455A25F5.19723.611A8E0@ramachandra.kuchimanchi.qlogic.com> Adds the driver viport files. These files implement the state machine for the communication protocol with the VEx. 
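For readers unfamiliar with this style of state machine, here is a minimal user-space sketch of the general idiom: step through states in a loop and stop once a pass leaves the state unchanged. It is an illustration only, not the driver's code. The names demo_link_state, demo_port and demo_handle_states are hypothetical, and the four states are a reduced, assumed set chosen to keep the example short.

```c
/* Hypothetical, reduced set of link states, for illustration only. */
enum demo_link_state {
	DEMO_LINK_INITIALIZE,
	DEMO_LINK_CONNECT,
	DEMO_LINK_CONNECTED,
	DEMO_LINK_IDLE
};

/* Hypothetical port object; not the driver's struct viport. */
struct demo_port {
	enum demo_link_state link_state;
	int errored;      /* set by the caller when an earlier step failed */
	int transitions;  /* counts state changes, for illustration */
};

/*
 * Run the state machine until one pass leaves the state unchanged.
 * Returns 0 on success, -1 if the current state is not handled here.
 */
static int demo_handle_states(struct demo_port *port)
{
	enum demo_link_state old_state;

	do {
		switch (old_state = port->link_state) {
		case DEMO_LINK_INITIALIZE:
			/* start from a clean slate, then try to connect */
			port->errored = 0;
			port->link_state = DEMO_LINK_CONNECT;
			break;
		case DEMO_LINK_CONNECT:
			/* on error fall back to initialization, else advance */
			if (port->errored)
				port->link_state = DEMO_LINK_INITIALIZE;
			else
				port->link_state = DEMO_LINK_CONNECTED;
			break;
		case DEMO_LINK_CONNECTED:
			port->link_state = DEMO_LINK_IDLE;
			break;
		case DEMO_LINK_IDLE:
			/* stable state: nothing to do until an external event */
			break;
		default:
			return -1;
		}
		if (port->link_state != old_state)
			port->transitions++;
	} while (port->link_state != old_state);

	return 0;
}
```

Starting a demo_port in DEMO_LINK_INITIALIZE and calling demo_handle_states() once walks it through to DEMO_LINK_IDLE, where the loop exits because the state no longer changes. A real driver would block or re-queue the object at that point rather than spin.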
Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_viport.c | 1023 +++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/vnic_viport.h | 165 +++++ 2 files changed, 1188 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_viport.c b/drivers/infiniband/ulp/vnic/vnic_viport.c new file mode 100644 index 0000000..722a18e --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_viport.c @@ -0,0 +1,1023 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_netpath.h" +#include "vnic_control.h" +#include "vnic_data.h" +#include "vnic_config.h" +#include "vnic_control_pkt.h" + +#define VIPORT_DISCONN_TIMER 10000 /*in ms*/ + +DECLARE_WAIT_QUEUE_HEAD(viport_queue); +LIST_HEAD(viport_list); +DECLARE_COMPLETION(viport_thread_exit); +spinlock_t viport_list_lock = SPIN_LOCK_UNLOCKED; + +int viport_thread = -1; +int viport_thread_end = 0; + +struct viport *viport_allocate(struct viport_config *config) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_allocate()\n"); + viport = kzalloc(sizeof *viport, GFP_KERNEL); + if (!viport) { + VIPORT_ERROR("failed allocating viport structure\n"); + return NULL; + } + + viport->state = VIPORT_DISCONNECTED; + viport->link_state = LINK_RETRYWAIT; + viport->connect = WAIT; + viport->new_mtu = 1500; + viport->new_flags = 0; + viport->config = config; + + spin_lock_init(&viport->lock); + init_waitqueue_head(&viport->stats_queue); + init_waitqueue_head(&viport->disconnect_queue); + INIT_LIST_HEAD(&viport->list_ptrs); + + viport_kick(viport); + + return viport; +} + +void viport_connect(struct viport * viport, int delay) +{ + VIPORT_FUNCTION("viport_connect()\n"); + + if (delay) + viport->connect = DELAY; + else + viport->connect = NOW; + + viport_kick(viport); +} + +void viport_disconnect(struct viport *viport) +{ + VIPORT_FUNCTION("viport_disconnect()\n"); + viport->disconnect = 1; + viport_failure(viport); + wait_event(viport->disconnect_queue, viport->disconnect == 0); +} + +void viport_free(struct viport *viport) +{ + VIPORT_FUNCTION("viport_free()\n"); + viport_disconnect(viport); /* NOTE: this can sleep */ + kfree(viport->config); + kfree(viport); +} + +void viport_set_link(struct viport * viport, u16 flags, u16 mtu) +{ + unsigned long localflags; + + VIPORT_FUNCTION("viport_set_link()\n"); + if (mtu > data_max_mtu(&viport->data)) { + 
VIPORT_ERROR("configuration error." + " mtu of %d unsupported by %s\n", mtu, + config_viport_name(viport->config)); + goto failure; + } + + spin_lock_irqsave(&viport->lock, localflags); + flags &= IFF_UP | IFF_ALLMULTI | IFF_PROMISC; + if ((viport->new_flags != flags) + || (viport->new_mtu != mtu)) { + viport->new_flags = flags; + viport->new_mtu = mtu; + viport->updates |= NEED_LINK_CONFIG; + viport_kick(viport); + } + + spin_unlock_irqrestore(&viport->lock, localflags); + return; +failure: + viport_failure(viport); +} + +int viport_set_unicast(struct viport * viport, u8 * address) +{ + unsigned long flags; + int ret = -1; + VIPORT_FUNCTION("viport_set_unicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + if (memcmp(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN)) { + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].operation + = VNIC_OP_SET_ENTRY; + viport->updates |= NEED_ADDRESS_CONFIG; + viport_kick(viport); + } + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +int viport_set_multicast(struct viport * viport, + struct dev_mc_list * mc_list, int mc_count) +{ + u32 old_update_list; + int i; + int ret = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_set_multicast()\n"); + spin_lock_irqsave(&viport->lock, flags); + + if (!viport->mac_addresses) + goto out; + + old_update_list = viport->updates; + if (mc_count > viport->num_mac_addresses - MCAST_ADDR_START) + viport->updates |= NEED_LINK_CONFIG | MCAST_OVERFLOW; + else { + if (viport->updates & MCAST_OVERFLOW) { + viport->updates &= ~MCAST_OVERFLOW; + viport->updates |= NEED_LINK_CONFIG; + } + /* brute force algorithm */ + for (i = MCAST_ADDR_START; + i < mc_count + MCAST_ADDR_START; + i++, mc_list = mc_list->next) { + if (viport->mac_addresses[i].valid && + !memcmp(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN)) + continue; 
+ memcpy(viport->mac_addresses[i].address, + mc_list->dmi_addr, ETH_ALEN); + viport->mac_addresses[i].valid = 1; + viport->mac_addresses[i].operation = + VNIC_OP_SET_ENTRY; + } + for (; i < viport->num_mac_addresses; i++) { + if (!viport->mac_addresses[i].valid) + continue; + viport->mac_addresses[i].valid = 0; + viport->mac_addresses[i].operation = + VNIC_OP_SET_ENTRY; + } + if (mc_count) + viport->updates |= NEED_ADDRESS_CONFIG; + } + + if (viport->updates != old_update_list) + viport_kick(viport); + ret = 0; +out: + spin_unlock_irqrestore(&viport->lock, flags); + return ret; +} + +void viport_get_stats(struct viport * viport, + struct net_device_stats * stats) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_get_stats()\n"); + if (jiffies > viport->last_stats_time + + viport->config->stats_interval) { + spin_lock_irqsave(&viport->lock, flags); + viport->updates |= NEED_STATS; + spin_unlock_irqrestore(&viport->lock, flags); + viport_kick(viport); + wait_event(viport->stats_queue, + !(viport->updates & NEED_STATS)); + + if (viport->stats.ethernet_status) + vnic_link_up(viport->vnic, viport->parent); + else + vnic_link_down(viport->vnic, viport->parent); + } + + stats->rx_packets = be64_to_cpu(viport->stats.if_in_ok); + stats->tx_packets = be64_to_cpu(viport->stats.if_out_ok); + stats->rx_bytes = be64_to_cpu(viport->stats.if_in_octets); + stats->tx_bytes = be64_to_cpu(viport->stats.if_out_octets); + stats->rx_errors = be64_to_cpu(viport->stats.if_in_errors); + stats->tx_errors = be64_to_cpu(viport->stats.if_out_errors); + stats->rx_dropped = 0; /* EIOC doesn't track */ + stats->tx_dropped = 0; /* EIOC doesn't track */ + stats->multicast = be64_to_cpu(viport->stats.if_in_nucast_pkts); + stats->collisions = 0; /* EIOC doesn't track */ +} + +int viport_xmit_packet(struct viport * viport, struct sk_buff * skb) +{ + int status = -1; + unsigned long flags; + + VIPORT_FUNCTION("viport_xmit_packet()\n"); + spin_lock_irqsave(&viport->lock, flags); + if (viport->state 
== VIPORT_CONNECTED) + status = data_xmit_packet(&viport->data, skb); + spin_unlock_irqrestore(&viport->lock, flags); + + return status; +} + +void viport_kick(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_kick()\n"); + spin_lock_irqsave(&viport_list_lock, flags); + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +void viport_failure(struct viport *viport) +{ + unsigned long flags; + + VIPORT_FUNCTION("viport_failure()\n"); + spin_lock_irqsave(&viport_list_lock, flags); + viport->errored = 1; + if (list_empty(&viport->list_ptrs)) { + list_add_tail(&viport->list_ptrs, &viport_list); + wake_up(&viport_queue); + } + spin_unlock_irqrestore(&viport_list_lock, flags); +} + +static void viport_timeout(unsigned long data) +{ + struct viport *viport; + + VIPORT_FUNCTION("viport_timeout()\n"); + viport = (struct viport *)data; + viport->timer_active = 0; + viport_kick(viport); +} + +static void viport_timer(struct viport *viport, int timeout) +{ + VIPORT_FUNCTION("viport_timer()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + init_timer(&viport->timer); + viport->timer.expires = jiffies + timeout; + viport->timer.data = (unsigned long)viport; + viport->timer.function = viport_timeout; + viport->timer_active = 1; + add_timer(&viport->timer); +} + +static void viport_timer_stop(struct viport *viport) +{ + VIPORT_FUNCTION("viport_timer_stop()\n"); + if (viport->timer_active) + del_timer(&viport->timer); + viport->timer_active = 0; +} + +static int viport_init_mac_addresses(struct viport *viport) +{ + struct vnic_address_op *temp; + unsigned long flags; + int i; + + VIPORT_FUNCTION("viport_init_mac_addresses()\n"); + i = viport->num_mac_addresses * sizeof *temp; + temp = kzalloc(viport->num_mac_addresses * sizeof *temp, + GFP_KERNEL); + if (!temp) { + VIPORT_ERROR("failed allocating MAC address table\n"); + 
return -ENOMEM; + } + + spin_lock_irqsave(&viport->lock, flags); + viport->mac_addresses = temp; + for (i = 0; i < viport->num_mac_addresses; i++) { + viport->mac_addresses[i].index = cpu_to_be16(i); + viport->mac_addresses[i].vlan = + cpu_to_be16(viport->default_vlan); + } + memset(viport->mac_addresses[BROADCAST_ADDR].address, + 0xFF, ETH_ALEN); + viport->mac_addresses[BROADCAST_ADDR].valid = 1; + memcpy(viport->mac_addresses[UNICAST_ADDR].address, + viport->hw_mac_address, ETH_ALEN); + viport->mac_addresses[UNICAST_ADDR].valid = 1; + + spin_unlock_irqrestore(&viport->lock, flags); + + return 0; +} + +static int viport_handle_init_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_UNINITIALIZED: + LINK_STATE("state LINK_UNINITIALIZED\n"); + viport->updates = 0; + wake_up(&viport->stats_queue); + /* in case of going to + * uninitialized put this viport + * back on the serviceQ, delete + * it off again. + */ + spin_lock_irq(&viport_list_lock); + list_del_init(&viport->list_ptrs); + spin_unlock_irq(&viport_list_lock); + viport->disconnect = 0; + wake_up(&viport->disconnect_queue); + break; + case LINK_INITIALIZE: + LINK_STATE("state LINK_INITIALIZE\n"); + viport->errored = 0; + viport->connect = WAIT; + viport->last_stats_time = 0; + if (viport->disconnect) + viport->link_state = LINK_UNINITIALIZED; + else + viport->link_state = LINK_INITIALIZECONTROL; + break; + case LINK_INITIALIZECONTROL: + LINK_STATE("state LINK_INITIALIZECONTROL\n"); + viport->pd = ib_alloc_pd(viport->config->ibdev); + if (IS_ERR(viport->pd)) + viport->link_state = LINK_DISCONNECTED; + else if (control_init(&viport->control, viport, + &viport->config->control_config, + viport->pd)) { + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + + } + else + viport->link_state = LINK_INITIALIZEDATA; + break; + case LINK_INITIALIZEDATA: + LINK_STATE("state LINK_INITIALIZEDATA\n"); + if (data_init(&viport->data, 
viport, + &viport->config->data_config, + viport->pd)) + viport->link_state = LINK_CLEANUPCONTROL; + else + viport->link_state = LINK_CONTROLCONNECT; + break; + default: + return -1; + } + } while (viport->link_state != old_state); + + return 0; +} + +static int viport_handle_control_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_CONTROLCONNECT: + init_completion(&(viport->control.ib_conn.done)); + if (vnic_ib_cm_connect(&viport->control.ib_conn)) + viport->link_state = LINK_CLEANUPDATA; + else + viport->link_state = LINK_CONTROLCONNECTWAIT; + break; + case LINK_CONTROLCONNECTWAIT: + LINK_STATE("state LINK_CONTROLCONNECTWAIT\n"); + wait_for_completion(&(viport->control.ib_conn.done)); + if (control_is_connected(&viport->control)) + viport->link_state = LINK_INITVNICREQ; + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + case LINK_INITVNICREQ: + LINK_STATE("state LINK_INITVNICREQ\n"); + if (control_init_vnic_req(&viport->control)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_INITVNICRSP; + break; + case LINK_INITVNICRSP: + LINK_STATE("state LINK_INITVNICRSP\n"); + control_process_async(&viport->control); + + if (!control_init_vnic_rsp(&viport->control, + &viport->features_supported, + viport->hw_mac_address, + &viport->num_mac_addresses, + &viport->default_vlan)) { + if (viport_init_mac_addresses(viport)) + viport->link_state = + LINK_RESETCONTROL; + else + viport->link_state = + LINK_BEGINDATAPATH; + } + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_data_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_BEGINDATAPATH: + LINK_STATE("state 
LINK_BEGINDATAPATH\n"); + viport->link_state = LINK_CONFIGDATAPATHREQ; + break; + case LINK_CONFIGDATAPATHREQ: + LINK_STATE("state LINK_CONFIGDATAPATHREQ\n"); + if (control_config_data_path_req(&viport->control, + data_path_id(&viport-> + data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data))) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_CONFIGDATAPATHRSP; + break; + case LINK_CONFIGDATAPATHRSP: + LINK_STATE("state LINK_CONFIGDATAPATHRSP\n"); + control_process_async(&viport->control); + + if (!control_config_data_path_rsp(&viport->control, + data_host_pool + (&viport->data), + data_eioc_pool + (&viport->data), + data_host_pool_max + (&viport->data), + data_eioc_pool_max + (&viport->data), + data_host_pool_min + (&viport->data), + data_eioc_pool_min + (&viport->data))) + viport->link_state = LINK_DATACONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESETCONTROL; + } + break; + case LINK_DATACONNECT: + LINK_STATE("state LINK_DATACONNECT\n"); + init_completion(&viport->data.ib_conn.done); + if (data_connect(&viport->data)) + viport->link_state = LINK_RESETCONTROL; + else + viport->link_state = LINK_DATACONNECTWAIT; + break; + case LINK_DATACONNECTWAIT: + LINK_STATE("state LINK_DATACONNECTWAIT\n"); + wait_for_completion(&viport->data.ib_conn.done); + control_process_async(&viport->control); + if (data_is_connected(&viport->data)) + viport->link_state = LINK_XCHGPOOLREQ; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_xchgpool_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_XCHGPOOLREQ: + LINK_STATE("state LINK_XCHGPOOLREQ\n"); + if (control_exchange_pools_req(&viport->control, + data_local_pool_addr + (&viport->data), + 
data_local_pool_rkey + (&viport->data))) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_XCHGPOOLRSP; + break; + case LINK_XCHGPOOLRSP: + LINK_STATE("state LINK_XCHGPOOLRSP\n"); + control_process_async(&viport->control); + + if (!control_exchange_pools_rsp(&viport->control, + data_remote_pool_addr + (&viport->data), + data_remote_pool_rkey + (&viport->data))) + viport->link_state = LINK_INITIALIZED; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_INITIALIZED: + LINK_STATE("state LINK_INITIALIZED\n"); + viport->state = VIPORT_CONNECTED; + printk(KERN_INFO PFX + "%s: connection established\n", + config_viport_name(viport->config)); + data_connected(&viport->data); + vnic_connected(viport->parent->parent, + viport->parent); + spin_lock_irq(&viport->lock); + viport->mtu = 1500; + viport->flags = 0; + if ((viport->mtu != viport->new_mtu) || + (viport->flags != viport->new_flags)) + viport->updates |= NEED_LINK_CONFIG; + spin_unlock_irq(&viport->lock); + viport->link_state = LINK_IDLE; + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_idle_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_IDLE: + LINK_STATE("state LINK_IDLE\n"); + if (viport->config->hb_interval) + viport_timer(viport, + viport->config->hb_interval); + viport->link_state = LINK_IDLING; + break; + case LINK_IDLING: + LINK_STATE("state LINK_IDLING\n"); + control_process_async(&viport->control); + if (viport->errored) { + viport_timer_stop(viport); + viport->errored = 0; + viport->link_state = LINK_RESET; + break; + } + + spin_lock_irq(&viport->lock); + if (viport->updates & NEED_LINK_CONFIG) { + viport_timer_stop(viport); + viport->link_state = LINK_CONFIGLINKREQ; + } else if (viport->updates & NEED_ADDRESS_CONFIG) { + viport_timer_stop(viport); + viport->link_state = 
LINK_CONFIGADDRSREQ; + } else if (viport->updates & NEED_STATS) { + viport_timer_stop(viport); + viport->link_state = LINK_REPORTSTATREQ; + } else if (viport->config->hb_interval) { + if (!viport->timer_active) + viport->link_state = + LINK_HEARTBEATREQ; + } + spin_unlock_irq(&viport->lock); + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_config_states(struct viport *viport) +{ + enum link_state old_state; + int res; + + do { + switch(old_state = viport->link_state) { + case LINK_CONFIGLINKREQ: + LINK_STATE("state LINK_CONFIGLINKREQ\n"); + spin_lock_irq(&viport->lock); + viport->updates &= ~NEED_LINK_CONFIG; + viport->flags = viport->new_flags; + if (viport->updates & MCAST_OVERFLOW) + viport->flags |= IFF_ALLMULTI; + viport->mtu = viport->new_mtu; + spin_unlock_irq(&viport->lock); + if (control_config_link_req(&viport->control, + viport->flags, + viport->mtu)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_CONFIGLINKRSP; + break; + case LINK_CONFIGLINKRSP: + LINK_STATE("state LINK_CONFIGLINKRSP\n"); + control_process_async(&viport->control); + + if (!control_config_link_rsp(&viport->control, + &viport->flags, + &viport->mtu)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + case LINK_CONFIGADDRSREQ: + LINK_STATE("state LINK_CONFIGADDRSREQ\n"); + + spin_lock_irq(&viport->lock); + res = control_config_addrs_req(&viport->control, + viport->mac_addresses, + viport-> + num_mac_addresses); + + if (res > 0) { + viport->updates &= ~NEED_ADDRESS_CONFIG; + viport->link_state = LINK_CONFIGADDRSRSP; + } else if (res == 0) + viport->link_state = LINK_CONFIGADDRSRSP; + else + viport->link_state = LINK_RESET; + spin_unlock_irq(&viport->lock); + break; + case LINK_CONFIGADDRSRSP: + LINK_STATE("state LINK_CONFIGADDRSRSP\n"); + control_process_async(&viport->control); + + if 
(!control_config_addrs_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_stat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_REPORTSTATREQ: + LINK_STATE("state LINK_REPORTSTATREQ\n"); + if (control_report_statistics_req(&viport->control)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_REPORTSTATRSP; + break; + case LINK_REPORTSTATRSP: + LINK_STATE("state LINK_REPORTSTATRSP\n"); + control_process_async(&viport->control); + + spin_lock_irq(&viport->lock); + if (control_report_statistics_rsp(&viport->control, + &viport->stats) == 0) { + viport->updates &= ~NEED_STATS; + viport->last_stats_time = jiffies; + wake_up(&viport->stats_queue); + viport->link_state = LINK_IDLE; + } + + spin_unlock_irq(&viport->lock); + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_heartbeat_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_HEARTBEATREQ: + LINK_STATE("state LINK_HEARTBEATREQ\n"); + if (control_heartbeat_req(&viport->control, + viport->config->hb_timeout)) + viport->link_state = LINK_RESET; + else + viport->link_state = LINK_HEARTBEATRSP; + break; + case LINK_HEARTBEATRSP: + LINK_STATE("state LINK_HEARTBEATRSP\n"); + control_process_async(&viport->control); + + if (!control_heartbeat_rsp(&viport->control)) + viport->link_state = LINK_IDLE; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_RESET; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; 
+} + +static int viport_handle_reset_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_RESET: + LINK_STATE("state LINK_RESET\n"); + viport->errored = 0; + spin_lock_irq(&viport->lock); + viport->state = VIPORT_DISCONNECTED; + spin_unlock_irq(&viport->lock); + vnic_link_down(viport->vnic, viport->parent); + printk(KERN_INFO PFX + "%s: connection lost\n", + config_viport_name(viport->config)); + if (control_reset_req(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + else + viport->link_state = LINK_RESETRSP; + break; + case LINK_RESETRSP: + LINK_STATE("state LINK_RESETRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_DATADISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_DATADISCONNECT; + } + break; + case LINK_RESETCONTROL: + LINK_STATE("state LINK_RESETCONTROL\n"); + if (control_reset_req(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + else + viport->link_state = LINK_RESETCONTROLRSP; + break; + case LINK_RESETCONTROLRSP: + LINK_STATE("state LINK_RESETCONTROLRSP\n"); + control_process_async(&viport->control); + + if (!control_reset_rsp(&viport->control)) + viport->link_state = LINK_CONTROLDISCONNECT; + + if (viport->errored) { + viport->errored = 0; + viport->link_state = LINK_CONTROLDISCONNECT; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_handle_disconn_states(struct viport *viport) +{ + enum link_state old_state; + + do { + switch(old_state = viport->link_state) { + case LINK_DATADISCONNECT: + LINK_STATE("state LINK_DATADISCONNECT\n"); + data_disconnect(&viport->data); + viport->link_state = LINK_CONTROLDISCONNECT; + break; + case LINK_CONTROLDISCONNECT: + LINK_STATE("state LINK_CONTROLDISCONNECT\n"); + viport->link_state = LINK_CLEANUPDATA; + break; + case 
LINK_CLEANUPDATA: + LINK_STATE("state LINK_CLEANUPDATA\n"); + data_cleanup(&viport->data); + viport->link_state = LINK_CLEANUPCONTROL; + break; + case LINK_CLEANUPCONTROL: + LINK_STATE("state LINK_CLEANUPCONTROL\n"); + spin_lock_irq(&viport->lock); + if (viport->mac_addresses) { + kfree(viport->mac_addresses); + viport->mac_addresses = NULL; + } + spin_unlock_irq(&viport->lock); + control_cleanup(&viport->control); + ib_dealloc_pd(viport->pd); + viport->link_state = LINK_DISCONNECTED; + break; + case LINK_DISCONNECTED: + LINK_STATE("state LINK_DISCONNECTED\n"); + vnic_disconnected(viport->parent->parent, + viport->parent); + if (viport->disconnect != 0) + viport->link_state = LINK_UNINITIALIZED; + else { + viport_timer(viport, + msecs_to_jiffies(VIPORT_DISCONN_TIMER)); + viport->link_state = LINK_RETRYWAIT; + } + break; + case LINK_RETRYWAIT: + LINK_STATE("state LINK_RETRYWAIT\n"); + viport->stats.ethernet_status = 0; + viport->updates = 0; + wake_up(&viport->stats_queue); + if (viport->disconnect != 0) { + viport_timer_stop(viport); + viport->link_state = LINK_UNINITIALIZED; + } else if (viport->connect == DELAY) { + if (!viport->timer_active) { + viport->link_state = LINK_INITIALIZE; + } + } else if (viport->connect == NOW) { + viport_timer_stop(viport); + viport->link_state = LINK_INITIALIZE; + } + break; + default: + return -1; + } + } while(viport->link_state != old_state); + + return 0; +} + +static int viport_statemachine(void *context) +{ + struct viport *viport; + enum link_state old_link_state; + + VIPORT_FUNCTION("viport_statemachine()\n"); + daemonize("vnic_viport"); + while (!viport_thread_end || !list_empty(&viport_list)) { + wait_event_interruptible(viport_queue, + !list_empty(&viport_list) + || viport_thread_end); + spin_lock_irq(&viport_list_lock); + if (list_empty(&viport_list)) { + spin_unlock_irq(&viport_list_lock); + continue; + } + viport = list_entry(viport_list.next, struct viport, + list_ptrs); + list_del_init(&viport->list_ptrs); + 
spin_unlock_irq(&viport_list_lock); + + do { + old_link_state = viport->link_state; + + /* + * Optimize for the state machine steady state + * by checking for the most common states first. + * + */ + if (viport_handle_idle_states(viport) == 0) + break; + if (viport_handle_heartbeat_states(viport) == 0) + break; + if (viport_handle_stat_states(viport) == 0) + break; + if (viport_handle_config_states(viport) == 0) + break; + + if (viport_handle_init_states(viport) == 0) + break; + if (viport_handle_control_states(viport) == 0) + break; + if (viport_handle_data_states(viport) == 0) + break; + if (viport_handle_xchgpool_states(viport) == 0) + break; + if (viport_handle_reset_states(viport) == 0) + break; + if (viport_handle_disconn_states(viport) == 0) + break; + } while (viport->link_state != old_link_state); + } + + complete_and_exit(&viport_thread_exit, 0); +} + +int viport_start(void) +{ + VIPORT_FUNCTION("viport_start()\n"); + + viport_thread = kernel_thread(viport_statemachine, NULL, 0); + if (viport_thread < 0) { + printk(KERN_WARNING PFX "Could not create viport_thread;" + " error %d\n", viport_thread); + return viport_thread; + } + + return 0; +} + +void viport_cleanup(void) +{ + VIPORT_FUNCTION("viport_cleanup()\n"); + if (viport_thread > 0) { + viport_thread_end = 1; + wake_up(&viport_queue); + wait_for_completion(&viport_thread_exit); + } +} diff --git a/drivers/infiniband/ulp/vnic/vnic_viport.h b/drivers/infiniband/ulp/vnic/vnic_viport.h new file mode 100644 index 0000000..4a3060a --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_viport.h @@ -0,0 +1,165 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */
+
+#ifndef VNIC_VIPORT_H_INCLUDED
+#define VNIC_VIPORT_H_INCLUDED
+
+#include "vnic_control.h"
+#include "vnic_data.h"
+
+enum viport_state {
+	VIPORT_DISCONNECTED = 0,
+	VIPORT_CONNECTED = 1
+};
+
+enum link_state {
+	LINK_UNINITIALIZED = 0,
+	LINK_INITIALIZE = 1,
+	LINK_INITIALIZECONTROL = 2,
+	LINK_INITIALIZEDATA = 3,
+	LINK_CONTROLCONNECT = 4,
+	LINK_CONTROLCONNECTWAIT = 5,
+	LINK_INITVNICREQ = 6,
+	LINK_INITVNICRSP = 7,
+	LINK_BEGINDATAPATH = 8,
+	LINK_CONFIGDATAPATHREQ = 9,
+	LINK_CONFIGDATAPATHRSP = 10,
+	LINK_DATACONNECT = 11,
+	LINK_DATACONNECTWAIT = 12,
+	LINK_XCHGPOOLREQ = 13,
+	LINK_XCHGPOOLRSP = 14,
+	LINK_INITIALIZED = 15,
+	LINK_IDLE = 16,
+	LINK_IDLING = 17,
+	LINK_CONFIGLINKREQ = 18,
+	LINK_CONFIGLINKRSP = 19,
+	LINK_CONFIGADDRSREQ = 20,
+	LINK_CONFIGADDRSRSP = 21,
+	LINK_REPORTSTATREQ = 22,
+	LINK_REPORTSTATRSP = 23,
+	LINK_HEARTBEATREQ = 24,
+	LINK_HEARTBEATRSP = 25,
+	LINK_RESET = 26,
+	LINK_RESETRSP = 27,
+	LINK_RESETCONTROL = 28,
+	LINK_RESETCONTROLRSP = 29,
+	LINK_DATADISCONNECT = 30,
+	LINK_CONTROLDISCONNECT = 31,
+	LINK_CLEANUPDATA = 32,
+	LINK_CLEANUPCONTROL = 33,
+	LINK_DISCONNECTED = 34,
+	LINK_RETRYWAIT = 35
+};
+
+enum {
+	BROADCAST_ADDR = 0,
+	UNICAST_ADDR = 1,
+	MCAST_ADDR_START = 2
+};
+
+#define current_mac_address mac_addresses[UNICAST_ADDR].address
+
+enum {
+	NEED_STATS = 0x00000001,
+	NEED_ADDRESS_CONFIG = 0x00000002,
+	NEED_LINK_CONFIG = 0x00000004,
+	MCAST_OVERFLOW = 0x00000008
+};
+
+struct viport {
+	struct list_head list_ptrs;
+	struct netpath *parent;
+	struct vnic *vnic;
+	struct viport_config *config;
+	struct control control;
+	struct data data;
+	spinlock_t lock;
+	struct ib_pd *pd;
+	enum viport_state state;
+	enum link_state link_state;
+	struct vnic_cmd_report_stats_rsp stats;
+	wait_queue_head_t stats_queue;
+	u32 last_stats_time;
+	u32 features_supported;
+	u8 hw_mac_address[ETH_ALEN];
+	u16 default_vlan;
+	u16 num_mac_addresses;
+	struct vnic_address_op *mac_addresses;
+	u32 updates;
+	u16 flags;
+	u16 new_flags;
+	u16 mtu;
+	u16 new_mtu;
+	u32 errored;
+	enum { WAIT, DELAY, NOW } connect;
+	u32 disconnect;
+	wait_queue_head_t disconnect_queue;
+	int timer_active;
+	struct timer_list timer;
+};
+
+int viport_start(void);
+void viport_cleanup(void);
+
+struct viport *viport_allocate(struct viport_config *config);
+void viport_free(struct viport *viport);
+
+void viport_connect(struct viport *viport, int delay);
+void viport_disconnect(struct viport *viport);
+
+void viport_set_link(struct viport *viport, u16 flags, u16 mtu);
+void viport_get_stats(struct viport *viport,
+		      struct net_device_stats *stats);
+int viport_xmit_packet(struct viport *viport, struct sk_buff *skb);
+void viport_kick(struct viport *viport);
+
+void viport_failure(struct viport *viport);
+
+int viport_set_unicast(struct viport *viport, u8 *address);
+int viport_set_multicast(struct viport *viport,
+			 struct dev_mc_list *mc_list,
+			 int mc_count);
+
+#define viport_max_mtu(viport) data_max_mtu(&(viport)->data)
+
+#define viport_get_hw_addr(viport, address) \
+	memcpy(address, (viport)->hw_mac_address, ETH_ALEN)
+
+#define viport_features(viport) ((viport)->features_supported)
+
+#define viport_can_tx_csum(viport) \
+	(((viport)->features_supported & \
+	(VNIC_FEAT_IPV4_CSUM_TX | VNIC_FEAT_TCP_CSUM_TX | \
+	VNIC_FEAT_UDP_CSUM_TX)) == (VNIC_FEAT_IPV4_CSUM_TX | \
+	VNIC_FEAT_TCP_CSUM_TX | VNIC_FEAT_UDP_CSUM_TX))
+
+#endif /* VNIC_VIPORT_H_INCLUDED */

From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:54:48 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Tue, 14 Nov 2006 20:24:48 +0530
Subject: [openib-general] [PATCH v2 4/11] Implementation of Control path of the communication protocol
Message-ID: <455A2610.4157.6121120@ramachandra.kuchimanchi.qlogic.com>

Adds the files that define the control packet formats and implement
the various control messages that are exchanged as part of the
communication protocol with the VEx.
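
For readers following the control path, the request/response pairing that vnic_control.c performs can be sketched in isolation. This is only an illustrative userspace sketch, not code from the patch: the header field names (pkt_type, pkt_cmd, pkt_seq_num, pkt_retry_count) follow the patch, but the struct layout, the numeric TYPE_* values, and every sketch_* name are assumptions made for the example. It also folds the setting of rsp_expected into header initialization, which the patch does in the individual *_req() functions.

```c
#include <assert.h>
#include <stdint.h>

/* Packet types named in the patch; the numeric values are placeholders. */
enum { TYPE_INFO = 1, TYPE_REQ, TYPE_RSP, TYPE_ERR };

/* Simplified header: field names follow the patch, the layout is assumed. */
struct sketch_hdr {
	uint8_t pkt_type;
	uint8_t pkt_cmd;
	uint8_t pkt_seq_num;
	uint8_t pkt_retry_count;
};

/* Hypothetical stand-in for the bookkeeping fields of struct control. */
struct sketch_control {
	uint8_t seq_num;		/* sequence number of the last request */
	uint8_t rsp_expected;		/* command awaiting a response, 0 if none */
	uint8_t req_retry_counter;	/* retries used for the current request */
};

/* Fill in a request header with a fresh sequence number. */
static void sketch_init_hdr(struct sketch_control *ctl,
			    struct sketch_hdr *hdr, uint8_t cmd)
{
	assert(ctl && hdr);
	hdr->pkt_type = TYPE_REQ;
	hdr->pkt_cmd = cmd;
	hdr->pkt_seq_num = ++ctl->seq_num;
	ctl->req_retry_counter = 0;
	hdr->pkt_retry_count = ctl->req_retry_counter;
	ctl->rsp_expected = cmd;
}

/*
 * Return 1 if the packet is the response currently waited for:
 * it must be a TYPE_RSP, a response must be expected, and the
 * sequence number must match the outstanding request.
 */
static int sketch_accept_rsp(struct sketch_control *ctl,
			     const struct sketch_hdr *hdr)
{
	assert(ctl && hdr);
	if (hdr->pkt_type != TYPE_RSP)
		return 0;
	if (!ctl->rsp_expected || hdr->pkt_seq_num != ctl->seq_num)
		return 0;
	ctl->rsp_expected = 0;
	return 1;
}
```

A stale or duplicate response is rejected by the sequence-number check, which in the patch corresponds to the receive buffer being reposted without waking the state machine.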

Signed-off-by: Ramachandra K
---
 drivers/infiniband/ulp/vnic/vnic_control.c     | 1953 ++++++++++++++++++++++++
 drivers/infiniband/ulp/vnic/vnic_control.h     |  146 ++
 drivers/infiniband/ulp/vnic/vnic_control_pkt.h |  292 ++++
 3 files changed, 2391 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/vnic/vnic_control.c b/drivers/infiniband/ulp/vnic/vnic_control.c
new file mode 100644
index 0000000..920ab90
--- /dev/null
+++ b/drivers/infiniband/ulp/vnic/vnic_control.c
@@ -0,0 +1,1953 @@
+/*
+ * Copyright (c) 2006 QLogic, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */ + +#include +#include + +#include "vnic_util.h" +#include "vnic_main.h" +#include "vnic_viport.h" +#include "vnic_control.h" +#include "vnic_config.h" +#include "vnic_control_pkt.h" +#include "vnic_stats.h" + +static void control_log_control_packet(struct vnic_control_packet *pkt); + +static inline char *control_ifcfg_name(struct control *control) +{ + if (!control) + return "nctl"; + if (!control->parent) + return "np"; + if (!control->parent->parent) + return "npp"; + if (!control->parent->parent->parent) + return "nppp"; + if (!control->parent->parent->parent->config) + return "npppc"; + return (control->parent->parent->parent->config->name); +} + +static void control_recv(struct control *control, struct recv_io *recv_io) +{ + if (vnic_ib_post_recv(&control->ib_conn, &recv_io->io)) + viport_failure(control->parent); +} + +static void control_recv_complete(struct io *io) +{ + struct recv_io *recv_io = (struct recv_io *)io; + struct recv_io *last_recv_io; + struct control *control = &io->viport->control; + struct vnic_control_packet *pkt = control_packet(recv_io); + struct vnic_control_header *c_hdr = &pkt->hdr; + unsigned long flags; + cycles_t response_time; + + CONTROL_FUNCTION("%s: control_recv_complete()\n", + control_ifcfg_name(control)); + + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + control_note_rsptime_stats(&response_time); + CONTROL_PACKET(pkt); + spin_lock_irqsave(&control->io_lock, flags); + if (c_hdr->pkt_type == TYPE_INFO) { + last_recv_io = control->info; + control->info = recv_io; + spin_unlock_irqrestore(&control->io_lock, flags); + viport_kick(control->parent); + if (last_recv_io) + control_recv(control, last_recv_io); + } else if (c_hdr->pkt_type == TYPE_RSP) { + if (control->rsp_expected + && (c_hdr->pkt_seq_num == control->seq_num)) { + control->response = recv_io; + control->rsp_expected = 0; + spin_unlock_irqrestore(&control->io_lock, flags); + 
control_update_rsptime_stats(control, + response_time); + viport_kick(control->parent); + } else { + spin_unlock_irqrestore(&control->io_lock, flags); + control_recv(control, recv_io); + } + } else { + list_add_tail(&recv_io->io.list_ptrs, + &control->failure_list); + spin_unlock_irqrestore(&control->io_lock, flags); + viport_kick(control->parent); + } + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); +} + +static void control_timeout(unsigned long data) +{ + struct control *control; + + control = (struct control *)data; + CONTROL_FUNCTION("%s: control_timeout()\n", + control_ifcfg_name(control)); + control->timer_state = TIMER_EXPIRED; + control->rsp_expected = 0; + viport_kick(control->parent); +} + +static void control_timer(struct control *control, int timeout) +{ + CONTROL_FUNCTION("%s: control_timer()\n", + control_ifcfg_name(control)); + if (control->timer_state == TIMER_ACTIVE) + mod_timer(&control->timer, jiffies + timeout); + else { + init_timer(&control->timer); + control->timer.expires = jiffies + timeout; + control->timer.data = (unsigned long)control; + control->timer.function = control_timeout; + control->timer_state = TIMER_ACTIVE; + add_timer(&control->timer); + } +} + +static void control_timer_stop(struct control *control) +{ + CONTROL_FUNCTION("%s: control_timer_stop()\n", + control_ifcfg_name(control)); + if (control->timer_state == TIMER_ACTIVE) + del_timer_sync(&control->timer); + + control->timer_state = TIMER_IDLE; +} + +static int control_send(struct control *control, struct send_io *send_io) +{ + CONTROL_FUNCTION("%s: control_send()\n", + control_ifcfg_name(control)); + if (control->req_outstanding) { + CONTROL_ERROR("%s: IB send never completed\n", + control_ifcfg_name(control)); + goto out; + } + + control->req_outstanding = 1; + control_timer(control, control->config->rsp_timeout); + control_note_reqtime_stats(control); + if 
(vnic_ib_post_send(&control->ib_conn, &control->send_io.io)) { + CONTROL_ERROR("failed to post send\n"); + control->req_outstanding = 0; + goto out; + } + + return 0; +out: + viport_failure(control->parent); + return -1; + +} + +static void control_send_complete(struct io *io) +{ + struct control *control = &io->viport->control; + + CONTROL_FUNCTION("%s: control_send_complete()\n", + control_ifcfg_name(control)); + control->req_outstanding = 0; +} + +void control_process_async(struct control *control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + unsigned long flags; + + CONTROL_FUNCTION("%s: control_process_async()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + spin_lock_irqsave(&control->io_lock, flags); + recv_io = control->info; + if (recv_io) { + CONTROL_INFO("%s: processing info packet\n", + control_ifcfg_name(control)); + control->info = NULL; + spin_unlock_irqrestore(&control->io_lock, flags); + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd == CMD_REPORT_STATUS) { + u32 status; + status = + be32_to_cpu(pkt->cmd.report_status.status_number); + switch (status) { + case VNIC_STATUS_LINK_UP: + CONTROL_INFO("%s: link up\n", + control_ifcfg_name(control)); + vnic_link_up(control->parent->vnic, + control->parent->parent); + break; + case VNIC_STATUS_LINK_DOWN: + CONTROL_INFO("%s: link down\n", + control_ifcfg_name(control)); + vnic_link_down(control->parent->vnic, + control->parent->parent); + break; + default: + CONTROL_ERROR("%s: asynchronous status" + " received from EIOC\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + break; + } + } + if ((pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) || + pkt->cmd.report_status.is_fatal) { + viport_failure(control->parent); + } + control_recv(control, recv_io); + spin_lock_irqsave(&control->io_lock, flags); + } + + while (!list_empty(&control->failure_list)) 
{ + CONTROL_INFO("%s: processing error packet\n", + control_ifcfg_name(control)); + recv_io = (struct recv_io *) + list_entry(control->failure_list.next, struct io, + list_ptrs); + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&control->io_lock, flags); + pkt = control_packet(recv_io); + CONTROL_ERROR("%s: asynchronous error received from EIOC\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + if ((pkt->hdr.pkt_type != TYPE_ERR) + || (pkt->hdr.pkt_cmd != CMD_REPORT_STATUS) + || pkt->cmd.report_status.is_fatal) { + viport_failure(control->parent); + } + control_recv(control, recv_io); + spin_lock_irqsave(&control->io_lock, flags); + } + spin_unlock_irqrestore(&control->io_lock, flags); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + CONTROL_INFO("%s: done control_process_async\n", + control_ifcfg_name(control)); +} + +static struct send_io *control_init_hdr(struct control *control, u8 cmd) +{ + struct control_config *config; + struct vnic_control_packet *pkt; + struct vnic_control_header *hdr; + + CONTROL_FUNCTION("control_init_hdr()\n"); + config = control->config; + + pkt = control_packet(&control->send_io); + hdr = &pkt->hdr; + + hdr->pkt_type = TYPE_REQ; + hdr->pkt_cmd = cmd; + control->seq_num++; + hdr->pkt_seq_num = control->seq_num; + control->req_retry_counter = 0; + hdr->pkt_retry_count = control->req_retry_counter; + + return &control->send_io; +} + +static struct recv_io *control_get_rsp(struct control *control) +{ + struct recv_io *recv_io; + unsigned long flags; + + CONTROL_FUNCTION("%s: control_get_rsp()\n", + control_ifcfg_name(control)); + spin_lock_irqsave(&control->io_lock, flags); + recv_io = control->response; + if (recv_io) { + control_timer_stop(control); + control->response = NULL; + spin_unlock_irqrestore(&control->io_lock, flags); + return recv_io; + } + spin_unlock_irqrestore(&control->io_lock, flags); + if 
(control->timer_state == TIMER_EXPIRED) { + struct vnic_control_packet *pkt = + control_packet(&control->send_io); + struct vnic_control_header *hdr = &pkt->hdr; + + control->timer_state = TIMER_IDLE; + CONTROL_ERROR("%s: no response received from EIOC\n", + control_ifcfg_name(control)); + control_timeout_stats(control); + control->req_retry_counter++; + if (control->req_retry_counter >= + control->config->req_retry_count) { + CONTROL_ERROR("%s: control packet retry exceeded\n", + control_ifcfg_name(control)); + viport_failure(control->parent); + } else { + hdr->pkt_retry_count = + control->req_retry_counter; + control_send(control, &control->send_io); + } + } + + return NULL; +} + +int control_init_vnic_req(struct control *control) +{ + struct send_io *send_io; + struct control_config *config = control->config; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_req *init_vnic_req; + + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_INIT_VNIC); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + init_vnic_req = &pkt->cmd.init_vnic_req; + init_vnic_req->vnic_major_version = + __constant_cpu_to_be16(VNIC_MAJORVERSION); + init_vnic_req->vnic_minor_version = + __constant_cpu_to_be16(VNIC_MINORVERSION); + init_vnic_req->vnic_instance = config->vnic_instance; + init_vnic_req->num_data_paths = 1; + init_vnic_req->num_address_entries = + cpu_to_be16(config->max_address_entries); + + CONTROL_PACKET(pkt); + + control->rsp_expected = pkt->hdr.pkt_cmd; + + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +static int control_chk_vnic_rsp_values(struct 
control *control, + u16 *num_addrs, + u8 num_data_paths, + u8 num_lan_switches) +{ + + struct control_config *config = control->config; + + if ((control->maj_ver > VNIC_MAJORVERSION) + || ((control->maj_ver == VNIC_MAJORVERSION) + && (control->min_ver > VNIC_MINORVERSION))) { + CONTROL_ERROR("%s: unsupported version\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_data_paths != 1) { + CONTROL_ERROR("%s: EIOC returned too many datapaths\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs > config->max_address_entries) { + CONTROL_ERROR("%s: EIOC returned more address" + " entries than requested\n", + control_ifcfg_name(control)); + goto failure; + } + if (*num_addrs < config->min_address_entries) { + CONTROL_ERROR("%s: not enough address entries\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches < 1) { + CONTROL_ERROR("%s: EIOC returned no lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + if (num_lan_switches > 1) { + CONTROL_ERROR("%s: EIOC returned multiple lan switches\n", + control_ifcfg_name(control)); + goto failure; + } + + return 0; +failure: + return -1; +} + +int control_init_vnic_rsp(struct control *control, u32 *features, + u8 *mac_address, u16 *num_addrs, u16 *vlan) +{ + u8 num_data_paths; + u8 num_lan_switches; + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_init_vnic_rsp *init_vnic_rsp; + + + CONTROL_FUNCTION("%s: control_init_vnic_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_INIT_VNIC) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + 
control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + + init_vnic_rsp = &pkt->cmd.init_vnic_rsp; + control->maj_ver = be16_to_cpu(init_vnic_rsp->vnic_major_version); + control->min_ver = be16_to_cpu(init_vnic_rsp->vnic_minor_version); + num_data_paths = init_vnic_rsp->num_data_paths; + num_lan_switches = init_vnic_rsp->num_lan_switches; + *features = be32_to_cpu(init_vnic_rsp->features_supported); + *num_addrs = be16_to_cpu(init_vnic_rsp->num_address_entries); + + if (control_chk_vnic_rsp_values(control, num_addrs, + num_data_paths, + num_lan_switches)) + goto failure; + + control->lan_switch.lan_switch_num = + init_vnic_rsp->lan_switch[0].lan_switch_num; + control->lan_switch.num_enet_ports = + init_vnic_rsp->lan_switch[0].num_enet_ports; + control->lan_switch.default_vlan = + init_vnic_rsp->lan_switch[0].default_vlan; + *vlan = be16_to_cpu(control->lan_switch.default_vlan); + memcpy(control->lan_switch.hw_mac_address, + init_vnic_rsp->lan_switch[0].hw_mac_address, ETH_ALEN); + memcpy(mac_address, init_vnic_rsp->lan_switch[0].hw_mac_address, + ETH_ALEN); + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static void copy_recv_pool_config(struct vnic_recv_pool_config *src, + struct vnic_recv_pool_config *dst) +{ + dst->size_recv_pool_entry = src->size_recv_pool_entry; + dst->num_recv_pool_entries = src->num_recv_pool_entries; + dst->timeout_before_kick = src->timeout_before_kick; + dst->num_recv_pool_entries_before_kick = + src->num_recv_pool_entries_before_kick; + dst->num_recv_pool_bytes_before_kick = + src->num_recv_pool_bytes_before_kick; + dst->free_recv_pool_entries_per_update = + 
src->free_recv_pool_entries_per_update; +} + +static int check_recv_pool_config_value(__be32 *src, __be32 *dst, + __be32 *max, __be32 *min, + char *name) +{ + u32 value; + + value = be32_to_cpu(*src); + if (value > be32_to_cpu(*max)) { + CONTROL_ERROR("value %s too large\n", name); + return -1; + } else if (value < be32_to_cpu(*min)) { + CONTROL_ERROR("value %s too small\n", name); + return -1; + } + + *dst = cpu_to_be32(value); + return 0; +} + +static int check_recv_pool_config(struct vnic_recv_pool_config *src, + struct vnic_recv_pool_config *dst, + struct vnic_recv_pool_config *max, + struct vnic_recv_pool_config *min) +{ + if (check_recv_pool_config_value(&src->size_recv_pool_entry, + &dst->size_recv_pool_entry, + &max->size_recv_pool_entry, + &min->size_recv_pool_entry, + "size_recv_pool_entry") + || check_recv_pool_config_value(&src->num_recv_pool_entries, + &dst->num_recv_pool_entries, + &max->num_recv_pool_entries, + &min->num_recv_pool_entries, + "num_recv_pool_entries") + || check_recv_pool_config_value(&src->timeout_before_kick, + &dst->timeout_before_kick, + &max->timeout_before_kick, + &min->timeout_before_kick, + "timeout_before_kick") + || check_recv_pool_config_value(&src-> + num_recv_pool_entries_before_kick, + &dst-> + num_recv_pool_entries_before_kick, + &max-> + num_recv_pool_entries_before_kick, + &min-> + num_recv_pool_entries_before_kick, + "num_recv_pool_entries_before_kick") + || check_recv_pool_config_value(&src-> + num_recv_pool_bytes_before_kick, + &dst-> + num_recv_pool_bytes_before_kick, + &max-> + num_recv_pool_bytes_before_kick, + &min-> + num_recv_pool_bytes_before_kick, + "num_recv_pool_bytes_before_kick") + || check_recv_pool_config_value(&src-> + free_recv_pool_entries_per_update, + &dst-> + free_recv_pool_entries_per_update, + &max-> + free_recv_pool_entries_per_update, + &min-> + free_recv_pool_entries_per_update, + "free_recv_pool_entries_per_update")) + goto failure; + + if 
(!is_power_of2(be32_to_cpu(dst->num_recv_pool_entries))) { + CONTROL_ERROR("num_recv_pool_entries (%u)" + " must be power of 2\n", + be32_to_cpu(dst->num_recv_pool_entries)); + goto failure; + } + + if (!is_power_of2(be32_to_cpu(dst-> + free_recv_pool_entries_per_update))) { + CONTROL_ERROR("free_recv_pool_entries_per_update (%u)" + " must be power of 2\n", + be32_to_cpu(dst->free_recv_pool_entries_per_update)); + goto failure; + } + + if (be32_to_cpu(dst->free_recv_pool_entries_per_update) >= + be32_to_cpu(dst->num_recv_pool_entries)) { + CONTROL_ERROR("free_recv_pool_entries_per_update (%u) must" + " be less than num_recv_pool_entries (%u)\n", + be32_to_cpu(dst->free_recv_pool_entries_per_update), + be32_to_cpu(dst->num_recv_pool_entries)); + goto failure; + } + + if (be32_to_cpu(dst->num_recv_pool_entries_before_kick) >= + be32_to_cpu(dst->num_recv_pool_entries)) { + CONTROL_ERROR("num_recv_pool_entries_before_kick (%u) must" + " be less than num_recv_pool_entries (%u)\n", + be32_to_cpu(dst->num_recv_pool_entries_before_kick), + be32_to_cpu(dst->num_recv_pool_entries)); + goto failure; + } + + return 0; +failure: + return -1; +} + +int control_config_data_path_req(struct control * control, u64 path_id, + struct vnic_recv_pool_config * host, + struct vnic_recv_pool_config * eioc) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_data_path *config_data_path; + + CONTROL_FUNCTION("%s: control_config_data_path_req()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_DATA_PATH); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_data_path = &pkt->cmd.config_data_path_req; + config_data_path->data_path = 0; + config_data_path->path_identifier = path_id; + copy_recv_pool_config(host, + &config_data_path->host_recv_pool_config); + copy_recv_pool_config(eioc, + &config_data_path->eioc_recv_pool_config); + 
CONTROL_PACKET(pkt); + + control->rsp_expected = pkt->hdr.pkt_cmd; + + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_data_path_rsp(struct control * control, + struct vnic_recv_pool_config * host, + struct vnic_recv_pool_config * eioc, + struct vnic_recv_pool_config * max_host, + struct vnic_recv_pool_config * max_eioc, + struct vnic_recv_pool_config * min_host, + struct vnic_recv_pool_config * min_eioc) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_data_path *config_data_path; + + CONTROL_FUNCTION("%s: control_config_data_path_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_DATA_PATH) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + + config_data_path = &pkt->cmd.config_data_path_rsp; + if (config_data_path->data_path != 0) { + CONTROL_ERROR("%s: received CMD_CONFIG_DATA_PATH response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + config_data_path->data_path); + goto failure; + } + + if (check_recv_pool_config(&config_data_path-> + host_recv_pool_config, + host, max_host, min_host) + || check_recv_pool_config(&config_data_path-> + eioc_recv_pool_config, + eioc, max_eioc, min_eioc)) { + goto failure; + } + + 
control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_exchange_pools_req(struct control * control, u64 addr, u32 rkey) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_req()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_EXCHANGE_POOLS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + exchange_pools = &pkt->cmd.exchange_pools_req; + exchange_pools->data_path = 0; + exchange_pools->pool_rkey = cpu_to_be32(rkey); + exchange_pools->pool_addr = cpu_to_be64(addr); + + control->rsp_expected = pkt->hdr.pkt_cmd; + + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_exchange_pools_rsp(struct control * control, u64 * addr, + u32 * rkey) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_exchange_pools *exchange_pools; + + CONTROL_FUNCTION("%s: control_exchange_pools_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = 
control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_EXCHANGE_POOLS) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + + exchange_pools = &pkt->cmd.exchange_pools_rsp; + *rkey = be32_to_cpu(exchange_pools->pool_rkey); + *addr = be64_to_cpu(exchange_pools->pool_addr); + + if (exchange_pools->data_path != 0) { + CONTROL_ERROR("%s: received CMD_EXCHANGE_POOLS response" + " for wrong data path: %u\n", + control_ifcfg_name(control), + exchange_pools->data_path); + goto failure; + } + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_config_link_req(struct control * control, u16 flags, u16 mtu) +{ + struct send_io *send_io; + struct vnic_cmd_config_link *config_link_req; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_config_link_req()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_LINK); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_link_req = &pkt->cmd.config_link_req; + config_link_req->lan_switch_num = + control->lan_switch.lan_switch_num; + config_link_req->cmd_flags = VNIC_FLAG_SET_MTU; + if (flags & IFF_UP) + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_NIC; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_NIC; + if (flags & IFF_ALLMULTI) + config_link_req->cmd_flags |= 
VNIC_FLAG_ENABLE_MCAST_ALL; + else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_MCAST_ALL; + if (flags & IFF_PROMISC) { + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_PROMISC; + /* the EIOU doesn't really do PROMISC mode. + * if PROMISC is set, it only receives unicast packets + * I also have to set MCAST_ALL if I want real + * PROMISC mode. + */ + config_link_req->cmd_flags &= ~VNIC_FLAG_DISABLE_MCAST_ALL; + config_link_req->cmd_flags |= VNIC_FLAG_ENABLE_MCAST_ALL; + } else + config_link_req->cmd_flags |= VNIC_FLAG_DISABLE_PROMISC; + + config_link_req->mtu_size = cpu_to_be16(mtu); + + control->rsp_expected = pkt->hdr.pkt_cmd; + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_link_rsp(struct control * control, u16 * flags, + u16 * mtu) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_link *config_link_rsp; + + CONTROL_FUNCTION("%s: control_config_link_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_LINK) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + config_link_rsp = &pkt->cmd.config_link_rsp; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_NIC) + *flags |= IFF_UP; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + 
*flags |= IFF_ALLMULTI; + if (config_link_rsp->cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + *flags |= IFF_PROMISC; + + *mtu = be16_to_cpu(config_link_rsp->mtu_size); + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +/* control_config_addrs_req: + * return values: + * -1: failure + * 0: incomplete (successful operation, but more address + * table entries to be updated) + * 1: complete + */ +int control_config_addrs_req(struct control *control, + struct vnic_address_op *addrs, u16 num) +{ + u16 i; + u8 j; + int ret = 1; + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_addresses *config_addrs_req; + + CONTROL_FUNCTION("%s: control_config_addrs_req()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_CONFIG_ADDRESSES); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + config_addrs_req = &pkt->cmd.config_addresses_req; + config_addrs_req->lan_switch_num = + control->lan_switch.lan_switch_num; + for (i = 0, j = 0; (i < num) && (j < 16); i++) { + if (!addrs[i].operation) + continue; + config_addrs_req->list_address_ops[j].index = cpu_to_be16(i); + config_addrs_req->list_address_ops[j].operation = + VNIC_OP_SET_ENTRY; + config_addrs_req->list_address_ops[j].valid = addrs[i].valid; + memcpy(config_addrs_req->list_address_ops[j].address, + addrs[i].address, ETH_ALEN); + config_addrs_req->list_address_ops[j].vlan = addrs[i].vlan; + addrs[i].operation = 0; + j++; + } + for (; i < num; i++) { + if (addrs[i].operation) { + ret = 0; + break; + } + 
} + config_addrs_req->num_address_ops = j; + + control->rsp_expected = pkt->hdr.pkt_cmd; + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + if (control_send(control, send_io)) + return -1; + return ret; +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_config_addrs_rsp(struct control * control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_config_addresses *config_addrs_rsp; + + CONTROL_FUNCTION("%s: control_config_addrs_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_CONFIG_ADDRESSES) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + config_addrs_rsp = &pkt->cmd.config_addresses_rsp; + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_report_statistics_req(struct control * control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_report_stats_req *report_statistics_req; + + CONTROL_FUNCTION("%s: control_report_statistics_req()\n", + control_ifcfg_name(control)); + 
dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_REPORT_STATISTICS); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + report_statistics_req = &pkt->cmd.report_statistics_req; + report_statistics_req->lan_switch_num = + control->lan_switch.lan_switch_num; + + control->rsp_expected = pkt->hdr.pkt_cmd; + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_report_statistics_rsp(struct control * control, + struct vnic_cmd_report_stats_rsp * stats) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_report_stats_rsp *rep_stat_rsp; + + CONTROL_FUNCTION("%s: control_report_statistics_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_REPORT_STATISTICS) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + + rep_stat_rsp = &pkt->cmd.report_statistics_rsp; + + stats->if_in_broadcast_pkts = rep_stat_rsp->if_in_broadcast_pkts; + stats->if_in_multicast_pkts = rep_stat_rsp->if_in_multicast_pkts; + stats->if_in_octets = rep_stat_rsp->if_in_octets; + stats->if_in_ucast_pkts = rep_stat_rsp->if_in_ucast_pkts; + stats->if_in_nucast_pkts = 
rep_stat_rsp->if_in_nucast_pkts; + stats->if_in_underrun = rep_stat_rsp->if_in_underrun; + stats->if_in_errors = rep_stat_rsp->if_in_errors; + stats->if_out_errors = rep_stat_rsp->if_out_errors; + stats->if_out_octets = rep_stat_rsp->if_out_octets; + stats->if_out_ucast_pkts = rep_stat_rsp->if_out_ucast_pkts; + stats->if_out_multicast_pkts = rep_stat_rsp->if_out_multicast_pkts; + stats->if_out_broadcast_pkts = rep_stat_rsp->if_out_broadcast_pkts; + stats->if_out_nucast_pkts = rep_stat_rsp->if_out_nucast_pkts; + stats->if_out_ok = rep_stat_rsp->if_out_ok; + stats->if_in_ok = rep_stat_rsp->if_in_ok; + stats->if_out_ucast_bytes = rep_stat_rsp->if_out_ucast_bytes; + stats->if_out_multicast_bytes = rep_stat_rsp->if_out_multicast_bytes; + stats->if_out_broadcast_bytes = rep_stat_rsp->if_out_broadcast_bytes; + stats->if_in_ucast_bytes = rep_stat_rsp->if_in_ucast_bytes; + stats->if_in_multicast_bytes = rep_stat_rsp->if_in_multicast_bytes; + stats->if_in_broadcast_bytes = rep_stat_rsp->if_in_broadcast_bytes; + stats->ethernet_status = rep_stat_rsp->ethernet_status; + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_reset_req(struct control * control) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_req()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_RESET); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + + control->rsp_expected = pkt->hdr.pkt_cmd; + 
dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_reset_rsp(struct control * control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + + CONTROL_FUNCTION("%s: control_reset_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_RESET) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +int control_heartbeat_req(struct control * control, u32 hb_interval) +{ + struct send_io *send_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_req; + + CONTROL_FUNCTION("%s: control_heartbeat_req()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + + send_io = control_init_hdr(control, CMD_HEARTBEAT); + if (!send_io) + goto failure; + + pkt = control_packet(send_io); + heartbeat_req = 
&pkt->cmd.heartbeat_req; + heartbeat_req->hb_interval = cpu_to_be32(hb_interval); + + control->rsp_expected = pkt->hdr.pkt_cmd; + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return control_send(control, send_io); +failure: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + return -1; +} + +int control_heartbeat_rsp(struct control * control) +{ + struct recv_io *recv_io; + struct vnic_control_packet *pkt; + struct vnic_cmd_heartbeat *heartbeat_rsp; + + CONTROL_FUNCTION("%s: control_heartbeat_rsp()\n", + control_ifcfg_name(control)); + dma_sync_single_for_cpu(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + + recv_io = control_get_rsp(control); + if (!recv_io) + goto out; + + pkt = control_packet(recv_io); + if (pkt->hdr.pkt_cmd != CMD_HEARTBEAT) { + CONTROL_ERROR("%s: sent control request:\n", + control_ifcfg_name(control)); + control_log_control_packet(control_last_req(control)); + CONTROL_ERROR("%s: received control response:\n", + control_ifcfg_name(control)); + control_log_control_packet(pkt); + goto failure; + } + + heartbeat_rsp = &pkt->cmd.heartbeat_rsp; + + control_recv(control, recv_io); + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return 0; +failure: + viport_failure(control->parent); +out: + dma_sync_single_for_device(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + return -1; +} + +static int control_init_recv_ios(struct control * control, + struct viport * viport, + struct vnic_control_packet * pkt) +{ + struct io *io; + struct ib_device *ibdev = viport->config->ibdev; + struct control_config *config = control->config; + dma_addr_t recv_dma; + unsigned int i; + + + control->recv_len = 
sizeof *pkt * config->num_recvs; + control->recv_dma = dma_map_single(ibdev->dma_device, + pkt, control->recv_len, + DMA_FROM_DEVICE); + + if (dma_mapping_error(control->recv_dma)) { + CONTROL_ERROR("control recv dma map error\n"); + goto failure; + } + + recv_dma = control->recv_dma; + for (i = 0; i < config->num_recvs; i++) { + io = &control->recv_ios[i].io; + io->viport = viport; + io->routine = control_recv_complete; + io->type = RECV; + + control->recv_ios[i].virtual_addr = (u8 *)pkt; + control->recv_ios[i].list.addr = recv_dma; + control->recv_ios[i].list.length = sizeof *pkt; + control->recv_ios[i].list.lkey = control->mr->lkey; + + recv_dma = recv_dma + sizeof *pkt; + pkt++; + + io->rwr.wr_id = (u64)io; + io->rwr.sg_list = &control->recv_ios[i].list; + io->rwr.num_sge = 1; + if (vnic_ib_post_recv(&control->ib_conn, io)) + goto unmap_recv; + } + + return 0; +unmap_recv: + dma_unmap_single(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); +failure: + return -1; +} + +static int control_init_send_ios(struct control *control, + struct viport *viport, + struct vnic_control_packet * pkt) +{ + struct io * io; + struct ib_device *ibdev = viport->config->ibdev; + + control->send_io.virtual_addr = (u8*)pkt; + control->send_len = sizeof *pkt; + control->send_dma = dma_map_single(ibdev->dma_device, pkt, + control->send_len, + DMA_TO_DEVICE); + if (dma_mapping_error(control->send_dma)) { + CONTROL_ERROR("control send dma map error\n"); + goto failure; + } + + io = &control->send_io.io; + io->viport = viport; + io->routine = control_send_complete; + + control->send_io.list.addr = control->send_dma; + control->send_io.list.length = sizeof *pkt; + control->send_io.list.lkey = control->mr->lkey; + + io->swr.wr_id = (u64)io; + io->swr.sg_list = &control->send_io.list; + io->swr.num_sge = 1; + io->swr.opcode = IB_WR_SEND; + io->swr.send_flags = IB_SEND_SIGNALED; + io->type = SEND; + + return 0; +failure: + return -1; +} + 
+int control_init(struct control * control, struct viport * viport, + struct control_config * config, struct ib_pd * pd) +{ + struct vnic_control_packet *pkt; + unsigned int sz; + + CONTROL_FUNCTION("%s: control_init()\n", + control_ifcfg_name(control)); + control->parent = viport; + control->config = config; + control->ib_conn.viport = viport; + control->ib_conn.ib_config = &config->ib_config; + control->ib_conn.state = IB_CONN_UNINITTED; + control->req_outstanding = 0; + control->seq_num = 0; + control->response = NULL; + control->info = NULL; + INIT_LIST_HEAD(&control->failure_list); + spin_lock_init(&control->io_lock); + + if (vnic_ib_conn_init(&control->ib_conn, viport, pd, + &config->ib_config)) { + CONTROL_ERROR("Control IB connection" + " initialization failed\n"); + goto failure; + } + + control->mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(control->mr)) { + CONTROL_ERROR("%s: failed to register memory" + " for control connection\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + + control->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &control->ib_conn); + if (IS_ERR(control->ib_conn.cm_id)) { + CONTROL_ERROR("creating control CM ID failed\n"); + goto destroy_conn; + } + + sz = sizeof(struct recv_io) * config->num_recvs; + control->recv_ios = vmalloc(sz); + + if (!control->recv_ios) { + CONTROL_ERROR("%s: failed allocating space for recv ios\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + memset(control->recv_ios, 0, sz); + + /* One send buffer and num_recvs recv buffers */ + control->local_storage = kzalloc(sizeof *pkt * + (config->num_recvs + 1), + GFP_KERNEL); + + if (!control->local_storage) { + CONTROL_ERROR("%s: failed allocating space" + " for local storage\n", + control_ifcfg_name(control)); + goto destroy_conn; + } + + pkt = control->local_storage; + if (control_init_send_ios(control, viport, pkt)) + goto free_storage; + + pkt++; + if (control_init_recv_ios(control, viport, 
pkt)) + goto unmap_send; + + return 0; + +unmap_send: + dma_unmap_single(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); +free_storage: + vfree(control->recv_ios); + kfree(control->local_storage); +destroy_conn: + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); +failure: + return -1; +} + +void control_cleanup(struct control *control) +{ + CONTROL_FUNCTION("%s: control_cleanup()\n", + control_ifcfg_name(control)); + init_completion(&control->ib_conn.done); + + if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0)) + printk(KERN_DEBUG "control CM DREQ sending failed\n"); + else + wait_for_completion(&control->ib_conn.done); + control_timer_stop(control); + ib_destroy_cm_id(control->ib_conn.cm_id); + ib_destroy_qp(control->ib_conn.qp); + ib_destroy_cq(control->ib_conn.cq); + ib_dereg_mr(control->mr); + dma_unmap_single(control->parent->config->ibdev->dma_device, + control->send_dma, control->send_len, + DMA_TO_DEVICE); + dma_unmap_single(control->parent->config->ibdev->dma_device, + control->recv_dma, control->recv_len, + DMA_FROM_DEVICE); + vfree(control->recv_ios); + kfree(control->local_storage); + +} + +static void control_log_report_status_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATUS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " lan_switch_num = %u, is_fatal = %u\n", + pkt->cmd.report_status.lan_switch_num, + pkt->cmd.report_status.is_fatal); + printk(KERN_INFO + " status_number = %u, status_info = %u\n", + be32_to_cpu(pkt->cmd.report_status.status_number), + be32_to_cpu(pkt->cmd.report_status.status_info)); + pkt->cmd.report_status.file_name[31] = '\0'; + pkt->cmd.report_status.routine[31] = '\0'; + printk(KERN_INFO " filename = %s, routine = %s\n", + pkt->cmd.report_status.file_name, + pkt->cmd.report_status.routine); + 
printk(KERN_INFO + " line_num = %u, error_parameter = %u\n", + be32_to_cpu(pkt->cmd.report_status.line_num), + be32_to_cpu(pkt->cmd.report_status.error_parameter)); + pkt->cmd.report_status.desc_text[127] = '\0'; + printk(KERN_INFO " desc_text = %s\n", + pkt->cmd.report_status.desc_text); +} + +static void control_log_report_stats_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_REPORT_STATISTICS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " lan_switch_num = %u\n", + pkt->cmd.report_statistics_req.lan_switch_num); + if (pkt->hdr.pkt_type == TYPE_REQ) + return; + printk(KERN_INFO " if_in_broadcast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_pkts)); + printk(" if_in_multicast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_pkts)); + printk(KERN_INFO " if_in_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_octets)); + printk(" if_in_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_pkts)); + printk(KERN_INFO " if_in_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_nucast_pkts)); + printk(" if_in_underrun = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_underrun)); + printk(KERN_INFO " if_in_errors = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_errors)); + printk(" if_out_errors = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_errors)); + printk(KERN_INFO " if_out_octets = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_octets)); + printk(" if_out_ucast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_pkts)); + printk(KERN_INFO " if_out_multicast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. 
+ if_out_multicast_pkts)); + printk(" if_out_broadcast_pkts = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_pkts)); + printk(KERN_INFO " if_out_nucast_pkts = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_nucast_pkts)); + printk(" if_out_ok = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_out_ok)); + printk(KERN_INFO " if_in_ok = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp.if_in_ok)); + printk(" if_out_ucast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_ucast_bytes)); + printk(KERN_INFO " if_out_multicast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_multicast_bytes)); + printk(" if_out_broadcast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_out_broadcast_bytes)); + printk(KERN_INFO " if_in_ucast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_ucast_bytes)); + printk(" if_in_multicast_bytes = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_multicast_bytes)); + printk(KERN_INFO " if_in_broadcast_bytes = %llu", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + if_in_broadcast_bytes)); + printk(" ethernet_status = %llu\n", + be64_to_cpu(pkt->cmd.report_statistics_rsp. + ethernet_status)); +} + +static void control_log_config_link_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_LINK\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " cmd_flags = %x\n", + pkt->cmd.config_link_req.cmd_flags); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_ENABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_NIC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_DISABLE_NIC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_NIC\n"); + if (pkt->cmd.config_link_req. 
+ cmd_flags & VNIC_FLAG_ENABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_MCAST_ALL) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "MCAST_ALL\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_ENABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_ENABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req. + cmd_flags & VNIC_FLAG_DISABLE_PROMISC) + printk(KERN_INFO + " VNIC_FLAG_DISABLE_" + "PROMISC\n"); + if (pkt->cmd.config_link_req.cmd_flags & VNIC_FLAG_SET_MTU) + printk(KERN_INFO + " VNIC_FLAG_SET_MTU\n"); + printk(KERN_INFO + " lan_switch_num = %x, mtu_size = %d\n", + pkt->cmd.config_link_req.lan_switch_num, + be16_to_cpu(pkt->cmd.config_link_req.mtu_size)); + if (pkt->hdr.pkt_type == TYPE_RSP) { + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.config_link_req. + default_vlan), + pkt->cmd.config_link_req.hw_mac_address[0], + pkt->cmd.config_link_req.hw_mac_address[1], + pkt->cmd.config_link_req.hw_mac_address[2], + pkt->cmd.config_link_req.hw_mac_address[3], + pkt->cmd.config_link_req.hw_mac_address[4], + pkt->cmd.config_link_req.hw_mac_address[5]); + } +} + +static void control_log_config_addrs_pkt(struct vnic_control_packet *pkt) +{ + int i; + + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_ADDRESSES\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " num_address_ops = %x," + " lan_switch_num = %d\n", + pkt->cmd.config_addresses_req.num_address_ops, + pkt->cmd.config_addresses_req.lan_switch_num); + for (i = 0; (i < pkt->cmd.config_addresses_req.num_address_ops) + && (i < 16); i++) { + printk(KERN_INFO + " list_address_ops[%u].index" + " = %u\n", + i, + be16_to_cpu(pkt->cmd.config_addresses_req. + list_address_ops[i].index)); + switch (pkt->cmd.config_addresses_req. 
+ list_address_ops[i].operation) { + case VNIC_OP_GET_ENTRY: + printk(KERN_INFO + " list_address_ops[%u]." + "operation = VNIC_OP_GET_ENTRY\n", + i); + break; + case VNIC_OP_SET_ENTRY: + printk(KERN_INFO + " list_address_ops[%u]." + "operation = VNIC_OP_SET_ENTRY\n", + i); + break; + default: + printk(KERN_INFO + " list_address_ops[%u]." + "operation = UNKNOWN(%d)\n", + i, + pkt->cmd.config_addresses_req. + list_address_ops[i].operation); + break; + } + printk(KERN_INFO + " list_address_ops[%u].valid" + " = %u\n", + i, + pkt->cmd.config_addresses_req. + list_address_ops[i].valid); + printk(KERN_INFO + " list_address_ops[%u].address" + " = %02x:%02x:%02x:%02x:%02x:%02x\n", + i, + pkt->cmd.config_addresses_req. + list_address_ops[i].address[0], + pkt->cmd.config_addresses_req. + list_address_ops[i].address[1], + pkt->cmd.config_addresses_req. + list_address_ops[i].address[2], + pkt->cmd.config_addresses_req. + list_address_ops[i].address[3], + pkt->cmd.config_addresses_req. + list_address_ops[i].address[4], + pkt->cmd.config_addresses_req. + list_address_ops[i].address[5]); + printk(KERN_INFO + " list_address_ops[%u].vlan" + " = %u\n", + i, + be16_to_cpu(pkt->cmd.config_addresses_req. 
+ list_address_ops[i].vlan)); + } + +} + +static void control_log_exch_pools_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_EXCHANGE_POOLS\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " datapath = %u\n", + pkt->cmd.exchange_pools_req.data_path); + printk(KERN_INFO " pool_rkey = %08x" + " pool_addr = %llx\n", + be32_to_cpu(pkt->cmd.exchange_pools_req.pool_rkey), + be64_to_cpu(pkt->cmd.exchange_pools_req.pool_addr)); +} + +static void control_log_data_path_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_CONFIG_DATA_PATH\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " path_identifier = %llx," + " data_path = %u\n", + pkt->cmd.config_data_path_req.path_identifier, + pkt->cmd.config_data_path_req.data_path); + printk(KERN_INFO + "host config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + host_recv_pool_config. 
+ free_recv_pool_entries_per_update)); + printk(KERN_INFO + "eioc config size_recv_pool_entry = %u," + " num_recv_pool_entries = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.size_recv_pool_entry), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.num_recv_pool_entries)); + printk(KERN_INFO + " timeout_before_kick = %u," + " num_recv_pool_entries_before_kick = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config.timeout_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_entries_before_kick)); + printk(KERN_INFO + " num_recv_pool_bytes_before_kick = %u," + " free_recv_pool_entries_per_update = %u\n", + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + num_recv_pool_bytes_before_kick), + be32_to_cpu(pkt->cmd.config_data_path_req. + eioc_recv_pool_config. + free_recv_pool_entries_per_update)); +} + +static void control_log_init_vnic_pkt(struct vnic_control_packet *pkt) +{ + printk(KERN_INFO + " pkt_cmd = CMD_INIT_VNIC\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO + " vnic_major_version = %u," + " vnic_minor_version = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_major_version), + be16_to_cpu(pkt->cmd.init_vnic_req.vnic_minor_version)); + if (pkt->hdr.pkt_type == TYPE_REQ) { + printk(KERN_INFO + " vnic_instance = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_req.vnic_instance, + pkt->cmd.init_vnic_req.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u\n", + be16_to_cpu(pkt->cmd.init_vnic_req. + num_address_entries)); + } else { + printk(KERN_INFO + " num_lan_switches = %u," + " num_data_paths = %u\n", + pkt->cmd.init_vnic_rsp.num_lan_switches, + pkt->cmd.init_vnic_rsp.num_data_paths); + printk(KERN_INFO + " num_address_entries = %u," + " features_supported = %08x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. 
+ num_address_entries), + be32_to_cpu(pkt->cmd.init_vnic_rsp. + features_supported)); + if (pkt->cmd.init_vnic_rsp.num_lan_switches != 0) { + printk(KERN_INFO + "lan_switch[0] lan_switch_num = %u," + " num_enet_ports = %08x\n", + pkt->cmd.init_vnic_rsp. + lan_switch[0].lan_switch_num, + pkt->cmd.init_vnic_rsp. + lan_switch[0].num_enet_ports); + printk(KERN_INFO + " default_vlan = %u," + " hw_mac_address =" + " %02x:%02x:%02x:%02x:%02x:%02x\n", + be16_to_cpu(pkt->cmd.init_vnic_rsp. + lan_switch[0].default_vlan), + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[0], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[1], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[2], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[3], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[4], + pkt->cmd.init_vnic_rsp.lan_switch[0]. + hw_mac_address[5]); + } + } +} + +static void control_log_control_packet(struct vnic_control_packet *pkt) +{ + switch (pkt->hdr.pkt_type) { + case TYPE_INFO: + printk(KERN_INFO "control_packet: pkt_type = TYPE_INFO\n"); + break; + case TYPE_REQ: + printk(KERN_INFO "control_packet: pkt_type = TYPE_REQ\n"); + break; + case TYPE_RSP: + printk(KERN_INFO "control_packet: pkt_type = TYPE_RSP\n"); + break; + case TYPE_ERR: + printk(KERN_INFO "control_packet: pkt_type = TYPE_ERR\n"); + break; + default: + printk(KERN_INFO "control_packet: pkt_type = UNKNOWN\n"); + } + + switch (pkt->hdr.pkt_cmd) { + case CMD_INIT_VNIC: + control_log_init_vnic_pkt(pkt); + break; + case CMD_CONFIG_DATA_PATH: + control_log_data_path_pkt(pkt); + break; + case CMD_EXCHANGE_POOLS: + control_log_exch_pools_pkt(pkt); + break; + case CMD_CONFIG_ADDRESSES: + control_log_config_addrs_pkt(pkt); + break; + case CMD_CONFIG_LINK: + control_log_config_link_pkt(pkt); + break; + case CMD_REPORT_STATISTICS: + control_log_report_stats_pkt(pkt); + break; + case CMD_CLEAR_STATISTICS: + printk(KERN_INFO + " pkt_cmd = CMD_CLEAR_STATISTICS\n"); + printk(KERN_INFO + " 
pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_REPORT_STATUS: + control_log_report_status_pkt(pkt); + + break; + case CMD_RESET: + printk(KERN_INFO + " pkt_cmd = CMD_RESET\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + case CMD_HEARTBEAT: + printk(KERN_INFO + " pkt_cmd = CMD_HEARTBEAT\n"); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + printk(KERN_INFO " hb_interval = %d\n", + be32_to_cpu(pkt->cmd.heartbeat_req.hb_interval)); + break; + default: + printk(KERN_INFO + " pkt_cmd = UNKNOWN (%u)\n", + pkt->hdr.pkt_cmd); + printk(KERN_INFO + " pkt_seq_num = %u," + " pkt_retry_count = %u\n", + pkt->hdr.pkt_seq_num, + pkt->hdr.pkt_retry_count); + break; + } +} diff --git a/drivers/infiniband/ulp/vnic/vnic_control.h b/drivers/infiniband/ulp/vnic/vnic_control.h new file mode 100644 index 0000000..9dfc741 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_control.h @@ -0,0 +1,146 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_CONTROL_H_INCLUDED +#define VNIC_CONTROL_H_INCLUDED + +#ifdef CONFIG_INFINIBAND_VNIC_STATS +#include +#include +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" + +enum control_timer_state { + TIMER_IDLE = 0, + TIMER_ACTIVE = 1, + TIMER_EXPIRED = 2 +}; + +struct control { + struct viport *parent; + struct control_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + struct vnic_control_packet *local_storage; + int send_len; + int recv_len; + u16 maj_ver; + u16 min_ver; + struct vnic_lan_switch_attribs lan_switch; + struct send_io send_io; + struct recv_io *recv_ios; + dma_addr_t send_dma; + dma_addr_t recv_dma; + enum control_timer_state timer_state; + struct timer_list timer; + u8 req_retry_counter; + u8 req_outstanding; + u8 seq_num; + u8 rsp_expected; + struct recv_io *response; + struct recv_io *info; + struct list_head failure_list; + spinlock_t io_lock; + struct completion done; +#ifdef CONFIG_INFINIBAND_VNIC_STATS + struct { + cycles_t request_time; /* intermediate value */ + cycles_t response_time; + u32 response_num; + cycles_t response_max; + cycles_t response_min; + u32 timeout_num; + } statistics; +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ +}; + +int 
control_init(struct control *control, struct viport *viport, + struct control_config *config, struct ib_pd *pd); + +void control_cleanup(struct control *control); + +void control_process_async(struct control *control); + +int control_init_vnic_req(struct control *control); +int control_init_vnic_rsp(struct control *control, u32 * features, + u8 * mac_address, u16 * num_addrs, u16 * vlan); + +int control_config_data_path_req(struct control *control, u64 path_id, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc); +int control_config_data_path_rsp(struct control *control, + struct vnic_recv_pool_config *host, + struct vnic_recv_pool_config *eioc, + struct vnic_recv_pool_config *max_host, + struct vnic_recv_pool_config *max_eioc, + struct vnic_recv_pool_config *min_host, + struct vnic_recv_pool_config *min_eioc); + +int control_exchange_pools_req(struct control *control, + u64 addr, u32 rkey); +int control_exchange_pools_rsp(struct control *control, + u64 * addr, u32 * rkey); + +int control_config_link_req(struct control *control, + u16 flags, u16 mtu); +int control_config_link_rsp(struct control *control, + u16 * flags, u16 * mtu); + +int control_config_addrs_req(struct control *control, + struct vnic_address_op *addrs, u16 num); +int control_config_addrs_rsp(struct control *control); + +int control_report_statistics_req(struct control *control); +int control_report_statistics_rsp(struct control *control, + struct vnic_cmd_report_stats_rsp *stats); + +int control_heartbeat_req(struct control *control, u32 hb_interval); +int control_heartbeat_rsp(struct control *control); + +int control_reset_req(struct control *control); +int control_reset_rsp(struct control *control); + + +#define control_packet(io) \ + (struct vnic_control_packet *)(io)->virtual_addr +#define control_is_connected(control) \ + (vnic_ib_conn_connected(&((control)->ib_conn))) + +#define control_last_req(control) control_packet(&(control)->send_io) +#define 
control_features(control) (control)->features_supported + +#define control_get_mac_address(control,addr) \ + memcpy(addr,(control)->lan_switch.hw_mac_address, ETH_ALEN) + +#endif /* VNIC_CONTROL_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/vnic/vnic_control_pkt.h b/drivers/infiniband/ulp/vnic/vnic_control_pkt.h new file mode 100644 index 0000000..a7d4fb9 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_control_pkt.h @@ -0,0 +1,292 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_CONTROL_PKT_H_INCLUDED +#define VNIC_CONTROL_PKT_H_INCLUDED + +#include + +#define VNIC_MAX_NODENAME_LEN 64 + +struct vnic_connection_data { + u64 path_id; + u8 vnic_instance; + u8 path_num; + u8 nodename[VNIC_MAX_NODENAME_LEN + 1]; +}; + +struct vnic_control_header { + u8 pkt_type; + u8 pkt_cmd; + u8 pkt_seq_num; + u8 pkt_retry_count; + u32 reserved; /* for 64-bit alignment */ +}; + +/* pkt_type values */ +enum { + TYPE_INFO = 0, + TYPE_REQ = 1, + TYPE_RSP = 2, + TYPE_ERR = 3 +}; + +/* pkt_cmd values */ +enum { + CMD_INIT_VNIC = 1, + CMD_CONFIG_DATA_PATH = 2, + CMD_EXCHANGE_POOLS = 3, + CMD_CONFIG_ADDRESSES = 4, + CMD_CONFIG_LINK = 5, + CMD_REPORT_STATISTICS = 6, + CMD_CLEAR_STATISTICS = 7, + CMD_REPORT_STATUS = 8, + CMD_RESET = 9, + CMD_HEARTBEAT = 10 +}; + +/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_REQ data format */ +struct vnic_cmd_init_vnic_req { + __be16 vnic_major_version; + __be16 vnic_minor_version; + u8 vnic_instance; + u8 num_data_paths; + __be16 num_address_entries; +}; + +/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP subdata format */ +struct vnic_lan_switch_attribs { + u8 lan_switch_num; + u8 num_enet_ports; + __be16 default_vlan; + u8 hw_mac_address[ETH_ALEN]; +}; + +/* pkt_cmd CMD_INIT_VNIC, pkt_type TYPE_RSP data format */ +struct vnic_cmd_init_vnic_rsp { + __be16 vnic_major_version; + __be16 vnic_minor_version; + u8 num_lan_switches; + u8 num_data_paths; + __be16 num_address_entries; + __be32 features_supported; + struct vnic_lan_switch_attribs lan_switch[1]; +}; + +/* features_supported values */ +enum { + VNIC_FEAT_IPV4_HEADERS = 0x0001, + VNIC_FEAT_IPV6_HEADERS = 0x0002, + VNIC_FEAT_IPV4_CSUM_RX = 0x0004, + VNIC_FEAT_IPV4_CSUM_TX = 0x0008, + VNIC_FEAT_TCP_CSUM_RX = 0x0010, + VNIC_FEAT_TCP_CSUM_TX = 0x0020, + VNIC_FEAT_UDP_CSUM_RX = 0x0040, + VNIC_FEAT_UDP_CSUM_TX = 0x0080, + VNIC_FEAT_TCP_SEGMENT = 0x0100, + VNIC_FEAT_IPV4_IPSEC_OFFLOAD = 0x0200, + VNIC_FEAT_IPV6_IPSEC_OFFLOAD = 0x0400, + VNIC_FEAT_FCS_PROPAGATE = 0x0800, +
VNIC_FEAT_PF_KICK = 0x1000, + VNIC_FEAT_PF_FORCE_ROUTE = 0x2000, + VNIC_FEAT_CHASH_OFFLOAD = 0x4000 +}; + +/* pkt_cmd CMD_CONFIG_DATA_PATH subdata format */ +struct vnic_recv_pool_config { + __be32 size_recv_pool_entry; + __be32 num_recv_pool_entries; + __be32 timeout_before_kick; + __be32 num_recv_pool_entries_before_kick; + __be32 num_recv_pool_bytes_before_kick; + __be32 free_recv_pool_entries_per_update; +}; + +/* pkt_cmd CMD_CONFIG_DATA_PATH data format */ +struct vnic_cmd_config_data_path { + u64 path_identifier; + u8 data_path; + u8 reserved[3]; + struct vnic_recv_pool_config host_recv_pool_config; + struct vnic_recv_pool_config eioc_recv_pool_config; +}; + +/* pkt_cmd CMD_EXCHANGE_POOLS data format */ +struct vnic_cmd_exchange_pools { + u8 data_path; + u8 reserved[3]; + __be32 pool_rkey; + __be64 pool_addr; +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES subdata format */ +struct vnic_address_op { + __be16 index; + u8 operation; + u8 valid; + u8 address[6]; + __be16 vlan; +}; + +/* operation values */ +enum { + VNIC_OP_SET_ENTRY = 0x01, + VNIC_OP_GET_ENTRY = 0x02 +}; + +/* pkt_cmd CMD_CONFIG_ADDRESSES data format */ +struct vnic_cmd_config_addresses { + u8 num_address_ops; + u8 lan_switch_num; + struct vnic_address_op list_address_ops[1]; +}; + +/* CMD_CONFIG_LINK data format */ +struct vnic_cmd_config_link { + u8 cmd_flags; + u8 lan_switch_num; + __be16 mtu_size; + __be16 default_vlan; + u8 hw_mac_address[6]; +}; + +/* cmd_flags values */ +enum { + VNIC_FLAG_ENABLE_NIC = 0x01, + VNIC_FLAG_DISABLE_NIC = 0x02, + VNIC_FLAG_ENABLE_MCAST_ALL = 0x04, + VNIC_FLAG_DISABLE_MCAST_ALL = 0x08, + VNIC_FLAG_ENABLE_PROMISC = 0x10, + VNIC_FLAG_DISABLE_PROMISC = 0x20, + VNIC_FLAG_SET_MTU = 0x40 +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_REQ data format */ +struct vnic_cmd_report_stats_req { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATISTICS, pkt_type TYPE_RSP data format */ +struct vnic_cmd_report_stats_rsp { + u8 lan_switch_num; + u8 reserved[7]; /* for 
64-bit alignment */ + __be64 if_in_broadcast_pkts; + __be64 if_in_multicast_pkts; + __be64 if_in_octets; + __be64 if_in_ucast_pkts; + __be64 if_in_nucast_pkts; /* if_in_broadcast_pkts + + if_in_multicast_pkts */ + __be64 if_in_underrun; /* (OID_GEN_RCV_NO_BUFFER) */ + __be64 if_in_errors; /* (OID_GEN_RCV_ERROR) */ + __be64 if_out_errors; /* (OID_GEN_XMIT_ERROR) */ + __be64 if_out_octets; + __be64 if_out_ucast_pkts; + __be64 if_out_multicast_pkts; + __be64 if_out_broadcast_pkts; + __be64 if_out_nucast_pkts; /* if_out_broadcast_pkts + + if_out_multicast_pkts */ + __be64 if_out_ok; /* if_out_nucast_pkts + + if_out_ucast_pkts(OID_GEN_XMIT_OK) */ + __be64 if_in_ok; /* if_in_nucast_pkts + + if_in_ucast_pkts(OID_GEN_RCV_OK) */ + __be64 if_out_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_XMT) */ + __be64 if_out_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_XMT) */ + __be64 if_out_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_XMT) */ + __be64 if_in_ucast_bytes; /* (OID_GEN_DIRECTED_BYTES_RCV) */ + __be64 if_in_multicast_bytes; /* (OID_GEN_MULTICAST_BYTES_RCV) */ + __be64 if_in_broadcast_bytes; /* (OID_GEN_BROADCAST_BYTES_RCV) */ + __be64 ethernet_status; /* (OID_GEN_MEDIA_CONNECT_STATUS) */ +}; + +/* pkt_cmd CMD_CLEAR_STATISTICS data format */ +struct vnic_cmd_clear_statistics { + u8 lan_switch_num; +}; + +/* pkt_cmd CMD_REPORT_STATUS data format */ +struct vnic_cmd_report_status { + u8 lan_switch_num; + u8 is_fatal; + u8 reserved[2]; /* for 32-bit alignment */ + __be32 status_number; + __be32 status_info; + u8 file_name[32]; + u8 routine[32]; + __be32 line_num; + __be32 error_parameter; + u8 desc_text[128]; +}; + +/* pkt_cmd CMD_HEARTBEAT data format */ +struct vnic_cmd_heartbeat { + __be32 hb_interval; +}; + +enum { + VNIC_STATUS_LINK_UP = 1, + VNIC_STATUS_LINK_DOWN = 2, + VNIC_STATUS_ENET_AGGREGATION_CHANGE = 3, + VNIC_STATUS_EIOC_SHUTDOWN = 4, + VNIC_STATUS_CONTROL_ERROR = 5, + VNIC_STATUS_EIOC_ERROR = 6 +}; + +#define VNIC_MAX_CONTROLPKTSZ 256 +#define VNIC_MAX_CONTROLDATASZ \
+ (VNIC_MAX_CONTROLPKTSZ - sizeof(struct vnic_control_header)) + +struct vnic_control_packet { + struct vnic_control_header hdr; + union { + struct vnic_cmd_init_vnic_req init_vnic_req; + struct vnic_cmd_init_vnic_rsp init_vnic_rsp; + struct vnic_cmd_config_data_path config_data_path_req; + struct vnic_cmd_config_data_path config_data_path_rsp; + struct vnic_cmd_exchange_pools exchange_pools_req; + struct vnic_cmd_exchange_pools exchange_pools_rsp; + struct vnic_cmd_config_addresses config_addresses_req; + struct vnic_cmd_config_addresses config_addresses_rsp; + struct vnic_cmd_config_link config_link_req; + struct vnic_cmd_config_link config_link_rsp; + struct vnic_cmd_report_stats_req report_statistics_req; + struct vnic_cmd_report_stats_rsp report_statistics_rsp; + struct vnic_cmd_clear_statistics clear_statistics_req; + struct vnic_cmd_clear_statistics clear_statistics_rsp; + struct vnic_cmd_report_status report_status; + struct vnic_cmd_heartbeat heartbeat_req; + struct vnic_cmd_heartbeat heartbeat_rsp; + + char cmd_data[VNIC_MAX_CONTROLDATASZ]; + } cmd; +}; + +#endif /* VNIC_CONTROL_PKT_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:56:31 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:26:31 +0530 Subject: [openib-general] [PATCH v2 5/11] Implementation of Data path of the communication protocol Message-ID: <455A2677.7102.612A002@ramachandra.kuchimanchi.qlogic.com> Adds the files that implement the data transfer part of the communication protocol with the VEx. The RDMA of Ethernet packets is implemented here.
Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_data.c | 1114 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/vnic_data.h | 182 +++++ drivers/infiniband/ulp/vnic/vnic_trailer.h | 103 +++ 3 files changed, 1399 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_data.c b/drivers/infiniband/ulp/vnic/vnic_data.c new file mode 100644 index 0000000..2579697 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_data.c @@ -0,0 +1,1114 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_config.h" +#include "vnic_data.h" +#include "vnic_trailer.h" +#include "vnic_stats.h" + +static void data_received_kick(struct io *io); +static void data_xmit_complete(struct io *io); + +u32 min_rcv_skb = 60; +module_param(min_rcv_skb, uint, 0444); +MODULE_PARM_DESC(min_rcv_skb, "Packets of size (in bytes) less than" + " or equal to this value will be copied during receive." + " Default 60"); + +u32 min_xmt_skb = 60; +module_param(min_xmt_skb, uint, 0444); +MODULE_PARM_DESC(min_xmt_skb, "Packets of size (in bytes) less than" + " or equal to this value will be copied during transmit." + " Default 60"); + +int data_init(struct data * data, struct viport * viport, + struct data_config * config, struct ib_pd *pd) +{ + DATA_FUNCTION("data_init()\n"); + + data->parent = viport; + data->config = config; + data->ib_conn.viport = viport; + data->ib_conn.ib_config = &config->ib_config; + data->ib_conn.state = IB_CONN_UNINITTED; + + if ((min_xmt_skb < 60) || (min_xmt_skb > 9000)) { + DATA_ERROR("min_xmt_skb (%u) must be between 60 and 9000\n", + min_xmt_skb); + goto failure; + } + if (vnic_ib_conn_init(&data->ib_conn, viport, pd, + &config->ib_config)) { + DATA_ERROR("Data IB connection initialization failed\n"); + goto failure; + } + data->mr = ib_get_dma_mr(pd, + IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE); + if (IS_ERR(data->mr)) { + DATA_ERROR("failed to register memory for" + " data connection\n"); + goto destroy_conn; + } + + data->ib_conn.cm_id = ib_create_cm_id(viport->config->ibdev, + vnic_ib_cm_handler, + &data->ib_conn); + + if (IS_ERR(data->ib_conn.cm_id)) { + DATA_ERROR("creating data CM ID failed\n"); + goto destroy_conn; + } + + return 0; + +destroy_conn: + ib_destroy_qp(data->ib_conn.qp); + ib_destroy_cq(data->ib_conn.cq); +failure: + return -1; +} + +static void data_post_recvs(struct data *data) +{ +
unsigned long flags; + + DATA_FUNCTION("data_post_recvs()\n"); + spin_lock_irqsave(&data->recv_ios_lock, flags); + while (!list_empty(&data->recv_ios)) { + struct io *io = list_entry(data->recv_ios.next, + struct io, list_ptrs); + struct recv_io *recv_io = (struct recv_io *)io; + + list_del(&recv_io->io.list_ptrs); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + if (vnic_ib_post_recv(&data->ib_conn, &recv_io->io)) { + viport_failure(data->parent); + return; + } + spin_lock_irqsave(&data->recv_ios_lock, flags); + } + spin_unlock_irqrestore(&data->recv_ios_lock, flags); +} + +static void data_init_pool_work_reqs(struct data * data, + struct recv_io * recv_io) +{ + struct recv_pool *recv_pool = &data->recv_pool; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct rdma_dest *rdma_dest; + dma_addr_t xmit_dma; + u8 *xmit_data; + unsigned int i; + + INIT_LIST_HEAD(&data->recv_ios); + spin_lock_init(&data->recv_ios_lock); + spin_lock_init(&data->xmit_buf_lock); + for (i = 0; i < data->config->num_recvs; i++) { + recv_io[i].io.viport = data->parent; + recv_io[i].io.routine = data_received_kick; + recv_io[i].list.addr = data->region_data_dma; + recv_io[i].list.length = 4; + recv_io[i].list.lkey = data->mr->lkey; + + recv_io[i].io.rwr.wr_id = (u64)&recv_io[i].io; + recv_io[i].io.rwr.sg_list = &recv_io[i].list; + recv_io[i].io.rwr.num_sge = 1; + + list_add(&recv_io[i].io.list_ptrs, &data->recv_ios); + } + + INIT_LIST_HEAD(&recv_pool->avail_recv_bufs); + for (i = 0; i < recv_pool->pool_sz; i++) { + rdma_dest = &recv_pool->recv_bufs[i]; + list_add(&rdma_dest->list_ptrs, + &recv_pool->avail_recv_bufs); + } + + xmit_dma = xmit_pool->xmitdata_dma; + xmit_data = xmit_pool->xmit_data; + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + rdma_io = &xmit_pool->xmit_bufs[i]; + rdma_io->index = i; + rdma_io->io.viport = data->parent; + rdma_io->io.routine = data_xmit_complete; + + rdma_io->list[0].lkey = data->mr->lkey; + rdma_io->list[1].lkey 
= data->mr->lkey; + rdma_io->io.swr.wr_id = (u64)rdma_io; + rdma_io->io.swr.sg_list = rdma_io->list; + rdma_io->io.swr.num_sge = 2; + rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE; + rdma_io->io.swr.send_flags = IB_SEND_SIGNALED; + rdma_io->io.type = RDMA; + + rdma_io->data = xmit_data; + rdma_io->data_dma = xmit_dma; + + xmit_data += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT); + xmit_dma += ALIGN(min_xmt_skb, VIPORT_TRAILER_ALIGNMENT); + rdma_io->trailer = (struct viport_trailer *)xmit_data; + rdma_io->trailer_dma = xmit_dma; + xmit_data += sizeof(struct viport_trailer); + xmit_dma += sizeof(struct viport_trailer); + } + + xmit_pool->rdma_rkey = data->mr->rkey; + xmit_pool->rdma_addr = xmit_pool->buf_pool_dma; +} + +static void data_init_free_bufs_swrs(struct data * data) +{ + struct rdma_io *rdma_io; + struct send_io *send_io; + + rdma_io = &data->free_bufs_io; + rdma_io->io.viport = data->parent; + rdma_io->io.routine = NULL; + + rdma_io->list[0].lkey = data->mr->lkey; + + rdma_io->io.swr.wr_id = (u64)rdma_io; + rdma_io->io.swr.sg_list = rdma_io->list; + rdma_io->io.swr.num_sge = 1; + rdma_io->io.swr.opcode = IB_WR_RDMA_WRITE; + rdma_io->io.swr.send_flags = IB_SEND_SIGNALED; + rdma_io->io.type = RDMA; + + send_io = &data->kick_io; + send_io->io.viport = data->parent; + send_io->io.routine = NULL; + + send_io->list.addr = data->region_data_dma; + send_io->list.length = 0; + send_io->list.lkey = data->mr->lkey; + + send_io->io.swr.wr_id = (u64)send_io; + send_io->io.swr.sg_list = &send_io->list; + send_io->io.swr.num_sge = 1; + send_io->io.swr.opcode = IB_WR_SEND; + send_io->io.swr.send_flags = IB_SEND_SIGNALED; + send_io->io.type = SEND; +} + +static int data_init_buf_pools(struct data * data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = data->parent; + + recv_pool->buf_pool_len = + sizeof(struct buff_pool_entry) * recv_pool->eioc_pool_sz; + + recv_pool->buf_pool = 
kzalloc(recv_pool->buf_pool_len, GFP_KERNEL); + + if (!recv_pool->buf_pool) { + DATA_ERROR("failed allocating %d bytes" + " for recv pool bufpool\n", + recv_pool->buf_pool_len); + goto failure; + } + + recv_pool->buf_pool_dma = + dma_map_single(viport->config->ibdev->dma_device, + recv_pool->buf_pool, recv_pool->buf_pool_len, + DMA_TO_DEVICE); + + if (dma_mapping_error(recv_pool->buf_pool_dma)) { + DATA_ERROR("recv buf_pool dma map error\n"); + goto free_recv_pool; + } + + xmit_pool->buf_pool_len = + sizeof(struct buff_pool_entry) * xmit_pool->pool_sz; + xmit_pool->buf_pool = kzalloc(xmit_pool->buf_pool_len, GFP_KERNEL); + + if (!xmit_pool->buf_pool) { + DATA_ERROR("failed allocating %d bytes" + " for xmit pool bufpool\n", + xmit_pool->buf_pool_len); + goto unmap_recv_pool; + } + + xmit_pool->buf_pool_dma = + dma_map_single(viport->config->ibdev->dma_device, + xmit_pool->buf_pool, xmit_pool->buf_pool_len, + DMA_FROM_DEVICE); + + if (dma_mapping_error(xmit_pool->buf_pool_dma)) { + DATA_ERROR("xmit buf_pool dma map error\n"); + goto free_xmit_pool; + } + + xmit_pool->xmit_data = kzalloc(xmit_pool->xmitdata_len, GFP_KERNEL); + + if (!xmit_pool->xmit_data) { + DATA_ERROR("failed allocating %d bytes for xmit data\n", + xmit_pool->xmitdata_len); + goto unmap_xmit_pool; + } + + xmit_pool->xmitdata_dma = + dma_map_single(viport->config->ibdev->dma_device, + xmit_pool->xmit_data, xmit_pool->xmitdata_len, + DMA_TO_DEVICE); + + if (dma_mapping_error(xmit_pool->xmitdata_dma)) { + DATA_ERROR("xmit data dma map error\n"); + goto free_xmit_data; + } + + return 0; + +free_xmit_data: + kfree(xmit_pool->xmit_data); +unmap_xmit_pool: + dma_unmap_single(data->parent->config->ibdev->dma_device, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); +free_xmit_pool: + kfree(xmit_pool->buf_pool); +unmap_recv_pool: + dma_unmap_single(data->parent->config->ibdev->dma_device, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); +free_recv_pool: + 
kfree(recv_pool->buf_pool); +failure: + return -1; +} + +static void data_init_xmit_pool(struct data * data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + + xmit_pool->pool_sz = + be32_to_cpu(data->eioc_pool_parms.num_recv_pool_entries); + xmit_pool->buffer_sz = + be32_to_cpu(data->eioc_pool_parms.size_recv_pool_entry); + + xmit_pool->notify_count = 0; + xmit_pool->notify_bundle = data->config->notify_bundle; + xmit_pool->next_xmit_pool = 0; + xmit_pool->num_xmit_bufs = xmit_pool->notify_bundle * 2; + xmit_pool->next_xmit_buf = 0; + xmit_pool->last_comp_buf = xmit_pool->num_xmit_bufs - 1; + + xmit_pool->kick_count = 0; + xmit_pool->kick_byte_count = 0; + + xmit_pool->send_kicks = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick) + || be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + xmit_pool->kick_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_entries_before_kick); + xmit_pool->kick_byte_bundle = + be32_to_cpu(data-> + eioc_pool_parms.num_recv_pool_bytes_before_kick); + + xmit_pool->need_buffers = 1; + + xmit_pool->xmitdata_len = + BUFFER_SIZE(min_xmt_skb) * xmit_pool->num_xmit_bufs; +} + +static void data_init_recv_pool(struct data * data) +{ + struct recv_pool *recv_pool = &data->recv_pool; + + recv_pool->pool_sz = data->config->host_recv_pool_entries; + recv_pool->eioc_pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + if (recv_pool->pool_sz > recv_pool->eioc_pool_sz) + recv_pool->pool_sz = + be32_to_cpu(data->host_pool_parms.num_recv_pool_entries); + + recv_pool->buffer_sz = + be32_to_cpu(data->host_pool_parms.size_recv_pool_entry); + + recv_pool->sz_free_bundle = + be32_to_cpu(data-> + host_pool_parms.free_recv_pool_entries_per_update); + recv_pool->num_free_bufs = 0; + recv_pool->num_posted_bufs = 0; + + recv_pool->next_full_buf = 0; + recv_pool->next_free_buf = 0; + recv_pool->kick_on_free = 0; +} + +int data_connect(struct data * data) +{ + struct xmit_pool 
*xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + struct recv_io * recv_io; + unsigned int sz; + struct viport *viport = data->parent; + + DATA_FUNCTION("data_connect()\n"); + + data_init_recv_pool(data); + data_init_xmit_pool(data); + + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz + + sizeof(struct recv_io) * data->config->num_recvs + + sizeof(struct rdma_io) * xmit_pool->num_xmit_bufs; + + data->local_storage = vmalloc(sz); + + if (!data->local_storage) { + DATA_ERROR("failed allocating %d bytes" + " local storage\n", sz); + goto out; + } + + memset(data->local_storage, 0, sz); + + recv_pool->recv_bufs = (struct rdma_dest *)data->local_storage; + sz = sizeof(struct rdma_dest) * recv_pool->pool_sz; + + recv_io = (struct recv_io *)(data->local_storage + sz); + sz += sizeof(struct recv_io) * data->config->num_recvs; + + xmit_pool->xmit_bufs = (struct rdma_io *)(data->local_storage + sz); + data->region_data = kzalloc(4, GFP_KERNEL); + + if (!data->region_data) { + DATA_ERROR("failed to alloc memory for region data\n"); + goto free_local_storage; + } + + data->region_data_dma = + dma_map_single(viport->config->ibdev->dma_device, + data->region_data, 4, DMA_BIDIRECTIONAL); + + if (dma_mapping_error(data->region_data_dma)) { + DATA_ERROR("region data dma map error\n"); + goto free_region_data; + } + + if (data_init_buf_pools(data)) + goto unmap_region_data; + + data_init_free_bufs_swrs(data); + data_init_pool_work_reqs(data, recv_io); + + data_post_recvs(data); + + if (vnic_ib_cm_connect(&data->ib_conn)) + goto unmap_region_data; + + return 0; + +unmap_region_data: + dma_unmap_single(data->parent->config->ibdev->dma_device, + data->region_data_dma, 4, DMA_BIDIRECTIONAL); +free_region_data: + kfree(data->region_data); +free_local_storage: + vfree(data->local_storage); +out: + return -1; +} + +static void data_add_free_buffer(struct data *data, int index, + struct rdma_dest *rdma_dest) +{ + struct recv_pool *pool = &data->recv_pool; 
+ struct buff_pool_entry *bpe; + + DATA_FUNCTION("data_add_free_buffer()\n"); + rdma_dest->trailer->connection_hash_and_valid = 0; + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe = &pool->buf_pool[index]; + bpe->rkey = cpu_to_be32(data->mr->rkey); + + bpe->remote_addr = cpu_to_be64((unsigned long long) + virt_to_phys(rdma_dest->data)); + bpe->valid = (u32) (rdma_dest - &pool->recv_bufs[0]) + 1; + ++pool->num_free_bufs; + + dma_sync_single_for_device(data->parent->config->ibdev->dma_device, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); +} + +/* NOTE: this routine is not reentrant */ +static void data_alloc_buffers(struct data *data, int initial_allocation) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct sk_buff *skb; + int index; + + DATA_FUNCTION("data_alloc_buffers()\n"); + index = ADD(pool->next_free_buf, pool->num_free_bufs, + pool->eioc_pool_sz); + + while (!list_empty(&pool->avail_recv_bufs)) { + rdma_dest = + list_entry(pool->avail_recv_bufs.next, + struct rdma_dest, list_ptrs); + if (!rdma_dest->skb) { + if (initial_allocation) + skb = alloc_skb(pool->buffer_sz + 2, + GFP_KERNEL); + else + skb = dev_alloc_skb(pool->buffer_sz + 2); + if (!skb) { + DATA_ERROR("failed to alloc skb\n"); + break; + } + skb_reserve(skb, 2); + skb_put(skb, pool->buffer_sz); + rdma_dest->skb = skb; + rdma_dest->data = skb->data; + rdma_dest->trailer = + (struct viport_trailer *)(rdma_dest->data + + pool->buffer_sz - + sizeof(struct + viport_trailer)); + } + rdma_dest->trailer->connection_hash_and_valid = 0; + + list_del_init(&rdma_dest->list_ptrs); + + data_add_free_buffer(data, index, rdma_dest); + index = NEXT(index, pool->eioc_pool_sz); + } +} + +static void data_send_kick_message(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + DATA_FUNCTION("data_send_kick_message()\n"); + /* stop timer for bundle_timeout */ + if 
(data->kick_timer_on) { + del_timer(&data->kick_timer); + data->kick_timer_on = 0; + } + pool->kick_count = 0; + pool->kick_byte_count = 0; + + /* TODO: keep track of when kick is outstanding, and + * don't reuse until complete + */ + if (vnic_ib_post_send(&data->ib_conn, &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + } +} + +static void data_send_free_recv_buffers(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct ib_send_wr *swr = &data->free_bufs_io.io.swr; + + int bufs_sent = 0; + u64 rdma_addr; + u32 offset; + u32 sz; + unsigned int num_to_send, next_increment; + + DATA_FUNCTION("data_send_free_recv_buffers()\n"); + + for (num_to_send = pool->sz_free_bundle; + num_to_send <= pool->num_free_bufs; + num_to_send += pool->sz_free_bundle) { + /* handle multiple bundles as one when possible. */ + next_increment = num_to_send + pool->sz_free_bundle; + if ((next_increment <= pool->num_free_bufs) + && (pool->next_free_buf + next_increment <= + pool->eioc_pool_sz)) { + continue; + } + + offset = pool->next_free_buf * + sizeof(struct buff_pool_entry); + sz = num_to_send * sizeof(struct buff_pool_entry); + rdma_addr = pool->eioc_rdma_addr + offset; + swr->sg_list->length = sz; + swr->sg_list->addr = pool->buf_pool_dma + offset; + swr->wr.rdma.remote_addr = rdma_addr; + + if (vnic_ib_post_send(&data->ib_conn, + &data->free_bufs_io.io)) { + DATA_ERROR("failed to post send\n"); + viport_failure(data->parent); + break; + } + INC(pool->next_free_buf, num_to_send, pool->eioc_pool_sz); + pool->num_free_bufs -= num_to_send; + pool->num_posted_bufs += num_to_send; + bufs_sent = 1; + } + + if (bufs_sent) { + if (pool->kick_on_free) + data_send_kick_message(data); + } + if (pool->num_posted_bufs == 0) { + DATA_ERROR("%s: unable to allocate receive buffers\n", + config_viport_name(data->parent->config)); + viport_failure(data->parent); + } +} + +void data_connected(struct data *data) +{ + 
DATA_FUNCTION("data_connected()\n"); + data->free_bufs_io.io.swr.wr.rdma.rkey = + data->recv_pool.eioc_rdma_rkey; + data_alloc_buffers(data, 1); + data_send_free_recv_buffers(data); + data->connected = 1; +} + +void data_disconnect(struct data *data) +{ + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct recv_pool *recv_pool = &data->recv_pool; + unsigned int i; + + DATA_FUNCTION("data_disconnect()\n"); + + data->connected = 0; + if (data->kick_timer_on) { + del_timer_sync(&data->kick_timer); + data->kick_timer_on = 0; + } + + for (i = 0; i < xmit_pool->num_xmit_bufs; i++) { + if (xmit_pool->xmit_bufs[i].skb) + dev_kfree_skb(xmit_pool->xmit_bufs[i].skb); + xmit_pool->xmit_bufs[i].skb = NULL; + + } + for (i = 0; i < recv_pool->pool_sz; i++) { + if (data->recv_pool.recv_bufs[i].skb) + dev_kfree_skb(recv_pool->recv_bufs[i].skb); + recv_pool->recv_bufs[i].skb = NULL; + } + vfree(data->local_storage); + if (data->region_data) { + dma_unmap_single(data->parent->config->ibdev->dma_device, + data->region_data_dma, 4, + DMA_BIDIRECTIONAL); + kfree(data->region_data); + } + + if (recv_pool->buf_pool) { + dma_unmap_single(data->parent->config->ibdev->dma_device, + recv_pool->buf_pool_dma, + recv_pool->buf_pool_len, DMA_TO_DEVICE); + kfree(recv_pool->buf_pool); + } + + if (xmit_pool->buf_pool) { + dma_unmap_single(data->parent->config->ibdev->dma_device, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_FROM_DEVICE); + kfree(xmit_pool->buf_pool); + } + + if (xmit_pool->xmit_data) { + dma_unmap_single(data->parent->config->ibdev->dma_device, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + kfree(xmit_pool->xmit_data); + } +} + +void data_cleanup(struct data *data) +{ + init_completion(&data->ib_conn.done); + if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0)) + printk(KERN_DEBUG "data CM DREQ sending failed\n"); + else + wait_for_completion(&data->ib_conn.done); + + ib_destroy_cm_id(data->ib_conn.cm_id); + ib_destroy_qp(data->ib_conn.qp); + 
ib_destroy_cq(data->ib_conn.cq); + ib_dereg_mr(data->mr); + +} + +static int data_alloc_xmit_buffer(struct data *data, struct sk_buff *skb, + struct buff_pool_entry **pp_bpe, + struct rdma_io **pp_rdma_io, + int *last) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + int ret; + + DATA_FUNCTION("data_alloc_xmit_buffer()\n"); + + spin_lock_irqsave(&data->xmit_buf_lock, flags); + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + *last = 0; + *pp_rdma_io = &pool->xmit_bufs[pool->next_xmit_buf]; + *pp_bpe = &pool->buf_pool[pool->next_xmit_pool]; + + if ((*pp_bpe)->valid && pool->next_xmit_buf != + pool->last_comp_buf) { + INC(pool->next_xmit_buf, 1, pool->num_xmit_bufs); + INC(pool->next_xmit_pool, 1, pool->pool_sz); + if (!pool->buf_pool[pool->next_xmit_pool].valid) { + DATA_INFO("just used the last EIOU" + " receive buffer\n"); + *last = 1; + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + data_kickreq_stats(data); + } else if (pool->next_xmit_buf == pool->last_comp_buf) { + DATA_INFO("just used our last xmit buffer\n"); + pool->need_buffers = 1; + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + } + (*pp_rdma_io)->skb = skb; + (*pp_bpe)->valid = 0; + ret = 0; + } else { + data_no_xmitbuf_stats(data); + DATA_ERROR("Out of xmit buffers\n"); + vnic_stop_xmit(data->parent->vnic, + data->parent->parent); + ret = -1; + } + + dma_sync_single_for_device(data->parent->config->ibdev-> + dma_device, pool->buf_pool_dma, + pool->buf_pool_len, DMA_TO_DEVICE); + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); + return ret; +} + +static void data_rdma_packet(struct data *data, struct buff_pool_entry *bpe, + struct rdma_io *rdma_io) +{ + struct ib_send_wr *swr; + struct sk_buff *skb; + dma_addr_t trailer_data_dma; + dma_addr_t skb_data_dma; + struct xmit_pool *xmit_pool = &data->xmit_pool; + struct viport *viport = 
data->parent; + u8 *d; + int len; + int fill_len; + + DATA_FUNCTION("data_rdma_packet()\n"); + swr = &rdma_io->io.swr; + skb = rdma_io->skb; + len = ALIGN(rdma_io->len, VIPORT_TRAILER_ALIGNMENT); + fill_len = len - skb->len; + + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + + d = (u8 *) rdma_io->trailer - fill_len; + trailer_data_dma = rdma_io->trailer_dma - fill_len; + memset(d, 0, fill_len); + + swr->sg_list[0].length = skb->len; + if (skb->len <= min_xmt_skb) { + memcpy(rdma_io->data, skb->data, skb->len); + swr->sg_list[0].lkey = data->mr->lkey; + swr->sg_list[0].addr = rdma_io->data_dma; + dev_kfree_skb_any(skb); + rdma_io->skb = NULL; + } else { + swr->sg_list[0].lkey = data->mr->lkey; + + skb_data_dma = dma_map_single(viport->config->ibdev->dma_device, + skb->data, skb->len, + DMA_TO_DEVICE); + + if (dma_mapping_error(skb_data_dma)) { + DATA_ERROR("skb data dma map error\n"); + goto failure; + } + + rdma_io->skb_data_dma = skb_data_dma; + + swr->sg_list[0].addr = skb_data_dma; + skb_orphan(skb); + } + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + swr->sg_list[1].addr = trailer_data_dma; + swr->sg_list[1].length = fill_len + sizeof(struct viport_trailer); + swr->sg_list[0].lkey = data->mr->lkey; + swr->wr.rdma.remote_addr = be64_to_cpu(bpe->remote_addr); + swr->wr.rdma.remote_addr += data->xmit_pool.buffer_sz; + swr->wr.rdma.remote_addr -= (sizeof(struct viport_trailer) + len); + swr->wr.rdma.rkey = be32_to_cpu(bpe->rkey); + + dma_sync_single_for_device(data->parent->config->ibdev->dma_device, + xmit_pool->buf_pool_dma, + xmit_pool->buf_pool_len, DMA_TO_DEVICE); + + data->xmit_pool.notify_count++; + if (data->xmit_pool.notify_count >= data->xmit_pool.notify_bundle) { + data->xmit_pool.notify_count = 0; + swr->send_flags = IB_SEND_SIGNALED; + } else { + swr->send_flags = 0; + } + 
dma_sync_single_for_device(data->parent->config->ibdev->dma_device, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); + if (vnic_ib_post_send(&data->ib_conn, &rdma_io->io)) { + DATA_ERROR("failed to post send for data RDMA write\n"); + viport_failure(data->parent); + goto failure; + } + + data_xmits_stats(data); +failure: + dma_sync_single_for_device(data->parent->config->ibdev->dma_device, + xmit_pool->xmitdata_dma, + xmit_pool->xmitdata_len, DMA_TO_DEVICE); +} + +static void data_kick_timeout_handler(unsigned long arg) +{ + struct data *data = (struct data *)arg; + + DATA_FUNCTION("data_kick_timeout_handler()\n"); + data->kick_timer_on = 0; + data_send_kick_message(data); +} + +int data_xmit_packet(struct data *data, struct sk_buff *skb) +{ + struct xmit_pool *pool = &data->xmit_pool; + struct rdma_io *rdma_io; + struct buff_pool_entry *bpe; + struct viport_trailer *trailer; + unsigned int sz = skb->len; + int last; + + DATA_FUNCTION("data_xmit_packet()\n"); + if (sz > pool->buffer_sz) { + DATA_ERROR("outbound packet too large, size = %d\n", sz); + return -1; + } + + if (data_alloc_xmit_buffer(data, skb, &bpe, &rdma_io, &last)) { + DATA_ERROR("error in allocating data xmit buffer\n"); + return -1; + } + + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + trailer = rdma_io->trailer; + + memset(trailer, 0, sizeof *trailer); + memcpy(trailer->dest_mac_addr, skb->data, ETH_ALEN); + + if (skb->sk) + trailer->connection_hash_and_valid = 0x40 | + ((be16_to_cpu(inet_sk(skb->sk)->sport) + + be16_to_cpu( inet_sk(skb->sk)->dport)) & 0x3f); + + trailer->connection_hash_and_valid |= CHV_VALID; + + if ((sz > 16) && (*(__be16 *) (skb->data + 12) == + __constant_cpu_to_be16(ETH_P_8021Q))) { + trailer->vlan = *(__be16 *) (skb->data + 14); + memmove(skb->data + 4, skb->data, 12); + skb_pull(skb, 4); + trailer->pkt_flags |= PF_VLAN_INSERT; + } + if (last) + trailer->pkt_flags |= 
PF_KICK; + if (sz < ETH_ZLEN) { + /* EIOU requires all packets to be + * of ethernet minimum packet size. + */ + trailer->data_length = __constant_cpu_to_be16(ETH_ZLEN); + rdma_io->len = ETH_ZLEN; + } else { + trailer->data_length = cpu_to_be16(sz); + rdma_io->len = sz; + } + + if (skb->ip_summed == CHECKSUM_PARTIAL) { + trailer->tx_chksum_flags = TX_CHKSUM_FLAGS_CHECKSUM_V4 + | TX_CHKSUM_FLAGS_IP_CHECKSUM + | TX_CHKSUM_FLAGS_TCP_CHECKSUM + | TX_CHKSUM_FLAGS_UDP_CHECKSUM; + } + + dma_sync_single_for_device(data->parent->config->ibdev->dma_device, + pool->xmitdata_dma, pool->xmitdata_len, + DMA_TO_DEVICE); + + data_rdma_packet(data, bpe, rdma_io); + + if (pool->send_kicks) { + /* EIOC needs kicks to inform it of sent packets */ + pool->kick_count++; + pool->kick_byte_count += sz; + if ((pool->kick_count >= pool->kick_bundle) + || (pool->kick_byte_count >= pool->kick_byte_bundle)) { + data_send_kick_message(data); + } else if (pool->kick_count == 1) { + init_timer(&data->kick_timer); + /* timeout_before_kick is in usec */ + data->kick_timer.expires = + msecs_to_jiffies(be32_to_cpu(data-> + eioc_pool_parms.timeout_before_kick) * 1000) + + jiffies; + data->kick_timer.data = (unsigned long)data; + data->kick_timer.function = data_kick_timeout_handler; + add_timer(&data->kick_timer); + data->kick_timer_on = 1; + } + } + return 0; +} + +void data_check_xmit_buffers(struct data *data) +{ + struct xmit_pool *pool = &data->xmit_pool; + unsigned long flags; + + DATA_FUNCTION("data_check_xmit_buffers()\n"); + spin_lock_irqsave(&data->xmit_buf_lock, flags); + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + if (data->xmit_pool.need_buffers + && pool->buf_pool[pool->next_xmit_pool].valid + && pool->next_xmit_buf != pool->last_comp_buf) { + data->xmit_pool.need_buffers = 0; + vnic_restart_xmit(data->parent->vnic, + data->parent->parent); + DATA_INFO("there are free xmit buffers\n"); + } + 
dma_sync_single_for_device(data->parent->config->ibdev->dma_device, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + spin_unlock_irqrestore(&data->xmit_buf_lock, flags); +} + +static struct sk_buff *data_recv_to_skbuff(struct data *data, + struct rdma_dest *rdma_dest) +{ + struct viport_trailer *trailer; + struct sk_buff *skb = NULL; + int start; + unsigned int len; + u8 rx_chksum_flags; + + DATA_FUNCTION("data_recv_to_skbuff()\n"); + trailer = rdma_dest->trailer; + start = data_offset(data, trailer); + len = data_len(data, trailer); + + if (len <= min_rcv_skb) + skb = dev_alloc_skb(len + VLAN_HLEN + 2); + /* leave room for VLAN header and alignment */ + if (skb) { + skb_reserve(skb, VLAN_HLEN + 2); + memcpy(skb->data, rdma_dest->data + start, len); + skb_put(skb, len); + } else { + skb = rdma_dest->skb; + rdma_dest->skb = NULL; + rdma_dest->trailer = NULL; + rdma_dest->data = NULL; + skb_pull(skb, start); + skb_trim(skb, len); + } + + rx_chksum_flags = trailer->rx_chksum_flags; + DATA_INFO("rx_chksum_flags = %d, LOOP = %c, IP = %c," + " TCP = %c, UDP = %c\n", + rx_chksum_flags, + (rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) ? 'Y' : 'N', + (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED) ? 'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED) ? 'N' : + '-', + (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED) ? 'Y' + : (rx_chksum_flags & RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED) ? 
'N' : + '-'); + + if ((rx_chksum_flags & RX_CHKSUM_FLAGS_LOOPBACK) + || ((rx_chksum_flags & RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED) + && ((rx_chksum_flags & RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED) + || (rx_chksum_flags & + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED)))) + skb->ip_summed = CHECKSUM_UNNECESSARY; + else + skb->ip_summed = CHECKSUM_NONE; + + if (trailer->pkt_flags & PF_VLAN_INSERT) { + u8 *rv; + + rv = skb_push(skb, 4); + memmove(rv, rv + 4, 12); + *(__be16 *) (rv + 12) = __constant_cpu_to_be16(ETH_P_8021Q); + if (trailer->pkt_flags & PF_PVID_OVERRIDDEN) + *(__be16 *) (rv + 14) = trailer->vlan & + __constant_cpu_to_be16(0xF000); + else + *(__be16 *) (rv + 14) = trailer->vlan; + } + + return skb; +} + +static int data_incoming_recv(struct data *data) +{ + struct recv_pool *pool = &data->recv_pool; + struct rdma_dest *rdma_dest; + struct viport_trailer *trailer; + struct buff_pool_entry *bpe; + struct sk_buff *skb; + + DATA_FUNCTION("data_incoming_recv()\n"); + if (pool->next_full_buf == pool->next_free_buf) + return -1; + bpe = &pool->buf_pool[pool->next_full_buf]; + rdma_dest = &pool->recv_bufs[bpe->valid - 1]; + trailer = rdma_dest->trailer; + + if (!trailer + || !(trailer->connection_hash_and_valid & CHV_VALID)) + return -1; + + /* received a packet */ + if (trailer->pkt_flags & PF_KICK) + pool->kick_on_free = 1; + + skb = data_recv_to_skbuff(data, rdma_dest); + + if (skb) { + vnic_recv_packet(data->parent->vnic, + data->parent->parent, skb); + list_add(&rdma_dest->list_ptrs, &pool->avail_recv_bufs); + } + + dma_sync_single_for_cpu(data->parent->config->ibdev->dma_device, + pool->buf_pool_dma, pool->buf_pool_len, + DMA_TO_DEVICE); + + bpe->valid = 0; + dma_sync_single_for_device(data->parent->config->ibdev-> + dma_device, pool->buf_pool_dma, + pool->buf_pool_len, DMA_TO_DEVICE); + + INC(pool->next_full_buf, 1, pool->eioc_pool_sz); + pool->num_posted_bufs--; + data_recvs_stats(data); + return 0; +} + +static void data_received_kick(struct io *io) +{ + 
struct data *data = &io->viport->data; + unsigned long flags; + + DATA_FUNCTION("data_received_kick()\n"); + data_note_kickrcv_time(); + spin_lock_irqsave(&data->recv_ios_lock, flags); + list_add(&io->list_ptrs, &data->recv_ios); + spin_unlock_irqrestore(&data->recv_ios_lock, flags); + data_post_recvs(data); + data_rcvkicks_stats(data); + data_check_xmit_buffers(data); + + while (!data_incoming_recv(data)); + + if (data->connected) { + data_alloc_buffers(data, 0); + data_send_free_recv_buffers(data); + } +} + +static void data_xmit_complete(struct io *io) +{ + struct rdma_io *rdma_io = (struct rdma_io *)io; + struct data *data = &io->viport->data; + struct xmit_pool *pool = &data->xmit_pool; + struct sk_buff *skb; + + DATA_FUNCTION("data_xmit_complete()\n"); + + if (rdma_io->skb) + dma_unmap_single(data->parent->config->ibdev->dma_device, + rdma_io->skb_data_dma, rdma_io->skb->len, + DMA_TO_DEVICE); + + while (pool->last_comp_buf != rdma_io->index) { + INC(pool->last_comp_buf, 1, pool->num_xmit_bufs); + skb = pool->xmit_bufs[pool->last_comp_buf].skb; + if (skb) + dev_kfree_skb_any(skb); + pool->xmit_bufs[pool->last_comp_buf].skb = NULL; + } + + data_check_xmit_buffers(data); +} diff --git a/drivers/infiniband/ulp/vnic/vnic_data.h b/drivers/infiniband/ulp/vnic/vnic_data.h new file mode 100644 index 0000000..379df14 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_data.h @@ -0,0 +1,182 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_DATA_H_INCLUDED +#define VNIC_DATA_H_INCLUDED + +#include + +#ifdef CONFIG_INFINIBAND_VNIC_STATS +#include +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ + +#include "vnic_ib.h" +#include "vnic_control_pkt.h" +#include "vnic_trailer.h" + +struct rdma_dest { + struct list_head list_ptrs; + struct sk_buff *skb; + u8 *data; + struct viport_trailer *trailer; +}; + +struct buff_pool_entry { + __be64 remote_addr; + __be32 rkey; + u32 valid; +}; + +struct recv_pool { + u32 buffer_sz; + u32 pool_sz; + u32 eioc_pool_sz; + u32 eioc_rdma_rkey; + u64 eioc_rdma_addr; + u32 next_full_buf; + u32 next_free_buf; + u32 num_free_bufs; + u32 num_posted_bufs; + u32 sz_free_bundle; + int kick_on_free; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_dest *recv_bufs; + struct list_head avail_recv_bufs; +}; + +struct xmit_pool { + u32 buffer_sz; + u32 pool_sz; + u32 notify_count; + u32 notify_bundle; + u32 next_xmit_buf; + u32 last_comp_buf; + u32 num_xmit_bufs; + u32 next_xmit_pool; + u32 kick_count; + u32 kick_byte_count; + u32 kick_bundle; + u32 kick_byte_bundle; + int need_buffers; + int send_kicks; + uint32_t rdma_rkey; + u64 rdma_addr; + struct buff_pool_entry *buf_pool; + dma_addr_t buf_pool_dma; + int buf_pool_len; + struct rdma_io *xmit_bufs; + u8 *xmit_data; + dma_addr_t xmitdata_dma; + int xmitdata_len; +}; + +struct data { + struct viport *parent; + struct data_config *config; + struct ib_mr *mr; + struct vnic_ib_conn ib_conn; + u8 *local_storage; + struct vnic_recv_pool_config host_pool_parms; + struct vnic_recv_pool_config eioc_pool_parms; + struct recv_pool recv_pool; + struct xmit_pool xmit_pool; + u8 *region_data; + dma_addr_t region_data_dma; + struct rdma_io free_bufs_io; + struct send_io kick_io; + struct list_head recv_ios; + spinlock_t recv_ios_lock; + spinlock_t xmit_buf_lock; + int kick_timer_on; + int connected; + struct timer_list kick_timer; + struct completion done; +#ifdef 
CONFIG_INFINIBAND_VNIC_STATS + struct { + u32 xmit_num; + u32 recv_num; + u32 free_buf_sends; + u32 free_buf_num; + u32 free_buf_min; + u32 kick_recvs; + u32 kick_reqs; + u32 no_xmit_bufs; + cycles_t no_xmit_buf_time; + } statistics; +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ +}; + +int data_init(struct data *data, struct viport *viport, + struct data_config *config, struct ib_pd *pd); + +int data_connect(struct data *data); +void data_connected(struct data *data); +void data_disconnect(struct data *data); + +int data_xmit_packet(struct data *data, struct sk_buff *skb); + +void data_cleanup(struct data *data); + +#define data_is_connected(data) \ + (vnic_ib_conn_connected(&((data)->ib_conn))) +#define data_path_id(data) (data)->config->path_id +#define data_eioc_pool(data) &(data)->eioc_pool_parms +#define data_host_pool(data) &(data)->host_pool_parms +#define data_eioc_pool_min(data) &(data)->config->eioc_min +#define data_host_pool_min(data) &(data)->config->host_min +#define data_eioc_pool_max(data) &(data)->config->eioc_max +#define data_host_pool_max(data) &(data)->config->host_max +#define data_local_pool_addr(data) (data)->xmit_pool.rdma_addr +#define data_local_pool_rkey(data) (data)->xmit_pool.rdma_rkey +#define data_remote_pool_addr(data) &(data)->recv_pool.eioc_rdma_addr +#define data_remote_pool_rkey(data) &(data)->recv_pool.eioc_rdma_rkey + +#define data_max_mtu(data) \ + MAX_PAYLOAD(min((data)->recv_pool.buffer_sz, \ + (data)->xmit_pool.buffer_sz)) - VLAN_ETH_HLEN + +#define data_len(data, trailer) be16_to_cpu(trailer->data_length) +#define data_offset(data, trailer) \ + data->recv_pool.buffer_sz - sizeof(struct viport_trailer) \ + - ALIGN(data_len(data, trailer), VIPORT_TRAILER_ALIGNMENT) \ + + trailer->data_alignment_offset + +/* the following macros manipulate ring buffer indexes. + * the ring buffer size must be a power of 2. 
+ */ +#define ADD(index, increment, size) (((index) + (increment))&((size) - 1)) +#define NEXT(index, size) ADD(index, 1, size) +#define INC(index, increment, size) (index) = ADD(index, increment, size) + +#endif /* VNIC_DATA_H_INCLUDED */ diff --git a/drivers/infiniband/ulp/vnic/vnic_trailer.h b/drivers/infiniband/ulp/vnic/vnic_trailer.h new file mode 100644 index 0000000..dd8a073 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_trailer.h @@ -0,0 +1,103 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_TRAILER_H_INCLUDED +#define VNIC_TRAILER_H_INCLUDED + +/* pkt_flags values */ +enum { + PF_CHASH_VALID = 0x01, + PF_IPSEC_VALID = 0x02, + PF_TCP_SEGMENT = 0x04, + PF_KICK = 0x08, + PF_VLAN_INSERT = 0x10, + PF_PVID_OVERRIDDEN = 0x20, + PF_FCS_INCLUDED = 0x40, + PF_FORCE_ROUTE = 0x80 +}; + +/* tx_chksum_flags values */ +enum { + TX_CHKSUM_FLAGS_CHECKSUM_V4 = 0x01, + TX_CHKSUM_FLAGS_CHECKSUM_V6 = 0x02, + TX_CHKSUM_FLAGS_TCP_CHECKSUM = 0x04, + TX_CHKSUM_FLAGS_UDP_CHECKSUM = 0x08, + TX_CHKSUM_FLAGS_IP_CHECKSUM = 0x10 +}; + +/* rx_chksum_flags values */ +enum { + RX_CHKSUM_FLAGS_TCP_CHECKSUM_FAILED = 0x01, + RX_CHKSUM_FLAGS_UDP_CHECKSUM_FAILED = 0x02, + RX_CHKSUM_FLAGS_IP_CHECKSUM_FAILED = 0x04, + RX_CHKSUM_FLAGS_TCP_CHECKSUM_SUCCEEDED = 0x08, + RX_CHKSUM_FLAGS_UDP_CHECKSUM_SUCCEEDED = 0x10, + RX_CHKSUM_FLAGS_IP_CHECKSUM_SUCCEEDED = 0x20, + RX_CHKSUM_FLAGS_LOOPBACK = 0x40, + RX_CHKSUM_FLAGS_RESERVED = 0x80 +}; + +/* connection_hash_and_valid values */ +enum { + CHV_VALID = 0x80, + CHV_HASH_MASH = 0x7f +}; + +struct viport_trailer { + s8 data_alignment_offset; + u8 rndis_header_length; /* reserved for use by edp */ + __be16 data_length; + u8 pkt_flags; + u8 tx_chksum_flags; + u8 rx_chksum_flags; + u8 ip_sec_flags; + u32 tcp_seq_no; + u32 ip_sec_offload_handle; + u32 ip_sec_next_offload_handle; + u8 dest_mac_addr[6]; + __be16 vlan; + u16 time_stamp; + u8 origin; + u8 connection_hash_and_valid; +}; + +#define VIPORT_TRAILER_ALIGNMENT 32 + +#define BUFFER_SIZE(len) \ + (sizeof(struct viport_trailer) + \ + ALIGN((len), VIPORT_TRAILER_ALIGNMENT)) + +#define MAX_PAYLOAD(len) \ + ALIGN_DOWN((len) - sizeof(struct viport_trailer), \ + VIPORT_TRAILER_ALIGNMENT) + +#endif /* VNIC_TRAILER_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:56:55 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:26:55 +0530 Subject: [openib-general] [PATCH v2 6/11] IB core stack interaction Message-ID: 
<455A268F.26993.612FC99@ramachandra.kuchimanchi.qlogic.com> Adds the files that implement interaction with the core IB stack. Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_ib.c | 691 +++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/vnic_ib.h | 170 ++++++++ 2 files changed, 861 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_ib.c b/drivers/infiniband/ulp/vnic/vnic_ib.c new file mode 100644 index 0000000..71f02cf --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_ib.c @@ -0,0 +1,691 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_sys.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +static int vnic_ib_inited = 0; + +static void vnic_add_one(struct ib_device *device); +static void vnic_remove_one(struct ib_device *device); + +static struct ib_client vnic_client = { + .name = "vnic", + .add = vnic_add_one, + .remove = vnic_remove_one +}; + +static struct ib_sa_client vnic_sa_client; + +static CLASS_DEVICE_ATTR(create_primary, S_IWUSR, NULL, + vnic_create_primary); +static CLASS_DEVICE_ATTR(create_secondary, S_IWUSR, NULL, + vnic_create_secondary); + +static CLASS_DEVICE_ATTR(delete_vnic, S_IWUSR, NULL, vnic_delete); + +static struct vnic_ib_port *vnic_add_port(struct vnic_ib_device *device, + u8 port_num) +{ + struct vnic_ib_port *port; + + port = kzalloc(sizeof *port, GFP_KERNEL); + if (!port) + return NULL; + + init_completion(&port->cdev_info.released); + port->dev = device; + port->port_num = port_num; + + port->cdev_info.class_dev.class = &vnic_class; + port->cdev_info.class_dev.dev = device->dev->dma_device; + snprintf(port->cdev_info.class_dev.class_id, BUS_ID_SIZE, + "vnic-%s-%d", device->dev->name, port_num); + + if (class_device_register(&port->cdev_info.class_dev)) + goto free_port; + + if (class_device_create_file(&port->cdev_info.class_dev, + &class_device_attr_create_primary)) + goto err_class; + if (class_device_create_file(&port->cdev_info.class_dev, + &class_device_attr_create_secondary)) + goto err_class; + + return port; +err_class: + class_device_unregister(&port->cdev_info.class_dev); +free_port: + kfree(port); + + return NULL; +} + +static void vnic_add_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port; + int s, e, p; + + vnic_dev = kmalloc(sizeof *vnic_dev, GFP_KERNEL); + if (!vnic_dev) + return; + + vnic_dev->dev = device; + 
INIT_LIST_HEAD(&vnic_dev->port_list); + + if (device->node_type == RDMA_NODE_IB_SWITCH) { + s = 0; + e = 0; + + } else { + s = 1; + e = device->phys_port_cnt; + + } + + for (p = s; p <= e; p++) { + port = vnic_add_port(vnic_dev, p); + if (port) + list_add_tail(&port->list, &vnic_dev->port_list); + } + + ib_set_client_data(device, &vnic_client, vnic_dev); + +} + +static void vnic_remove_one(struct ib_device *device) +{ + struct vnic_ib_device *vnic_dev; + struct vnic_ib_port *port, *tmp_port; + + vnic_dev = ib_get_client_data(device, &vnic_client); + list_for_each_entry_safe(port, tmp_port, + &vnic_dev->port_list, list) { + class_device_unregister(&port->cdev_info.class_dev); + /* + * wait for sysfs entries to go away, so that no new vnics + * are created + */ + wait_for_completion(&port->cdev_info.released); + kfree(port); + + } + kfree(vnic_dev); +} + +int vnic_ib_init(void) +{ + int ret = -1; + + IB_FUNCTION("vnic_ib_init()\n"); + + /* class has to be registered before + * calling ib_register_client() because, that call + * will trigger vnic_add_port() which will register + * class_device for the port with the parent class + * as vnic_class + */ + ret = class_register(&vnic_class); + if (ret) { + printk(KERN_ERR PFX "couldn't register class" + " infiniband_vnic; error %d", ret); + goto out; + } + + ib_sa_register_client(&vnic_sa_client); + ret = ib_register_client(&vnic_client); + if (ret) { + printk(KERN_ERR PFX "couldn't register IB client;" + " error %d", ret); + goto err_ib_reg; + } + + interface_cdev.class_dev.class = &vnic_class; + snprintf(interface_cdev.class_dev.class_id, + BUS_ID_SIZE, "interfaces"); + init_completion(&interface_cdev.released); + ret = class_device_register(&interface_cdev.class_dev); + if (ret) { + printk(KERN_ERR PFX "couldn't register class interfaces;" + " error %d", ret); + goto err_class_dev; + } + ret = class_device_create_file(&interface_cdev.class_dev, + &class_device_attr_delete_vnic); + if (ret) { + printk(KERN_ERR PFX 
"couldn't create class file" + " 'delete_vnic'; error %d", ret); + goto err_class_file; + } + + vnic_ib_inited = 1; + + return ret; +err_class_file: + class_device_unregister(&interface_cdev.class_dev); +err_class_dev: + ib_unregister_client(&vnic_client); +err_ib_reg: + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +out: + return ret; +} + +void vnic_ib_cleanup(void) +{ + IB_FUNCTION("vnic_ib_cleanup()\n"); + + if (!vnic_ib_inited) + return; + + class_device_unregister(&interface_cdev.class_dev); + wait_for_completion(&interface_cdev.released); + + ib_unregister_client(&vnic_client); + ib_sa_unregister_client(&vnic_sa_client); + class_unregister(&vnic_class); +} + +static void vnic_path_rec_completion(int status, + struct ib_sa_path_rec *pathrec, + void *context) +{ + struct vnic_ib_path_info *p = context; + p->status = status; + if (!status) + p->path = *pathrec; + + complete(&p->done); +} + +int vnic_ib_get_path(struct netpath *netpath, struct vnic * vnic) +{ + struct viport_config *config = netpath->viport->config; + int ret = 0; + + init_completion(&config->path_info.done); + IB_INFO("Using SA path rec get time out value of %d\n", + config->sa_path_rec_get_timeout); + config->path_info.path_query_id = + ib_sa_path_rec_get(&vnic_sa_client, + config->ibdev, + config->port, + &config->path_info.path, + IB_SA_PATH_REC_DGID | + IB_SA_PATH_REC_SGID | + IB_SA_PATH_REC_NUMB_PATH | + IB_SA_PATH_REC_PKEY, + config->sa_path_rec_get_timeout, + GFP_KERNEL, + vnic_path_rec_completion, + &config->path_info, + &config->path_info.path_query); + + if (config->path_info.path_query_id < 0) { + IB_ERROR("SA path record query failed; error %d\n", + config->path_info.path_query_id); + ret= config->path_info.path_query_id; + goto out; + } + + wait_for_completion(&config->path_info.done); + + if (config->path_info.status < 0) { + printk(KERN_WARNING PFX "path record query failed for dgid " + "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x\n", + 
(int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[0]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[2]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[4]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[6]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[8]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[10]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[12]), + (int)be16_to_cpu(*(__be16 *) &config->path_info.path. + dgid.raw[14])); + + if (config->path_info.status == -ETIMEDOUT) + printk(KERN_WARNING PFX + "reason: path record query timed out\n"); + else if (config->path_info.status == -EIO) + printk(KERN_WARNING PFX + "reason: error in sending path record query\n"); + else + printk(KERN_WARNING PFX "reason: error %d in sending" + " path record query\n", + config->path_info.status); + + netpath_timer(netpath, vnic->config->no_path_timeout); + ret = config->path_info.status; + } +out: + return ret; +} + +static void ib_qp_event(struct ib_event *event, void *context) +{ + IB_ERROR("QP event %d\n", event->event); +} + +static void vnic_ib_completion(struct ib_cq *cq, void *ptr) +{ + struct ib_wc wc; + struct io *io; + struct vnic_ib_conn *ib_conn = ptr; + cycles_t comp_time; + u32 comp_num = 0; + + vnic_ib_note_comptime_stats(&comp_time); + vnic_ib_callback_stats(ib_conn); + + ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + while (ib_poll_cq(cq, 1, &wc) > 0) { + io = (struct io *)(wc.wr_id); + vnic_ib_comp_stats(ib_conn, &comp_num); + if (wc.status) { +#if 0 + IB_ERROR("completion error wc.status %d" + " wc.opcode %d vendor err 0x%x\n", + wc.status, wc.opcode, wc.vendor_err); +#endif + } else if (io) { + vnic_ib_io_stats(io, ib_conn, comp_time); + if (io->routine) + (*io->routine) (io); + } + } + vnic_ib_maxio_stats(ib_conn, comp_num); +} + +static int vnic_ib_mod_qp_to_rts(struct ib_cm_id * cm_id, + struct vnic_ib_conn * ib_conn) +{ + 
int attr_mask = 0; + int ret; + struct ib_qp_attr *qp_attr = NULL; + + qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL); + if (!qp_attr) + return -ENOMEM; + + qp_attr->qp_state = IB_QPS_RTR; + + if ((ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask))) + goto out; + + if((ret = ib_modify_qp(ib_conn->qp, qp_attr, attr_mask))) + goto out; + + IB_INFO("QP RTR\n"); + + qp_attr->qp_state = IB_QPS_RTS; + + if((ret = ib_cm_init_qp_attr(cm_id, qp_attr, &attr_mask))) + goto out; + + if((ret=ib_modify_qp(ib_conn->qp, qp_attr, attr_mask))) + goto out; + + IB_INFO("QP RTS\n"); + + if((ret = ib_send_cm_rtu(cm_id, NULL, 0))) + goto out; +out: + kfree(qp_attr); + return ret; +} + +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event) +{ + struct vnic_ib_conn *ib_conn = cm_id->context; + struct viport *viport = ib_conn->viport; + int err = 0; + int disconn = 0; + + switch (event->event) { + case IB_CM_REQ_ERROR: + IB_ERROR("sending CM REQ failed\n"); + disconn = 1; + break; + case IB_CM_REP_RECEIVED: + IB_INFO("CM REP recvd\n"); + if (vnic_ib_mod_qp_to_rts(cm_id, ib_conn)) + err = 1; + else { + ib_conn->state = IB_CONN_CONNECTED; + vnic_ib_connected_time_stats(ib_conn); + IB_INFO("RTU SENT\n"); + } + break; + case IB_CM_REJ_RECEIVED: + printk(KERN_ERR PFX "CM rejected control connection \n"); + if (event->param.rej_rcvd.reason == + IB_CM_REJ_INVALID_SERVICE_ID) + printk(KERN_ERR "reason: invalid service ID. 
" + "IOCGUID value specified may be incorrect\n"); + else + printk(KERN_ERR "reason code : 0x%x\n", + event->param.rej_rcvd.reason); + + disconn = 1; + break; + case IB_CM_MRA_RECEIVED: + IB_INFO("CM MRA received\n"); + break; + + case IB_CM_DREP_RECEIVED: + IB_INFO("CM DREP recvd\n"); + ib_conn->state = IB_CONN_DISCONNECTED; + break; + + case IB_CM_TIMEWAIT_EXIT: + IB_ERROR("CM timewait exit\n"); + err = 1; + break; + + default: + IB_INFO("unhandled CM event %d\n", event->event); + break; + + } + + if (err) { + ib_conn->state = IB_CONN_DISCONNECTED; + viport_failure(viport); + } + + if (disconn) { + ib_conn->state = IB_CONN_DISCONNECTED; + viport_disconnect(viport); + + } + complete(&ib_conn->done); + return 0; +} + + +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn) +{ + struct ib_cm_req_param *req = NULL; + struct viport *viport; + int ret = -1; + + if (!vnic_ib_conn_initted(ib_conn)) { + IB_ERROR("IB Connection out of state for CM connect (%d)\n", + ib_conn->state); + return -EINVAL; + } + + vnic_ib_conntime_stats(ib_conn); + req = kzalloc(sizeof *req, GFP_KERNEL); + if (!req) + return -ENOMEM; + + viport = ib_conn->viport; + + req->primary_path = &viport->config->path_info.path; + req->alternate_path = NULL; + req->qp_num = ib_conn->qp->qp_num; + req->qp_type = ib_conn->qp->qp_type; + req->service_id = ib_conn->ib_config->service_id; + req->private_data = &ib_conn->ib_config->conn_data; + req->private_data_len = sizeof(struct vnic_connection_data); + req->flow_control = 1; + + get_random_bytes(&req->starting_psn, 4); + req->starting_psn &= 0xffffff; + + /* + * Both responder_resources and initiator_depth are set to zero + * as we do not need RDMA read. + * + * They also must be set to zero, otherwise data connections + * are rejected by VEx. 
+ */ + req->responder_resources = 0; + req->initiator_depth = 0; + req->remote_cm_response_timeout = 20; + req->local_cm_response_timeout = 20; + req->retry_count = ib_conn->ib_config->retry_count; + req->rnr_retry_count = ib_conn->ib_config->rnr_retry_count; + req->max_cm_retries = 15; + + ib_conn->state = IB_CONN_CONNECTING; + + ret = ib_send_cm_req(ib_conn->cm_id, req); + + kfree(req); + + if (ret) { + IB_ERROR("CM REQ sending failed; error %d \n", ret); + ib_conn->state = IB_CONN_DISCONNECTED; + } + + return ret; +} + +static int vnic_ib_init_qp(struct vnic_ib_conn * ib_conn, + struct vnic_ib_config *config, + struct ib_pd *pd, + struct viport_config * viport_config) +{ + struct ib_qp_init_attr *init_attr; + struct ib_qp_attr *attr; + int ret; + + init_attr = kzalloc(sizeof *init_attr, GFP_KERNEL); + if (!init_attr) + return -ENOMEM; + + init_attr->event_handler = ib_qp_event; + init_attr->cap.max_send_wr = config->num_sends; + init_attr->cap.max_recv_wr = config->num_recvs; + init_attr->cap.max_recv_sge = config->recv_scatter; + init_attr->cap.max_send_sge = config->send_gather; + init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + init_attr->qp_type = IB_QPT_RC; + init_attr->send_cq = ib_conn->cq; + init_attr->recv_cq = ib_conn->cq; + + ib_conn->qp = ib_create_qp(pd, init_attr); + + if (IS_ERR(ib_conn->qp)) { + ret = -1; + IB_ERROR("could not create QP\n"); + goto free_init_attr; + } + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) { + ret = -ENOMEM; + goto destroy_qp; + } + + ret = ib_find_cached_pkey(viport_config->ibdev, + viport_config->port, + be16_to_cpu(viport_config->path_info.path. 
+ pkey), + &attr->pkey_index); + if (ret) { + printk(KERN_WARNING PFX "ib_find_cached_pkey() failed; " + "error %d\n", ret); + goto freeattr; + } + + attr->qp_state = IB_QPS_INIT; + attr->qp_access_flags = IB_ACCESS_REMOTE_WRITE; + attr->port_num = viport_config->port; + + ret = ib_modify_qp(ib_conn->qp, attr, + IB_QP_STATE | + IB_QP_PKEY_INDEX | + IB_QP_ACCESS_FLAGS | IB_QP_PORT); + if (ret) { + printk(KERN_WARNING PFX "could not modify QP; error %d\n", + ret); + goto freeattr; + } + + kfree(attr); + kfree(init_attr); + return ret; + +freeattr: + kfree(attr); +destroy_qp: + ib_destroy_qp(ib_conn->qp); +free_init_attr: + kfree(init_attr); + return ret; +} + +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config) +{ + struct viport_config *viport_config = viport->config; + int ret = -1; + unsigned int cq_size = config->num_sends + config->num_recvs; + + + if (!vnic_ib_conn_uninitted(ib_conn)) { + IB_ERROR("IB Connection out of state for init (%d)\n", + ib_conn->state); + return -EINVAL; + } + + ib_conn->cq = ib_create_cq(viport_config->ibdev, vnic_ib_completion, + NULL, ib_conn, cq_size); + if (IS_ERR(ib_conn->cq)) { + IB_ERROR("could not create CQ\n"); + goto out; + } + + ib_req_notify_cq(ib_conn->cq, IB_CQ_NEXT_COMP); + + ret = vnic_ib_init_qp(ib_conn, config, pd, viport_config); + + if (ret) + goto destroy_cq; + + spin_lock_init(&ib_conn->conn_lock); + ib_conn->state = IB_CONN_INITTED; + + return ret; + +destroy_cq: + ib_destroy_cq(ib_conn->cq); +out: + return ret; +} + +int vnic_ib_post_recv(struct vnic_ib_conn * ib_conn, struct io * io) +{ + cycles_t post_time; + struct ib_recv_wr *bad_wr; + int ret = -EINVAL; + unsigned long flags; + + IB_FUNCTION("vnic_ib_post_recv()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + + if (!vnic_ib_conn_initted(ib_conn) && + !vnic_ib_conn_connected(ib_conn)) + goto out; + + vnic_ib_pre_rcvpost_stats(ib_conn, io, &post_time); + io->type = RECV;
+ ret = ib_post_recv(ib_conn->qp, &io->rwr, &bad_wr); + if (ret) { + IB_ERROR("error in posting rcv wr; error %d\n", ret); + goto out; + } + + vnic_ib_post_rcvpost_stats(ib_conn, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, flags); + return ret; + +} + +int vnic_ib_post_send(struct vnic_ib_conn * ib_conn, struct io * io) +{ + cycles_t post_time; + unsigned long flags; + struct ib_send_wr *bad_wr; + int ret = -1; + + IB_FUNCTION("vnic_ib_post_send()\n"); + + spin_lock_irqsave(&ib_conn->conn_lock, flags); + if (!vnic_ib_conn_connected(ib_conn)) { + IB_ERROR("IB Connection out of state for" + " posting sends (%d)\n", ib_conn->state); + goto out; + } + + vnic_ib_pre_sendpost_stats(io, &post_time); + if (io->swr.opcode == IB_WR_RDMA_WRITE) + io->type = RDMA; + else + io->type = SEND; + + ret = ib_post_send(ib_conn->qp, &io->swr, &bad_wr); + if (ret) { + IB_ERROR("error in posting send wr; error %d\n", ret); + goto out; + } + + vnic_ib_post_sendpost_stats(ib_conn, io, post_time); +out: + spin_unlock_irqrestore(&ib_conn->conn_lock, flags); + return ret; +} diff --git a/drivers/infiniband/ulp/vnic/vnic_ib.h b/drivers/infiniband/ulp/vnic/vnic_ib.h new file mode 100644 index 0000000..f009876 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_ib.h @@ -0,0 +1,170 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_IB_H_INCLUDED +#define VNIC_IB_H_INCLUDED + +#include +#include +#include +#include +#include +#include + +#include "vnic_sys.h" +#include "vnic_netpath.h" +#define PFX "ib_vnic: " + +struct io; +typedef void (comp_routine_t) (struct io * io); + +enum vnic_ib_conn_state { + IB_CONN_UNINITTED = 0, + IB_CONN_INITTED = 1, + IB_CONN_CONNECTING = 2, + IB_CONN_CONNECTED = 3, + IB_CONN_DISCONNECTED = 4 +}; + +struct vnic_ib_conn { + struct viport *viport; + struct vnic_ib_config *ib_config; + spinlock_t conn_lock; + enum vnic_ib_conn_state state; + struct ib_qp *qp; + struct ib_cq *cq; + struct ib_cm_id *cm_id; + struct completion done; +#ifdef CONFIG_INFINIBAND_VNIC_STATS + struct { + cycles_t connection_time; + cycles_t rdma_post_time; + u32 rdma_post_ios; + cycles_t rdma_comp_time; + u32 rdma_comp_ios; + cycles_t send_post_time; + u32 send_post_ios; + cycles_t send_comp_time; + u32 send_comp_ios; + cycles_t recv_post_time; + u32 recv_post_ios; + cycles_t recv_comp_time; + u32 recv_comp_ios; + u32 num_ios; + u32 num_callbacks; + u32 max_ios; + } statistics; +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ +}; + +struct vnic_ib_path_info { + struct ib_sa_path_rec path; + struct ib_sa_query *path_query; + int path_query_id; + int status; + struct 
completion done; +}; + +struct vnic_ib_device { + struct ib_device *dev; + struct list_head port_list; +}; + +struct vnic_ib_port { + struct vnic_ib_device *dev; + u8 port_num; + struct class_dev_info cdev_info; + struct list_head list; +}; + +struct io { + struct list_head list_ptrs; + struct viport *viport; + comp_routine_t *routine; + struct ib_recv_wr rwr; + struct ib_send_wr swr; +#ifdef CONFIG_INFINIBAND_VNIC_STATS + cycles_t time; +#endif /* CONFIG_INFINIBAND_VNIC_STATS */ + enum {RECV, RDMA, SEND} type; +}; + +struct rdma_io { + struct io io; + struct ib_sge list[2]; + u16 index; + u16 len; + u8 *data; + dma_addr_t data_dma; + struct sk_buff *skb; + dma_addr_t skb_data_dma; + struct viport_trailer *trailer; + dma_addr_t trailer_dma; +}; + +struct send_io { + struct io io; + struct ib_sge list; + u8 *virtual_addr; +}; + +struct recv_io { + struct io io; + struct ib_sge list; + u8 *virtual_addr; +}; + +int vnic_ib_init(void); +void vnic_ib_cleanup(void); + +struct vnic; +int vnic_ib_get_path(struct netpath *netpath, struct vnic * vnic); +int vnic_ib_conn_init(struct vnic_ib_conn *ib_conn, struct viport *viport, + struct ib_pd *pd, struct vnic_ib_config *config); + +int vnic_ib_post_recv(struct vnic_ib_conn *ib_conn, struct io *io); +int vnic_ib_post_send(struct vnic_ib_conn *ib_conn, struct io *io); +int vnic_ib_cm_connect(struct vnic_ib_conn *ib_conn); +int vnic_ib_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event); + +#define vnic_ib_conn_uninitted(ib_conn) \ + ((ib_conn)->state == IB_CONN_UNINITTED) +#define vnic_ib_conn_initted(ib_conn) \ + ((ib_conn)->state == IB_CONN_INITTED) +#define vnic_ib_conn_connecting(ib_conn) \ + ((ib_conn)->state == IB_CONN_CONNECTING) +#define vnic_ib_conn_connected(ib_conn) \ + ((ib_conn)->state == IB_CONN_CONNECTED) +#define vnic_ib_conn_disconnected(ib_conn) \ + ((ib_conn)->state == IB_CONN_DISCONNECTED) + +#endif /* VNIC_IB_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:57:21 2006 From: 
ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:27:21 +0530 Subject: [openib-general] [PATCH v2 7/11] Handling of various configurable parameters of the driver Message-ID: <455A26A9.3205.6136313@ramachandra.kuchimanchi.qlogic.com> Adds the files that handle the configurable parameters of the VNIC driver: configuration of the virtual NIC, of the control and data connections to the VEx, and of the general IB connection parameters. Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_config.c | 348 +++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/vnic_config.h | 213 ++++++++++++++++++ 2 files changed, 561 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_config.c b/drivers/infiniband/ulp/vnic/vnic_config.c new file mode 100644 index 0000000..4cca951 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_config.c @@ -0,0 +1,348 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_trailer.h" + +#define SST_AGN 0x10ULL +#define SST_OUI 0x00066AULL + +enum { + CONTROL_PATH_ID = 0x0, + DATA_PATH_ID = 0x1 +}; + +#define IOC_NUMBER(GUID) (((GUID) >> 32) & 0xFF) + +static u16 max_mtu = MAX_MTU; + +static u32 default_no_path_timeout = DEFAULT_NO_PATH_TIMEOUT; +static u32 sa_path_rec_get_timeout = SA_PATH_REC_GET_TIMEOUT; + +static u32 default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; +static u32 default_primary_switch_timeout = DEFAULT_PRIMARY_SWITCH_TIMEOUT; +static int default_prefer_primary = DEFAULT_PREFER_PRIMARY; + +static int use_rx_csum = VNIC_USE_RX_CSUM; +static int use_tx_csum = VNIC_USE_TX_CSUM; + +module_param(max_mtu, ushort, 0444); +MODULE_PARM_DESC(max_mtu, "Maximum MTU size (1500-9500). Default is 9500"); + +module_param(default_prefer_primary, bool, 0444); +MODULE_PARM_DESC(default_prefer_primary, "Determines if primary path is" + " preferred (1) or not (0). Defaults to 0"); +module_param(use_rx_csum, bool, 0444); +MODULE_PARM_DESC(use_rx_csum, "Determines if RX checksum is done on VEx (1)" + " or not (0). Defaults to 1"); +module_param(use_tx_csum, bool, 0444); +MODULE_PARM_DESC(use_tx_csum, "Determines if TX checksum is done on VEx (1)" + " or not (0). 
Defaults to 1"); +module_param(default_no_path_timeout, uint, 0444); +MODULE_PARM_DESC(default_no_path_timeout, "Time to wait in milliseconds" + " before reconnecting to VEx after connection loss"); +module_param(default_primary_reconnect_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_reconnect_timeout, "Time to wait in" + " milliseconds before reconnecting the" + " primary path to VEx"); +module_param(default_primary_switch_timeout, uint, 0444); +MODULE_PARM_DESC(default_primary_switch_timeout, "Time to wait before" + " switching back to primary path if" + " primary path is preferred"); +module_param(sa_path_rec_get_timeout, uint, 0444); +MODULE_PARM_DESC(sa_path_rec_get_timeout, "Time out value in milliseconds" + " for SA path record get queries"); + +static void config_control_defaults(struct control_config *control_config, + struct path_param *params) +{ + int len; + char *dot; + u64 sid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (CONTROL_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + control_config->ib_config.service_id = cpu_to_be64(sid); + control_config->ib_config.conn_data.path_id = 0; + control_config->ib_config.conn_data.vnic_instance = params->instance; + control_config->ib_config.conn_data.path_num = 0; + dot = strchr(init_utsname()->nodename, '.'); + + if (dot) + len = dot - init_utsname()->nodename; + else + len = strlen(init_utsname()->nodename); + + if (len > VNIC_MAX_NODENAME_LEN) + len = VNIC_MAX_NODENAME_LEN; + + memcpy(control_config->ib_config.conn_data.nodename, + init_utsname()->nodename, len); + + control_config->ib_config.retry_count = RETRY_COUNT; + control_config->ib_config.rnr_retry_count = RETRY_COUNT; + control_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* These values are not configurable*/ + control_config->ib_config.num_recvs = 5; + control_config->ib_config.num_sends = 1; + control_config->ib_config.recv_scatter = 1; + control_config->ib_config.send_gather = 1; + + control_config->num_recvs 
= control_config->ib_config.num_recvs; + + control_config->vnic_instance = params->instance; + control_config->max_address_entries = MAX_ADDRESS_ENTRIES; + control_config->min_address_entries = MIN_ADDRESS_ENTRIES; + control_config->req_retry_count = CONTROL_REQ_RETRY_COUNT; + control_config->rsp_timeout = msecs_to_jiffies(CONTROL_RSP_TIMEOUT); +} + +static void config_data_defaults(struct data_config *data_config, + struct path_param *params) +{ + u64 sid; + + sid = (SST_AGN << 56) | (SST_OUI << 32) | (DATA_PATH_ID << 8) + | IOC_NUMBER(be64_to_cpu(params->ioc_guid)); + + data_config->ib_config.service_id = cpu_to_be64(sid); + data_config->ib_config.conn_data.path_id = jiffies; /* random */ + data_config->ib_config.conn_data.vnic_instance = params->instance; + data_config->ib_config.conn_data.path_num = 0; + + data_config->ib_config.retry_count = RETRY_COUNT; + data_config->ib_config.rnr_retry_count = RETRY_COUNT; + data_config->ib_config.min_rnr_timer = MIN_RNR_TIMER; + + /* + * NOTE: the num_recvs size assumes that the EIOC could + * RDMA enough packets to fill all of the host recv + * pool entries, plus send a kick message after each + * packet, plus RDMA new buffers for the size of + * the EIOC recv buffer pool, plus send kick messages + * after each min_host_update_sz of new buffers all + * before the host can even pull off the first completed + * receive off the completion queue, and repost the + * receive. NOT LIKELY! 
+ */ + data_config->ib_config.num_recvs = HOST_RECV_POOL_ENTRIES + + (MAX_EIOC_POOL_SZ / MIN_HOST_UPDATE_SZ); + + data_config->ib_config.num_sends = (2 * NOTIFY_BUNDLE_SZ) + + (HOST_RECV_POOL_ENTRIES / MIN_EIOC_UPDATE_SZ) + 1; + + data_config->ib_config.recv_scatter = 1; /* not configurable */ + data_config->ib_config.send_gather = 2; /* not configurable */ + + data_config->num_recvs = data_config->ib_config.num_recvs; + data_config->path_id = data_config->ib_config.conn_data.path_id; + + + data_config->host_recv_pool_entries = HOST_RECV_POOL_ENTRIES; + + data_config->host_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->host_max.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + max_mtu)); + data_config->eioc_min.size_recv_pool_entry = + cpu_to_be32(BUFFER_SIZE(VLAN_ETH_HLEN + MIN_MTU)); + data_config->eioc_max.size_recv_pool_entry = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_HOST_POOL_SZ); + data_config->host_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + data_config->eioc_min.num_recv_pool_entries = + __constant_cpu_to_be32(MIN_EIOC_POOL_SZ); + data_config->eioc_max.num_recv_pool_entries = + __constant_cpu_to_be32(MAX_EIOC_POOL_SZ); + + data_config->host_min.timeout_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_TIMEOUT); + data_config->host_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_TIMEOUT); + data_config->eioc_min.timeout_before_kick = 0; + data_config->eioc_max.timeout_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_ENTRIES); + data_config->host_max.num_recv_pool_entries_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_ENTRIES); + data_config->eioc_min.num_recv_pool_entries_before_kick = 0; + data_config->eioc_max.num_recv_pool_entries_before_kick = + 
__constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MIN_HOST_KICK_BYTES); + data_config->host_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_HOST_KICK_BYTES); + data_config->eioc_min.num_recv_pool_bytes_before_kick = 0; + data_config->eioc_max.num_recv_pool_bytes_before_kick = + __constant_cpu_to_be32(MAX_PARAM_VALUE); + + data_config->host_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_HOST_UPDATE_SZ); + data_config->host_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_HOST_UPDATE_SZ); + data_config->eioc_min.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MIN_EIOC_UPDATE_SZ); + data_config->eioc_max.free_recv_pool_entries_per_update = + __constant_cpu_to_be32(MAX_EIOC_UPDATE_SZ); + + data_config->notify_bundle = NOTIFY_BUNDLE_SZ; +} + +static void config_path_info_defaults(struct viport_config *config, + struct path_param *params) +{ + int i; + ib_get_cached_gid(config->ibdev, config->port, 0, + &config->path_info.path.sgid); + for (i = 0; i < 16; i++) { + config->path_info.path.dgid.raw[i] = params->dgid[i]; + } + config->path_info.path.pkey = params->pkey; + config->path_info.path.numb_path = 1; + config->sa_path_rec_get_timeout = sa_path_rec_get_timeout; + +} + +static void config_viport_defaults(struct viport_config *config, + struct path_param *params) +{ + config->ibdev = params->ibdev; + config->port = params->port; + config->ioc_guid = params->ioc_guid; + config->stats_interval = msecs_to_jiffies(VIPORT_STATS_INTERVAL); + config->hb_interval = msecs_to_jiffies(VIPORT_HEARTBEAT_INTERVAL); + config->hb_timeout = VIPORT_HEARTBEAT_TIMEOUT * 1000; + /*hb_timeout needs to be in usec*/ + config_path_info_defaults(config, params); + + config_control_defaults(&config->control_config, params); + config_data_defaults(&config->data_config, params); +} + +static void config_vnic_defaults(struct vnic_config *config) +{ + 
config->no_path_timeout = msecs_to_jiffies(default_no_path_timeout); + config->primary_connect_timeout = + msecs_to_jiffies(DEFAULT_PRIMARY_CONNECT_TIMEOUT); + config->primary_reconnect_timeout = + msecs_to_jiffies(default_primary_reconnect_timeout); + config->primary_switch_timeout = + msecs_to_jiffies(default_primary_switch_timeout); + config->prefer_primary = default_prefer_primary; + config->use_rx_csum = use_rx_csum; + config->use_tx_csum = use_tx_csum; +} + +struct viport_config *config_alloc_viport(struct path_param *params) +{ + struct viport_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("could not allocate memory for" + " struct viport_config\n"); + return NULL; + } + + config_viport_defaults(config, params); + + return config; +} + +struct vnic_config *config_alloc_vnic(void) +{ + struct vnic_config *config; + + config = kzalloc(sizeof *config, GFP_KERNEL); + if (!config) { + CONFIG_ERROR("couldn't allocate memory for" + " struct vnic_config\n"); + + return NULL; + } + + config_vnic_defaults(config); + return config; +} + +char *config_viport_name(struct viport_config *config) +{ + /* function only called by one thread, can return a static string */ + static char str[64]; + + sprintf(str, "GUID %llx instance %d", + be64_to_cpu(config->ioc_guid), + config->control_config.vnic_instance); + return str; +} + +int config_start(void) +{ + max_mtu = min_t(u16, max_mtu, MAX_MTU); + max_mtu = max_t(u16, max_mtu, MIN_MTU); + + sa_path_rec_get_timeout = min_t(u32, sa_path_rec_get_timeout, + MAX_SA_TIMEOUT); + sa_path_rec_get_timeout = max_t(u32, sa_path_rec_get_timeout, + MIN_SA_TIMEOUT); + + if (!default_no_path_timeout) + default_no_path_timeout = DEFAULT_NO_PATH_TIMEOUT; + + if (!default_primary_reconnect_timeout) + default_primary_reconnect_timeout = + DEFAULT_PRIMARY_RECONNECT_TIMEOUT; + + if (!default_primary_switch_timeout) + default_primary_switch_timeout = + DEFAULT_PRIMARY_SWITCH_TIMEOUT; + + return 0; + 
+} diff --git a/drivers/infiniband/ulp/vnic/vnic_config.h b/drivers/infiniband/ulp/vnic/vnic_config.h new file mode 100644 index 0000000..c5e63a7 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_config.h @@ -0,0 +1,213 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_CONFIG_H_INCLUDED +#define VNIC_CONFIG_H_INCLUDED + +#include +#include +#include + +#include "vnic_control.h" +#include "vnic_ib.h" + +enum { + VNIC_CLASS_SUBCLASS = 0x2000066A, + VNIC_PROTOCOL = 0, + VNIC_PROT_VERSION = 1 +}; + +enum { + MIN_MTU = 1500, /* minimum negotiated MTU size */ + MAX_MTU = 9500 /* jumbo frame */ +}; + +/* + * TODO: tune the pool parameter values + */ +enum { + MIN_ADDRESS_ENTRIES = 16, + MAX_ADDRESS_ENTRIES = 64 +}; + +enum { + HOST_RECV_POOL_ENTRIES = 512, + MIN_HOST_POOL_SZ = 64, + MIN_EIOC_POOL_SZ = 64, + MAX_EIOC_POOL_SZ = 256, + MIN_HOST_UPDATE_SZ = 8, + MAX_HOST_UPDATE_SZ = 32, + MIN_EIOC_UPDATE_SZ = 8, + MAX_EIOC_UPDATE_SZ = 32, + NOTIFY_BUNDLE_SZ = 32 +}; + +enum { + MIN_HOST_KICK_TIMEOUT = 10, /* in usec */ + MAX_HOST_KICK_TIMEOUT = 100 /* in usec */ +}; + +enum { + MIN_HOST_KICK_ENTRIES = 1, + MAX_HOST_KICK_ENTRIES = 128 +}; + +enum { + MIN_HOST_KICK_BYTES = 0, + MAX_HOST_KICK_BYTES = 5000 +}; + +enum { + DEFAULT_NO_PATH_TIMEOUT = 10000, + DEFAULT_PRIMARY_CONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_RECONNECT_TIMEOUT = 10000, + DEFAULT_PRIMARY_SWITCH_TIMEOUT = 10000 +}; + +enum { + VIPORT_STATS_INTERVAL = 500, /* .5 sec */ + VIPORT_HEARTBEAT_INTERVAL = 1000, /* 1 second */ + VIPORT_HEARTBEAT_TIMEOUT = 64000 /* 64 sec */ +}; + +enum { + CONTROL_RSP_TIMEOUT = 1000 /* 1 sec */ +}; + +/* infiniband connection parameters */ +enum { + RETRY_COUNT = 3, + MIN_RNR_TIMER = 22, /* 20 ms */ + DEFAULT_PKEY = 0 /* pkey table index */ +}; + +enum { + SA_PATH_REC_GET_TIMEOUT = 1000, /* 1000 ms */ + MIN_SA_TIMEOUT = 100, /* 100 ms */ + MAX_SA_TIMEOUT = 20000 /* 20s */ +}; + +#define MAX_PARAM_VALUE 0x40000000 +#define VNIC_USE_RX_CSUM 1 +#define VNIC_USE_TX_CSUM 1 +#define DEFAULT_PREFER_PRIMARY 0 +#define CONTROL_REQ_RETRY_COUNT 4 + +struct path_param { + __be64 ioc_guid; + u8 port; + u8 instance; + struct ib_device *ibdev; + struct vnic_ib_port *ibport; + char name[IFNAMSIZ]; + u8 dgid[16]; + __be16 pkey; + int rx_csum; + 
int tx_csum; + int heartbeat; +}; + +struct vnic_ib_config { + __be64 service_id; + struct vnic_connection_data conn_data; + u32 retry_count; + u32 rnr_retry_count; + u8 min_rnr_timer; + u32 num_sends; + u32 num_recvs; + u32 recv_scatter; /* 1 */ + u32 send_gather; /* 1 or 2 */ +}; + +struct control_config { + struct vnic_ib_config ib_config; + u32 num_recvs; + u8 vnic_instance; + u16 max_address_entries; + u16 min_address_entries; + u32 rsp_timeout; + u8 req_retry_count; +}; + +struct data_config { + struct vnic_ib_config ib_config; + u64 path_id; + u32 num_recvs; + u32 host_recv_pool_entries; + struct vnic_recv_pool_config host_min; + struct vnic_recv_pool_config host_max; + struct vnic_recv_pool_config eioc_min; + struct vnic_recv_pool_config eioc_max; + u32 notify_bundle; +}; + +struct viport_config { + struct viport *viport; + struct control_config control_config; + struct data_config data_config; + struct vnic_ib_path_info path_info; + u32 sa_path_rec_get_timeout; + struct ib_device *ibdev; + u32 port; + u32 stats_interval; + u32 hb_interval; + u32 hb_timeout; + __be64 ioc_guid; + size_t path_idx; +}; + +/* + * primary_connect_timeout - if the secondary connects first, + * how long do we give the primary? + * primary_reconnect_timeout - same as above, but used when recovering + * from the case where both paths fail + * primary_switch_timeout - how long do we wait before switching to the + * primary when it comes back? 
+ */ +struct vnic_config { + struct vnic *vnic; + char name[IFNAMSIZ]; + u32 no_path_timeout; + u32 primary_connect_timeout; + u32 primary_reconnect_timeout; + u32 primary_switch_timeout; + int prefer_primary; + int use_rx_csum; + int use_tx_csum; +}; + +int config_start(void); +struct viport_config *config_alloc_viport(struct path_param *params); +struct vnic_config *config_alloc_vnic(void); +char *config_viport_name(struct viport_config *config); + +#endif /* VNIC_CONFIG_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:57:45 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:27:45 +0530 Subject: [openib-general] [PATCH v2 8/11] sysfs interface implementation Message-ID: <455A26C1.32306.613C121@ramachandra.kuchimanchi.qlogic.com> Adds the files that implement the sysfs interface of the driver. Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_sys.c | 786 ++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/vnic_sys.h | 54 ++ 2 files changed, 840 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_sys.c b/drivers/infiniband/ulp/vnic/vnic_sys.c new file mode 100644 index 0000000..805fff7 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_sys.c @@ -0,0 +1,786 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include + +#include "vnic_util.h" +#include "vnic_config.h" +#include "vnic_ib.h" +#include "vnic_viport.h" +#include "vnic_main.h" +#include "vnic_stats.h" + +extern struct list_head vnic_list; + +/* + * target eiocs are added by writing + * + * ioc_guid=,dgid=,pkey=,name= + * to the create_primary sysfs attribute. 
+ */ +enum { + VNIC_OPT_ERR = 0, + VNIC_OPT_IOC_GUID = 1 << 0, + VNIC_OPT_DGID = 1 << 1, + VNIC_OPT_PKEY = 1 << 2, + VNIC_OPT_NAME = 1 << 3, + VNIC_OPT_INSTANCE = 1 << 4, + VNIC_OPT_RXCSUM = 1 << 5, + VNIC_OPT_TXCSUM = 1 << 6, + VNIC_OPT_HEARTBEAT = 1 << 7, + VNIC_OPT_ALL = (VNIC_OPT_IOC_GUID | + VNIC_OPT_DGID | VNIC_OPT_NAME | VNIC_OPT_PKEY), +}; + +static match_table_t vnic_opt_tokens = { + {VNIC_OPT_IOC_GUID, "ioc_guid=%s"}, + {VNIC_OPT_DGID, "dgid=%s"}, + {VNIC_OPT_PKEY, "pkey=%x"}, + {VNIC_OPT_NAME, "name=%s"}, + {VNIC_OPT_INSTANCE, "instance=%d"}, + {VNIC_OPT_RXCSUM, "rx_csum=%s"}, + {VNIC_OPT_TXCSUM, "tx_csum=%s"}, + {VNIC_OPT_HEARTBEAT, "heartbeat=%d"}, + {VNIC_OPT_ERR, NULL} +}; + +static void vnic_release_class_dev(struct class_device *class_dev) +{ + struct class_dev_info *cdev_info = + container_of(class_dev, struct class_dev_info, class_dev); + + complete(&cdev_info->released); + +} + +struct class vnic_class = { + .name = "infiniband_vnic", + .release = vnic_release_class_dev +}; + +struct class_dev_info interface_cdev; + +static int vnic_parse_options(const char *buf, struct path_param *param) +{ + char *options, *sep_opt; + char *p; + char dgid[3]; + substring_t args[MAX_OPT_ARGS]; + int opt_mask = 0; + int token; + int ret = -EINVAL; + int i; + + options = kstrdup(buf, GFP_KERNEL); + if (!options) + return -ENOMEM; + + sep_opt = options; + while ((p = strsep(&sep_opt, ",")) != NULL) { + if (!*p) + continue; + + token = match_token(p, vnic_opt_tokens, args); + opt_mask |= token; + + switch (token) { + case VNIC_OPT_IOC_GUID: + p = match_strdup(args); + param->ioc_guid = cpu_to_be64(simple_strtoull(p, NULL, + 16)); + kfree(p); + break; + + case VNIC_OPT_DGID: + p = match_strdup(args); + if (strlen(p) != 32) { + printk(KERN_WARNING PFX + "bad dest GID parameter '%s'\n", p); + kfree(p); + goto out; + } + + for (i = 0; i < 16; ++i) { + strlcpy(dgid, p + i * 2, 3); + param->dgid[i] = simple_strtoul(dgid, NULL, + 16); + + } + kfree(p); + break; + + case 
VNIC_OPT_PKEY: + if (match_hex(args, &token)) { + printk(KERN_WARNING PFX + "bad P_key parameter '%s'\n", p); + goto out; + } + param->pkey = cpu_to_be16(token); + break; + + case VNIC_OPT_NAME: + p = match_strdup(args); + if (strlen(p) >= IFNAMSIZ) { + printk(KERN_WARNING PFX + "interface name parameter too long\n"); + kfree(p); + goto out; + } + strcpy(param->name, p); + kfree(p); + break; + case VNIC_OPT_INSTANCE: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad instance parameter '%s'\n", p); + goto out; + } + + if (token > 255 || token < 0) { + printk(KERN_WARNING PFX + "instance parameter must be" + " >= 0 and <= 255\n"); + goto out; + } + + param->instance = token; + break; + case VNIC_OPT_RXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->rx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->rx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad rx_csum parameter." + " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_TXCSUM: + p = match_strdup(args); + if (!strncmp(p, "true", 4)) + param->tx_csum = 1; + else if (!strncmp(p, "false", 5)) + param->tx_csum = 0; + else { + printk(KERN_WARNING PFX + "bad tx_csum parameter."
+ " must be 'true' or 'false'\n"); + kfree(p); + goto out; + } + kfree(p); + break; + case VNIC_OPT_HEARTBEAT: + if (match_int(args, &token)) { + printk(KERN_WARNING PFX + "bad heartbeat parameter '%s'\n", p); + goto out; + } + + if (token > 6000 || token < 0) { + printk(KERN_WARNING PFX + "heartbeat parameter must be" + " >= 0 and <= 6000\n"); + goto out; + } + param->heartbeat = token; + break; + default: + printk(KERN_WARNING PFX + "unknown parameter or missing value " + "'%s' in target creation request\n", p); + goto out; + } + + } + + if ((opt_mask & VNIC_OPT_ALL) == VNIC_OPT_ALL) + ret = 0; + else + for (i = 0; i < ARRAY_SIZE(vnic_opt_tokens); ++i) + if ((vnic_opt_tokens[i].token & VNIC_OPT_ALL) && + !(vnic_opt_tokens[i].token & opt_mask)) + printk(KERN_WARNING PFX + "target creation request is " + "missing parameter '%s'\n", + vnic_opt_tokens[i].pattern); + +out: + kfree(options); + return ret; + +} + +static ssize_t show_vnic_state(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, class_dev_info); + switch (vnic->state) { + case VNIC_UNINITIALIZED: + return sprintf(buf, "VNIC_UNINITIALIZED\n"); + case VNIC_REGISTERED: + return sprintf(buf, "VNIC_REGISTERED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static CLASS_DEVICE_ATTR(vnic_state, S_IRUGO, show_vnic_state, NULL); + +static ssize_t show_rx_csum(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, class_dev_info); + + if (vnic->config->use_rx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static CLASS_DEVICE_ATTR(rx_csum, S_IRUGO, show_rx_csum, NULL); + +static ssize_t show_tx_csum(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = +
container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, class_dev_info); + + if (vnic->config->use_tx_csum) + return sprintf(buf, "true\n"); + else + return sprintf(buf, "false\n"); +} + +static CLASS_DEVICE_ATTR(tx_csum, S_IRUGO, show_tx_csum, NULL); + +static ssize_t show_current_path(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, class_dev_info); + + if (vnic->current_path == &vnic->primary_path) + return sprintf(buf, "primary path\n"); + else if (vnic->current_path == &vnic->secondary_path) + return sprintf(buf, "secondary path\n"); + else + return sprintf(buf, "none\n"); + +} + +static CLASS_DEVICE_ATTR(current_path, S_IRUGO, show_current_path, NULL); + +static struct attribute * vnic_dev_attrs[] = { + &class_device_attr_vnic_state.attr, + &class_device_attr_rx_csum.attr, + &class_device_attr_tx_csum.attr, + &class_device_attr_current_path.attr, + NULL +}; + +struct attribute_group vnic_dev_attr_group = { + .attrs = vnic_dev_attrs, +}; + +static int create_netpath(struct netpath *npdest, + struct path_param *p_params) +{ + struct viport_config *viport_config; + struct viport *viport; + struct vnic *vnic; + struct list_head *ptr; + int ret = 0; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (vnic->primary_path.viport) { + viport_config = vnic->primary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && (viport_config->control_config.vnic_instance + == p_params->instance)) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + + if (vnic->secondary_path.viport) { + viport_config = vnic->secondary_path.viport->config; + if ((viport_config->ioc_guid == p_params->ioc_guid) + && 
(viport_config->control_config.vnic_instance + == p_params->instance)) { + SYS_ERROR("GUID %llx," + " INSTANCE %d already in use\n", + be64_to_cpu(p_params->ioc_guid), + p_params->instance); + ret = -EINVAL; + goto out; + } + } + } + + if (npdest->viport) { + SYS_ERROR("create_netpath: path already exists\n"); + ret = -EINVAL; + goto out; + } + + viport_config = config_alloc_viport(p_params); + if (!viport_config) { + SYS_ERROR("create_netpath: failed creating viport config\n"); + ret = -1; + goto out; + } + + /*User specified heartbeat value is in 1/100s of a sec*/ + if (p_params->heartbeat != -1) { + viport_config->hb_interval = + msecs_to_jiffies(p_params->heartbeat * 10); + viport_config->hb_timeout = + (p_params->heartbeat << 6) * 10000; /* usec */ + } + + viport_config->path_idx = 0; + + viport = viport_allocate(viport_config); + if (!viport) { + SYS_ERROR("create_netpath: failed creating viport\n"); + kfree(viport_config); + ret = -1; + goto out; + } + + npdest->viport = viport; + viport->parent = npdest; + viport->vnic = npdest->parent; + viport_kick(viport); + vnic_disconnected(npdest->parent, npdest); +out: + return ret; +} + +struct vnic *create_vnic(struct path_param *param) +{ + struct vnic_config *vnic_config; + struct vnic *vnic; + struct list_head *ptr; + + SYS_INFO("create_vnic: name = %s\n", param->name); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, param->name)) { + SYS_ERROR("vnic %s already exists\n", + param->name); + return NULL; + } + } + + vnic_config = config_alloc_vnic(); + if (!vnic_config) { + SYS_ERROR("create_vnic: failed creating vnic config\n"); + return NULL; + } + + if (param->rx_csum != -1) + vnic_config->use_rx_csum = param->rx_csum; + + if (param->tx_csum != -1) + vnic_config->use_tx_csum = param->tx_csum; + + strcpy(vnic_config->name, param->name); + vnic = vnic_allocate(vnic_config); + if (!vnic) { + SYS_ERROR("create_vnic: failed allocating vnic\n"); + 
goto free_vnic_config; + } + + init_completion(&vnic->class_dev_info.released); + + vnic->class_dev_info.class_dev.class = &vnic_class; + vnic->class_dev_info.class_dev.parent = &interface_cdev.class_dev; + snprintf(vnic->class_dev_info.class_dev.class_id, BUS_ID_SIZE, + vnic_config->name); + + if (class_device_register(&vnic->class_dev_info.class_dev)) { + SYS_ERROR("create_vnic: error in registering" + " vnic class dev\n"); + goto free_vnic; + } + + if (sysfs_create_group(&vnic->class_dev_info.class_dev.kobj, + &vnic_dev_attr_group)) { + SYS_ERROR("create_vnic: error in creating" + "vnic attr group\n"); + goto err_attr; + + } + + if (vnic_setup_stats_files(vnic)) + goto err_stats; + + return vnic; +err_stats: + sysfs_remove_group(&vnic->class_dev_info.class_dev.kobj, + &vnic_dev_attr_group); +err_attr: + class_device_unregister(&vnic->class_dev_info.class_dev); + wait_for_completion(&vnic->class_dev_info.released); +free_vnic: + list_del(&vnic->list_ptrs); + kfree(vnic); +free_vnic_config: + kfree(vnic_config); + return NULL; +} + +ssize_t vnic_delete(struct class_device * class_dev, + const char *buf, size_t count) +{ + struct vnic *vnic; + struct list_head *ptr; + int ret = -EINVAL; + + if (count > IFNAMSIZ) { + printk(KERN_WARNING PFX "invalid vnic interface name\n"); + return ret; + } + + SYS_INFO("vnic_delete: name = %s\n", buf); + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strcmp(vnic->config->name, buf)) { + vnic_free(vnic); + return count; + } + } + + printk(KERN_WARNING PFX "vnic interface '%s' does not exist\n", buf); + return ret; +} + +static ssize_t show_viport_state(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct netpath *path = + container_of(info, struct netpath, class_dev_info); + switch (path->viport->state) { + case VIPORT_DISCONNECTED: + return sprintf(buf, "VIPORT_DISCONNECTED\n"); + case 
VIPORT_CONNECTED: + return sprintf(buf, "VIPORT_CONNECTED\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + } + +} + +static CLASS_DEVICE_ATTR(viport_state, S_IRUGO, show_viport_state, NULL); + +static ssize_t show_link_state(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct netpath *path = + container_of(info, struct netpath, class_dev_info); + + switch (path->viport->link_state) { + case LINK_UNINITIALIZED: + return sprintf(buf, "LINK_UNINITIALIZED\n"); + case LINK_INITIALIZE: + return sprintf(buf, "LINK_INITIALIZE\n"); + case LINK_INITIALIZECONTROL: + return sprintf(buf, "LINK_INITIALIZECONTROL\n"); + case LINK_INITIALIZEDATA: + return sprintf(buf, "LINK_INITIALIZEDATA\n"); + case LINK_CONTROLCONNECT: + return sprintf(buf, "LINK_CONTROLCONNECT\n"); + case LINK_CONTROLCONNECTWAIT: + return sprintf(buf, "LINK_CONTROLCONNECTWAIT\n"); + case LINK_INITVNICREQ: + return sprintf(buf, "LINK_INITVNICREQ\n"); + case LINK_INITVNICRSP: + return sprintf(buf, "LINK_INITVNICRSP\n"); + case LINK_BEGINDATAPATH: + return sprintf(buf, "LINK_BEGINDATAPATH\n"); + case LINK_CONFIGDATAPATHREQ: + return sprintf(buf, "LINK_CONFIGDATAPATHREQ\n"); + case LINK_CONFIGDATAPATHRSP: + return sprintf(buf, "LINK_CONFIGDATAPATHRSP\n"); + case LINK_DATACONNECT: + return sprintf(buf, "LINK_DATACONNECT\n"); + case LINK_DATACONNECTWAIT: + return sprintf(buf, "LINK_DATACONNECTWAIT\n"); + case LINK_XCHGPOOLREQ: + return sprintf(buf, "LINK_XCHGPOOLREQ\n"); + case LINK_XCHGPOOLRSP: + return sprintf(buf, "LINK_XCHGPOOLRSP\n"); + case LINK_INITIALIZED: + return sprintf(buf, "LINK_INITIALIZED\n"); + case LINK_IDLE: + return sprintf(buf, "LINK_IDLE\n"); + case LINK_IDLING: + return sprintf(buf, "LINK_IDLING\n"); + case LINK_CONFIGLINKREQ: + return sprintf(buf, "LINK_CONFIGLINKREQ\n"); + case LINK_CONFIGLINKRSP: + return sprintf(buf, "LINK_CONFIGLINKRSP\n"); + case LINK_CONFIGADDRSREQ: + return 
sprintf(buf, "LINK_CONFIGADDRSREQ\n"); + case LINK_CONFIGADDRSRSP: + return sprintf(buf, "LINK_CONFIGADDRSRSP\n"); + case LINK_REPORTSTATREQ: + return sprintf(buf, "LINK_REPORTSTATREQ\n"); + case LINK_REPORTSTATRSP: + return sprintf(buf, "LINK_REPORTSTATRSP\n"); + case LINK_HEARTBEATREQ: + return sprintf(buf, "LINK_HEARTBEATREQ\n"); + case LINK_HEARTBEATRSP: + return sprintf(buf, "LINK_HEARTBEATRSP\n"); + case LINK_RESET: + return sprintf(buf, "LINK_RESET\n"); + case LINK_RESETRSP: + return sprintf(buf, "LINK_RESETRSP\n"); + case LINK_RESETCONTROL: + return sprintf(buf, "LINK_RESETCONTROL\n"); + case LINK_RESETCONTROLRSP: + return sprintf(buf, "LINK_RESETCONTROLRSP\n"); + case LINK_DATADISCONNECT: + return sprintf(buf, "LINK_DATADISCONNECT\n"); + case LINK_CONTROLDISCONNECT: + return sprintf(buf, "LINK_CONTROLDISCONNECT\n"); + case LINK_CLEANUPDATA: + return sprintf(buf, "LINK_CLEANUPDATA\n"); + case LINK_CLEANUPCONTROL: + return sprintf(buf, "LINK_CLEANUPCONTROL\n"); + case LINK_DISCONNECTED: + return sprintf(buf, "LINK_DISCONNECTED\n"); + case LINK_RETRYWAIT: + return sprintf(buf, "LINK_RETRYWAIT\n"); + default: + return sprintf(buf, "INVALID STATE\n"); + + } + +} +static CLASS_DEVICE_ATTR(link_state, S_IRUGO, show_link_state, NULL); + +static ssize_t show_heartbeat(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + + struct netpath *path = + container_of(info, struct netpath, class_dev_info); + + /* hb_inteval is in jiffies, convert it back to + * 1/100ths of a second + */ + return sprintf(buf, "%d\n", + (jiffies_to_msecs(path->viport->config->hb_interval)/10)); +} + +static CLASS_DEVICE_ATTR(heartbeat, S_IRUGO, show_heartbeat, NULL); + +static struct attribute * vnic_path_attrs[] = { + &class_device_attr_viport_state.attr, + &class_device_attr_link_state.attr, + &class_device_attr_heartbeat.attr, + NULL +}; + +struct attribute_group vnic_path_attr_group = { + .attrs = 
vnic_path_attrs, +}; + + +static int setup_path_class_files(struct netpath *path, char *name) +{ + init_completion(&path->class_dev_info.released); + + path->class_dev_info.class_dev.class = &vnic_class; + path->class_dev_info.class_dev.parent = + &path->parent->class_dev_info.class_dev; + snprintf(path->class_dev_info.class_dev.class_id, + BUS_ID_SIZE, name); + + if (class_device_register(&path->class_dev_info.class_dev)) { + SYS_ERROR("error in registering path class dev\n"); + goto out; + } + + if (sysfs_create_group(&path->class_dev_info.class_dev.kobj, + &vnic_path_attr_group)) { + SYS_ERROR("error in creating vnic path group attrs"); + goto err_path; + } + + return 0; + +err_path: + class_device_unregister(&path->class_dev_info.class_dev); + wait_for_completion(&path->class_dev_info.released); +out: + return -1; + +} + +ssize_t vnic_create_primary(struct class_device * class_dev, + const char *buf, size_t count) +{ + struct class_dev_info *cdev = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic_ib_port *target = + container_of(cdev, struct vnic_ib_port, cdev_info); + + struct path_param param; + int ret = -EINVAL; + struct vnic *vnic; + + param.instance = 0; + param.rx_csum = -1; + param.tx_csum = -1; + param.heartbeat = -1; + + ret = vnic_parse_options(buf, &param); + + if (ret) + goto out; + + param.ibdev = target->dev->dev; + param.ibport = target; + param.port = target->port_num; + + vnic = create_vnic(&param); + if (!vnic) { + printk(KERN_ERR PFX "creating vnic failed\n"); + ret = -EINVAL; + goto out; + } + + if (create_netpath(&vnic->primary_path, &param)) { + printk(KERN_ERR PFX "creating primary netpath failed\n"); + goto free_vnic; + } + + if (setup_path_class_files(&vnic->primary_path, "primary_path")) + goto free_vnic; + + if (vnic && !vnic->primary_path.viport) { + printk(KERN_ERR PFX "no valid netpaths\n"); + goto free_vnic; + } + + return count; + +free_vnic: + vnic_free(vnic); + ret = -EINVAL; +out: + return ret; +} + +ssize_t
vnic_create_secondary(struct class_device * class_dev, + const char *buf, size_t count) +{ + struct class_dev_info *cdev = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic_ib_port *target = + container_of(cdev, struct vnic_ib_port, cdev_info); + + struct path_param param; + struct vnic *vnic; + int ret = -EINVAL; + struct list_head *ptr; + int found = 0; + + param.instance = 0; + param.rx_csum = -1; + param.tx_csum = -1; + param.heartbeat = -1; + + ret = vnic_parse_options(buf, &param); + + if (ret) + goto out; + + list_for_each(ptr, &vnic_list) { + vnic = list_entry(ptr, struct vnic, list_ptrs); + if (!strncmp(vnic->config->name, param.name, IFNAMSIZ)) { + found = 1; + break; + } + } + + if (!found) { + printk(KERN_ERR PFX + "primary connection with name '%s' does not exist\n", + param.name); + ret = -EINVAL; + goto out; + } + + param.ibdev = target->dev->dev; + param.ibport = target; + param.port = target->port_num; + + if (create_netpath(&vnic->secondary_path, &param)) { + printk(KERN_ERR PFX "creating secondary netpath failed\n"); + ret = -EINVAL; + goto out; + } + + if (setup_path_class_files(&vnic->secondary_path, "secondary_path")) + goto free_vnic; + + return count; + +free_vnic: + vnic_free(vnic); + ret = -EINVAL; +out: + return ret; +} diff --git a/drivers/infiniband/ulp/vnic/vnic_sys.h b/drivers/infiniband/ulp/vnic/vnic_sys.h new file mode 100644 index 0000000..eaa136c --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_sys.h @@ -0,0 +1,54 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses.
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_SYS_H_INCLUDED +#define VNIC_SYS_H_INCLUDED + +struct class_dev_info { + struct class_device class_dev; + struct completion released; +}; + +extern struct class vnic_class; +extern struct class_dev_info interface_cdev; +extern struct attribute_group vnic_dev_attr_group; +extern struct attribute_group vnic_path_attr_group; + +extern ssize_t vnic_create_primary(struct class_device *class_dev, + const char *buf, size_t count); + +extern ssize_t vnic_create_secondary(struct class_device *class_dev, + const char *buf, size_t count); + +extern ssize_t vnic_delete(struct class_device *class_dev, + const char *buf, size_t count); +#endif /*VNIC_SYS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:58:14 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:28:14 +0530 Subject: [openib-general] [PATCH v2 9/11] Statistics collection implementation Message-ID: <455A26DE.14382.6143037@ramachandra.kuchimanchi.qlogic.com> Adds the files that implement collection of statistics. Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_stats.c | 226 ++++++++++++++ drivers/infiniband/ulp/vnic/vnic_stats.h | 488 ++++++++++++++++++++++++++++++ 2 files changed, 714 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_stats.c b/drivers/infiniband/ulp/vnic/vnic_stats.c new file mode 100644 index 0000000..f44b19b --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_stats.c @@ -0,0 +1,226 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#include +#include +#include + +#include "vnic_main.h" + +cycles_t recv_ref; + +/* + * TODO: Statistics reporting for control path, data path, + * RDMA times, IOs etc + * + */ +static ssize_t show_lifetime(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time = get_cycles() - vnic->statistics.start_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static CLASS_DEVICE_ATTR(lifetime, S_IRUGO, show_lifetime, NULL); + +static ssize_t show_conntime(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + if (vnic->statistics.conn_time) + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.conn_time); + return 0; +} + +static CLASS_DEVICE_ATTR(connection_time, S_IRUGO, show_conntime, NULL); + +static ssize_t show_disconnects(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.disconn_ref) + num = vnic->statistics.disconn_num + 1; + else + num = vnic->statistics.disconn_num; + + return sprintf(buf, "%d\n", num); +} + +static CLASS_DEVICE_ATTR(disconnects, S_IRUGO, show_disconnects, NULL); + +static ssize_t show_total_disconn_time(struct class_device *class_dev, + char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.disconn_ref) + time = vnic->statistics.disconn_time + + get_cycles() - vnic->statistics.disconn_ref; + else + time = vnic->statistics.disconn_time; + + 
return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static CLASS_DEVICE_ATTR(total_disconn_time, S_IRUGO, + show_total_disconn_time, NULL); + +static ssize_t show_carrier_losses(struct class_device *class_dev, + char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + u32 num; + + if (vnic->statistics.carrier_ref) + num = vnic->statistics.carrier_off_num + 1; + else + num = vnic->statistics.carrier_off_num; + + return sprintf(buf, "%d\n", num); +} + +static CLASS_DEVICE_ATTR(carrier_losses, S_IRUGO, + show_carrier_losses, NULL); + +static ssize_t show_total_carr_loss_time(struct class_device *class_dev, + char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + cycles_t time; + + if (vnic->statistics.carrier_ref) + time = vnic->statistics.carrier_off_time + + get_cycles() - vnic->statistics.carrier_ref; + else + time = vnic->statistics.carrier_off_time; + + return sprintf(buf, "%llu\n", (unsigned long long)time); +} + +static CLASS_DEVICE_ATTR(total_carrier_loss_time, S_IRUGO, + show_total_carr_loss_time, NULL); + +static ssize_t show_total_recv_time(struct class_device *class_dev, + char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.recv_time); +} + +static CLASS_DEVICE_ATTR(total_recv_time, S_IRUGO, + show_total_recv_time, NULL); + +static ssize_t show_recvs(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.recv_num); +} + 
+static CLASS_DEVICE_ATTR(recvs, S_IRUGO, show_recvs, NULL); + +static ssize_t show_total_xmit_time(struct class_device *class_dev, + char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%llu\n", + (unsigned long long)vnic->statistics.xmit_time); +} + +static CLASS_DEVICE_ATTR(total_xmit_time, S_IRUGO, + show_total_xmit_time, NULL); + +static ssize_t show_xmits(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_num); +} + +static CLASS_DEVICE_ATTR(xmits, S_IRUGO, show_xmits, NULL); + +static ssize_t show_failed_xmits(struct class_device *class_dev, char *buf) +{ + struct class_dev_info *info = + container_of(class_dev, struct class_dev_info, class_dev); + struct vnic *vnic = container_of(info, struct vnic, stat_info); + + return sprintf(buf, "%d\n", vnic->statistics.xmit_fail); +} + +static CLASS_DEVICE_ATTR(failed_xmits, S_IRUGO, show_failed_xmits, NULL); + +static struct attribute * vnic_stats_attrs[] = { + &class_device_attr_lifetime.attr, + &class_device_attr_xmits.attr, + &class_device_attr_total_xmit_time.attr, + &class_device_attr_failed_xmits.attr, + &class_device_attr_recvs.attr, + &class_device_attr_total_recv_time.attr, + &class_device_attr_connection_time.attr, + &class_device_attr_disconnects.attr, + &class_device_attr_total_disconn_time.attr, + &class_device_attr_carrier_losses.attr, + &class_device_attr_total_carrier_loss_time.attr, + NULL +}; + +struct attribute_group vnic_stats_attr_group = { + .attrs = vnic_stats_attrs, +}; diff --git a/drivers/infiniband/ulp/vnic/vnic_stats.h b/drivers/infiniband/ulp/vnic/vnic_stats.h new file mode 100644 index 0000000..bd05651 --- /dev/null +++ 
b/drivers/infiniband/ulp/vnic/vnic_stats.h @@ -0,0 +1,488 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#ifndef VNIC_STATS_H_INCLUDED +#define VNIC_STATS_H_INCLUDED + +#include "vnic_main.h" +#include "vnic_ib.h" + +#ifdef CONFIG_INFINIBAND_VNIC_STATS + +extern struct attribute_group vnic_stats_attr_group; +extern cycles_t recv_ref; + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + if (vnic->statistics.conn_time == 0) { + vnic->statistics.conn_time = + get_cycles() - vnic->statistics.start_time; + } + + if (vnic->statistics.disconn_ref != 0) { + vnic->statistics.disconn_time += + get_cycles() - vnic->statistics.disconn_ref; + vnic->statistics.disconn_num++; + vnic->statistics.disconn_ref = 0; + } + +} + +static inline void vnic_stop_xmit_stats(struct vnic * vnic) +{ + if (vnic->statistics.xmit_ref == 0) + vnic->statistics.xmit_ref = get_cycles(); +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + if (vnic->statistics.xmit_ref != 0) { + vnic->statistics.xmit_off_time += + get_cycles() - vnic->statistics.xmit_ref; + vnic->statistics.xmit_off_num++; + vnic->statistics.xmit_ref = 0; + } +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + vnic->statistics.recv_time += get_cycles() - recv_ref; + vnic->statistics.recv_num++; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + vnic->statistics.xmit_time += get_cycles() - time; + vnic->statistics.xmit_num++; + +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + vnic->statistics.xmit_fail++; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + if (vnic->statistics.carrier_ref != 0) { + vnic->statistics.carrier_off_time += + get_cycles() - vnic->statistics.carrier_ref; + vnic->statistics.carrier_off_num++; + vnic->statistics.carrier_ref = 0; + } +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + init_completion(&vnic->stat_info.released); + vnic->stat_info.class_dev.class = 
&vnic_class; + vnic->stat_info.class_dev.parent = &vnic->class_dev_info.class_dev; + snprintf(vnic->stat_info.class_dev.class_id, BUS_ID_SIZE, + "stats"); + + if (class_device_register(&vnic->stat_info.class_dev)) { + SYS_ERROR("create_vnic: error in registering" + " stat class dev\n"); + goto stats_out; + } + + if (sysfs_create_group(&vnic->stat_info.class_dev.kobj, + &vnic_stats_attr_group)) + goto err_stats_file; + + return 0; +err_stats_file: + class_device_unregister(&vnic->stat_info.class_dev); + wait_for_completion(&vnic->stat_info.released); +stats_out: + return -1; +} + +static inline void vnic_cleanup_stats_files(struct vnic * vnic) +{ + sysfs_remove_group(&vnic->class_dev_info.class_dev.kobj, + &vnic_stats_attr_group); + class_device_unregister(&vnic->stat_info.class_dev); + wait_for_completion(&vnic->stat_info.released); +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + if (!vnic->statistics.disconn_ref) + vnic->statistics.disconn_ref = get_cycles(); + + if (vnic->statistics.carrier_ref == 0) + vnic->statistics.carrier_ref = get_cycles(); +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + vnic->statistics.start_time = get_cycles(); +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + response_time -= control->statistics.request_time; + control->statistics.response_time += response_time; + control->statistics.response_num++; + if (control->statistics.response_max < response_time) + control->statistics.response_max = response_time; + if ((control->statistics.response_min == 0) || + (control->statistics.response_min > response_time)) + control->statistics.response_min = response_time; + +} + +static inline void control_note_reqtime_stats(struct control * control) +{ + control->statistics.request_time = get_cycles(); +} + +static inline void control_timeout_stats(struct 
control *control) +{ + control->statistics.timeout_num++; +} + +static inline void data_kickreq_stats(struct data * data) +{ + data->statistics.kick_reqs++; +} + +static inline void data_no_xmitbuf_stats(struct data * data) +{ + data->statistics.no_xmit_bufs++; +} + +static inline void data_xmits_stats(struct data * data) +{ + data->statistics.xmit_num++; +} + +static inline void data_recvs_stats(struct data * data) +{ + data->statistics.recv_num++; +} + +static inline void data_note_kickrcv_time(void) +{ + recv_ref = get_cycles(); +} + +static inline void data_rcvkicks_stats(struct data * data) +{ + data->statistics.kick_recvs++; +} + + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = get_cycles(); +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + *time = get_cycles(); +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.num_callbacks++; +} + +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ib_conn->statistics.num_ios++; + *comp_num = *comp_num + 1; + +} + +static inline void vnic_ib_io_stats(struct io * io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + if (io->type == RECV) + io->time = comp_time; + else if (io->type == RDMA) { + ib_conn->statistics.rdma_comp_time += comp_time - io->time; + ib_conn->statistics.rdma_comp_ios++; + } else if (io->type == SEND) { + ib_conn->statistics.send_comp_time += comp_time - io->time; + ib_conn->statistics.send_comp_ios++; + } +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + if (comp_num > ib_conn->statistics.max_ios) + ib_conn->statistics.max_ios = comp_num; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ib_conn->statistics.connection_time = + get_cycles() - ib_conn->statistics.connection_time; + +} + +static inline void 
vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + *time = get_cycles(); + if (io->time != 0) { + ib_conn->statistics.recv_comp_time += *time - io->time; + ib_conn->statistics.recv_comp_ios++; + } + +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ib_conn->statistics.recv_post_time += get_cycles() - time; + ib_conn->statistics.recv_post_ios++; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + io->time = *time = get_cycles(); +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + time = get_cycles() - time; + if (io->swr.opcode == IB_WR_RDMA_WRITE) { + ib_conn->statistics.rdma_post_time += time; + ib_conn->statistics.rdma_post_ios++; + } else { + ib_conn->statistics.send_post_time += time; + ib_conn->statistics.send_post_ios++; + } +} +#else /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +static inline void vnic_connected_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_stop_xmit_stats(struct vnic * vnic) +{ + ; +} + +static inline void vnic_restart_xmit_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_recv_pkt_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_pre_pkt_xmit_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_post_pkt_xmit_stats(struct vnic *vnic, + cycles_t time) +{ + ; +} + +static inline void vnic_xmit_fail_stats(struct vnic *vnic) +{ + ; +} + +static inline int vnic_setup_stats_files(struct vnic *vnic) +{ + return 0; +} + +static inline void vnic_cleanup_stats_files(struct vnic * vnic) +{ + ; +} + +static inline void vnic_carrier_loss_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_disconn_stats(struct vnic *vnic) +{ + ; +} + +static inline void vnic_alloc_stats(struct vnic *vnic) +{ + ; +} + +static inline void control_note_rsptime_stats(cycles_t *time) +{ + ; +} + +static inline void 
control_update_rsptime_stats(struct control *control, + cycles_t response_time) +{ + ; +} + +static inline void control_note_reqtime_stats(struct control * control) +{ + ; +} + +static inline void control_timeout_stats(struct control *control) +{ + ; +} + +static inline void data_kickreq_stats(struct data * data) +{ + ; +} + +static inline void data_no_xmitbuf_stats(struct data * data) +{ + ; +} + +static inline void data_xmits_stats(struct data * data) +{ + ; +} + +static inline void data_recvs_stats(struct data * data) +{ + ; +} + +static inline void data_note_kickrcv_time(void) +{ + ; +} + +static inline void data_rcvkicks_stats(struct data * data) +{ + ; +} + +static inline void vnic_ib_conntime_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_note_comptime_stats(cycles_t *time) +{ + ; +} + +static inline void vnic_ib_callback_stats(struct vnic_ib_conn *ib_conn) + +{ + ; +} +static inline void vnic_ib_comp_stats(struct vnic_ib_conn *ib_conn, + u32 *comp_num) +{ + ; +} + +static inline void vnic_ib_io_stats(struct io * io, + struct vnic_ib_conn *ib_conn, + cycles_t comp_time) +{ + ; +} + +static inline void vnic_ib_maxio_stats(struct vnic_ib_conn *ib_conn, + u32 comp_num) +{ + ; +} + +static inline void vnic_ib_connected_time_stats(struct vnic_ib_conn *ib_conn) +{ + ; +} + +static inline void vnic_ib_pre_rcvpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_rcvpost_stats(struct vnic_ib_conn *ib_conn, + cycles_t time) +{ + ; +} + +static inline void vnic_ib_pre_sendpost_stats(struct io *io, + cycles_t *time) +{ + ; +} + +static inline void vnic_ib_post_sendpost_stats(struct vnic_ib_conn *ib_conn, + struct io *io, + cycles_t time) +{ + ; +} +#endif /*CONFIG_INIFINIBAND_VNIC_STATS*/ + +#endif /*VNIC_STATS_H_INCLUDED*/ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:58:42 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 
20:28:42 +0530 Subject: [openib-general] [PATCH v2 10/11] Driver utility file - implements various utility macros Message-ID: <455A26FA.3533.6149E52@ramachandra.kuchimanchi.qlogic.com> Adds the driver utility file. This file contains utility macros for debugging etc Signed-off-by: Ramachandra K --- drivers/infiniband/ulp/vnic/vnic_util.h | 231 +++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/vnic/vnic_util.h b/drivers/infiniband/ulp/vnic/vnic_util.h new file mode 100644 index 0000000..acbb0f4 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/vnic_util.h @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 QLogic, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef VNIC_UTIL_H_INCLUDED +#define VNIC_UTIL_H_INCLUDED + +#define MODULE_NAME "VNIC" + +#define VNIC_MAJORVERSION 1 +#define VNIC_MINORVERSION 1 + +#define is_power_of2(value) (((value) & ((value - 1))) == 0) +#define ALIGN_DOWN(x, a) ((x)&(~((a)-1))) + +extern u32 vnic_debug; + +enum { + DEBUG_IB_INFO = 0x00000001, + DEBUG_IB_FUNCTION = 0x00000002, + DEBUG_IB_FSTATUS = 0x00000004, + DEBUG_IB_ASSERTS = 0x00000008, + DEBUG_CONTROL_INFO = 0x00000010, + DEBUG_CONTROL_FUNCTION = 0x00000020, + DEBUG_CONTROL_PACKET = 0x00000040, + DEBUG_CONFIG_INFO = 0x00000100, + DEBUG_DATA_INFO = 0x00001000, + DEBUG_DATA_FUNCTION = 0x00002000, + DEBUG_NETPATH_INFO = 0x00010000, + DEBUG_VIPORT_INFO = 0x00100000, + DEBUG_VIPORT_FUNCTION = 0x00200000, + DEBUG_LINK_STATE = 0x00400000, + DEBUG_VNIC_INFO = 0x01000000, + DEBUG_VNIC_FUNCTION = 0x02000000, + DEBUG_SYS_INFO = 0x10000000, + DEBUG_SYS_VERBOSE = 0x40000000 +}; + +#ifdef CONFIG_INFINIBAND_VNIC_DEBUG +#define PRINT(level, x, fmt, arg...) \ + printk(level "%s: %s: %s, line %d: " fmt, \ + MODULE_NAME, x, __FILE__, __LINE__, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: %s, line %d: " fmt, \ + MODULE_NAME, x, __FILE__, __LINE__, \ + ##arg); \ + } while(0) +#else +#define PRINT(level, x, fmt, arg...) \ + printk(level "%s: " fmt, MODULE_NAME, ##arg) + +#define PRINT_CONDITIONAL(level, x, condition, fmt, arg...) \ + do { \ + if (condition) \ + printk(level "%s: %s: " fmt, \ + MODULE_NAME, x, ##arg); \ + } while(0) +#endif /*CONFIG_INFINIBAND_VNIC_DEBUG*/ + +#define IB_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "IB", fmt, ##arg) +#define IB_ERROR(fmt, arg...) 
\ + PRINT(KERN_ERR, "IB", fmt, ##arg) + +#define IB_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_FUNCTION), \ + fmt, ##arg) + +#define IB_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "IB", \ + (vnic_debug & DEBUG_IB_INFO), \ + fmt, ##arg) + +#define IB_ASSERT(x) \ + do { \ + if ((vnic_debug & DEBUG_IB_ASSERTS) && !(x)) \ + panic("%s assertion failed, file: %s," \ + " line %d: ", \ + MODULE_NAME,__FILE__,__LINE__) \ + } while(0) + +#define CONTROL_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONTROL", fmt, ##arg) +#define CONTROL_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONTROL", fmt, ##arg) + +#define CONTROL_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_INFO), \ + fmt, ##arg) + +#define CONTROL_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONTROL", \ + (vnic_debug & DEBUG_CONTROL_FUNCTION), \ + fmt, ##arg) + +#define CONTROL_PACKET(pkt) \ + do { \ + if (vnic_debug & DEBUG_CONTROL_PACKET) \ + control_log_control_packet(pkt); \ + } while(0) + +#define CONFIG_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "CONFIG", fmt, ##arg) +#define CONFIG_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "CONFIG", fmt, ##arg) + +#define CONFIG_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "CONFIG", \ + (vnic_debug & DEBUG_CONFIG_INFO), \ + fmt, ##arg) + +#define DATA_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "DATA", fmt, ##arg) +#define DATA_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "DATA", fmt, ##arg) + +#define DATA_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_INFO), \ + fmt, ##arg) + +#define DATA_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "DATA", \ + (vnic_debug & DEBUG_DATA_FUNCTION), \ + fmt, ##arg) + +#define NETPATH_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NETPATH", fmt, ##arg) +#define NETPATH_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NETPATH", fmt, ##arg) + +#define NETPATH_INFO(fmt, arg...) 
\ + PRINT_CONDITIONAL(KERN_INFO, \ + "NETPATH", \ + (vnic_debug & DEBUG_NETPATH_INFO), \ + fmt, ##arg) + +#define VIPORT_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "VIPORT", fmt, ##arg) +#define VIPORT_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "VIPORT", fmt, ##arg) + +#define VIPORT_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_INFO), \ + fmt, ##arg) + +#define VIPORT_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "VIPORT", \ + (vnic_debug & DEBUG_VIPORT_FUNCTION), \ + fmt, ##arg) + +#define LINK_STATE(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "LINK", \ + (vnic_debug & DEBUG_LINK_STATE), \ + fmt, ##arg) + +#define VNIC_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) +#define VNIC_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "NIC", fmt, ##arg) +#define VNIC_INIT(fmt, arg...) \ + PRINT(KERN_INFO, "NIC", fmt, ##arg) + +#define VNIC_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_INFO), \ + fmt, ##arg) + +#define VNIC_FUNCTION(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "NIC", \ + (vnic_debug & DEBUG_VNIC_FUNCTION), \ + fmt, ##arg) + +#define SYS_PRINT(fmt, arg...) \ + PRINT(KERN_INFO, "SYS", fmt, ##arg) +#define SYS_ERROR(fmt, arg...) \ + PRINT(KERN_ERR, "SYS", fmt, ##arg) + +#define SYS_INFO(fmt, arg...) \ + PRINT_CONDITIONAL(KERN_INFO, \ + "SYS", \ + (vnic_debug & DEBUG_SYS_INFO), \ + fmt, ##arg) + +#endif /* VNIC_UTIL_H_INCLUDED */ From ramachandra.kuchimanchi at qlogic.com Tue Nov 14 06:59:09 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Tue, 14 Nov 2006 20:29:09 +0530 Subject: [openib-general] [PATCH v2 11/11] Kconfig/Makefile. Modifications to toplevel Kconfig/Makefile Message-ID: <455A2715.31314.61508A5@ramachandra.kuchimanchi.qlogic.com> Adds the Kconfig and Makefile for the driver. Modifies the top level Infiniband Kconfig and Makefile to include VNIC. 
Signed-off-by: Ramachandra K --- drivers/infiniband/Kconfig | 2 ++ drivers/infiniband/Makefile | 1 + drivers/infiniband/ulp/vnic/Kconfig | 28 ++++++++++++++++++++++++++++ drivers/infiniband/ulp/vnic/Makefile | 12 ++++++++++++ 4 files changed, 43 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 9edface..5676c6a 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -45,4 +45,6 @@ source "drivers/infiniband/ulp/srp/Kconf source "drivers/infiniband/ulp/iser/Kconfig" +source "drivers/infiniband/ulp/vnic/Kconfig" + endmenu diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index 2b5d109..5407878 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ +obj-$(CONFIG_INFINIBAND_VNIC) += ulp/vnic/ diff --git a/drivers/infiniband/ulp/vnic/Kconfig b/drivers/infiniband/ulp/vnic/Kconfig new file mode 100644 index 0000000..39f88a3 --- /dev/null +++ b/drivers/infiniband/ulp/vnic/Kconfig @@ -0,0 +1,28 @@ +config INFINIBAND_VNIC + tristate "VNIC - Support for QLogic Virtual Ethernet I/O Controller" + depends on INFINIBAND && NETDEVICES && INET + ---help--- + Support for the QLogic Virtual Ethernet I/O Controller + (VEx). In conjunction with the VEx, this provides virtual + ethernet interfaces and transports ethernet packets over + InfiniBand so that you can communicate with Ethernet networks + using your IB device. + +config INFINIBAND_VNIC_DEBUG + bool "VNIC Verbose debugging" + depends on INFINIBAND_VNIC + default n + ---help--- + This option causes verbose debugging code to be compiled + into the VNIC driver. The output can be turned on via the + vnic_debug module parameter. 
+ +config INFINIBAND_VNIC_STATS + bool "VNIC Statistics" + depends on INFINIBAND_VNIC + default n + ---help--- + This option compiles statistics collecting code into the + data path of the VNIC driver to help in profiling and fine + tuning. This adds some overhead in the interest of gathering + data. diff --git a/drivers/infiniband/ulp/vnic/Makefile b/drivers/infiniband/ulp/vnic/Makefile new file mode 100644 index 0000000..27aafae --- /dev/null +++ b/drivers/infiniband/ulp/vnic/Makefile @@ -0,0 +1,12 @@ +obj-$(CONFIG_INFINIBAND_VNIC) += ib_vnic.o + +ib_vnic-y := vnic_main.o \ + vnic_ib.o \ + vnic_viport.o \ + vnic_control.o \ + vnic_data.o \ + vnic_netpath.o \ + vnic_config.o \ + vnic_sys.o + +ib_vnic-$(CONFIG_INFINIBAND_VNIC_STATS) += vnic_stats.o From mst at mellanox.co.il Tue Nov 14 07:43:07 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Tue, 14 Nov 2006 17:43:07 +0200 Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) In-Reply-To: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> References: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> Message-ID: <20061114154307.GE27446@mellanox.co.il> Quoting Ramachandra K : > Subject: [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) > > This patch set adds support for the QLogic Virtual Ethernet I/O > controller (VEx), which presents a true Ethernet NIC to the host. > > This driver provides a standard Ethernet NIC interface to the system and > treats IB as an I/O bus to allow a host CPU to use the VEx card as its NIC. Is the VEx wire protocol documented somewhere? For example, what is a viport? What is a netpath? It's somewhat hard to understand the code without the protocol spec it is trying to implement. 
-- MST From bos at pathscale.com Tue Nov 14 10:11:25 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 14 Nov 2006 10:11:25 -0800 Subject: [openib-general] [PATCH] Fix ipath build on ia64 Message-ID: <455A06CD.8090205@pathscale.com> I've already sent this patch to Andrew privately. Compile-tested on ia64, and works fine on x86_64. From bos at pathscale.com Tue Nov 14 10:31:40 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Tue, 14 Nov 2006 10:31:40 -0800 Subject: [openib-general] [PATCH v2 1/11] Driver Main files - netdev functions and corresponding state maintenance In-Reply-To: <455A254A.32264.60F0E7E@ramachandra.kuchimanchi.qlogic.com> References: <455A254A.32264.60F0E7E@ramachandra.kuchimanchi.qlogic.com> Message-ID: <455A0B8C.6020900@pathscale.com> Ramachandra K wrote: > +void vnic_connected(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_connected()\n"); > + if (netpath->second_bias) > + vnic_npevent_queue_evt(netpath, VNIC_SECNP_CONNECTED); > + else > + vnic_npevent_queue_evt(netpath, VNIC_PRINP_CONNECTED); > + > + vnic_connected_stats(vnic); > +} This driver looks rather like it's duplicating at least some of the functionality provided by the bonding driver. Could you explain the differences between the two approaches, please? > +void vnic_stop_xmit(struct vnic *vnic, struct netpath *netpath) > +{ > + VNIC_FUNCTION("vnic_stop_xmit()\n"); > + if (netpath == vnic->current_path) { > + if (vnic->xmit_started) { > + netif_stop_queue(&vnic->netdevice); > + vnic->xmit_started = 0; > + } > + > + vnic_stop_xmit_stats(vnic); > + } > +} Why is there an asymmetry between some of the routines, which will operate on either netpath depending on the bias setting, and others like this, which only operate on the current netpath and silently do nothing otherwise? This is at least confusing. 
> +static struct vnic * vnic_handle_npevent(struct vnic *vnic, > + enum vnic_npevent_type npevt_type) > +{ > + struct netpath *netpath; > + > + VNIC_INFO("%s: processing %s, netpath=%s, carrier=%d\n", > + vnic->config->name, vnic_npevent_str[npevt_type], > + netpath_to_string(vnic, vnic->current_path), > + vnic->carrier); > + > + switch (npevt_type) { > + case VNIC_PRINP_CONNECTED: I still don't understand this business of queueing events in the normal net driver entry points, then redispatching them in this big switch statement. Could you at least add a comment to the code that indicates why you're doing this? This structure makes the driver harder to follow without providing any obvious benefits. Message-ID: Roland Dreier wrote on 11/13/2006 08:45:52 AM: > > Sorry I was not intend to send previous email. Anyway I accidently sent it > > out. What I thought was there would be a problem, if the missed_event > > always return to 1. Then this napi poll would keep forever. > > Well, it's limited by the quota that the net stack gives it, so > there's no possibility of looping forever. However.... > > > How about defer the rotting packets process later? like this: > > that seems like it is still correct. > > > With this patch, I could get NAPI + non scaling code throughput performance > > from 1XXMb/s to 7XXMb/s, anyway there are some other problems I am still > > investigating now. > > But I wonder why it gives you a factor of 4 in performance?? Why does > it make a difference? I would have thought that the rotting packet > situation would be rare enough that it doesn't really matter for > performance exactly how we handle it. > > What are the other problems you're investigating? > > - R. The rotting packet situation consistently happens for ehca driver. The napi could poll forever with your original patch. That's the reason I defer the rotting packet process in next napi poll. It does help the performance from 1XXMb/s to 7XXMb/s, but not as expected 3XXXMb/s. 
With the defer rotting packet process patch, I can see packets out of order problem in TCP layer. Is it possible there is a race somewhere causing two napi polls in the same time? mthca seems to use irq auto affinity, but ehca uses round-robin interrupt.

Thanks
Shirley Ma

From xma at us.ibm.com Tue Nov 14 12:45:42 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 14 Nov 2006 12:45:42 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

Roland,

I think a barrier might be needed when checking the LINK SCHED state, like smp_mb_before_clear_bit() and smp_mb_after_clear_bit(); otherwise the netif_rx_reschedule() for the rotting packet and the next interrupt's netif_rx_schedule() could be running at the same time. If interrupts are delivered round-robin, then packets are going to be out of order in the TCP layer. I will test it out once I have the resources. What do you think?

Thanks
Shirley Ma

From xma at us.ibm.com Tue Nov 14 13:51:51 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 14 Nov 2006 13:51:51 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

Roland,

Ignore my previous email; test_and_set_bit is an atomic operation and already has the memory barrier.

Thanks
Shirley Ma

From xma at us.ibm.com Tue Nov 14 14:35:38 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 14 Nov 2006 14:35:38 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

From the code walkthrough, if deferring the rotting packet process by

    return (missed_event && netif_rx_reschedule(dev, 0));

then the same dev->poll can be added to the per-CPU poll list twice: once from netif_rx_reschedule, once from NAPI returning 1. That might explain packets out of order: when one poll finishes and resets the LINK SCHED bit and the next interrupt runs on another CPU.

Thanks
Shirley Ma

From rdreier at cisco.com Tue Nov 14 15:12:29 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 14 Nov 2006 15:12:29 -0800
Subject: [openib-general] [PATCH] Fix ipath build on ia64
In-Reply-To: <455A06CD.8090205@pathscale.com> (Bryan O'Sullivan's message of "Tue, 14 Nov 2006 10:11:25 -0800")
References: <455A06CD.8090205@pathscale.com>
Message-ID:

Seems fine, with the minor comment:

 > +#ifdef CONFIG_HT_IRQ
 >  case PCI_DEVICE_ID_INFINIPATH_HT:
 >          ipath_init_iba6110_funcs(dd);
 >          break;
 > +#endif
 > +#ifdef CONFIG_PCI_MSI
 >  case PCI_DEVICE_ID_INFINIPATH_PE800:
 >          ipath_init_iba6120_funcs(dd);
 >          break;
 > +#endif

would it make sense to add the ifdef to the struct pci_driver too, so that the probe function doesn't even get called for a device the driver can't handle?

 - R.

From rdreier at cisco.com Tue Nov 14 15:18:23 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 14 Nov 2006 15:18:23 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To: (Shirley Ma's message of "Tue, 14 Nov 2006 12:11:23 -0800")
References:
Message-ID:

    Shirley> The rotting packet situation consistently happens for
    Shirley> ehca driver. The napi could poll forever with your
    Shirley> original patch.
    Shirley> That's the reason I defer the rotting
    Shirley> packet process in next napi poll.

Hmm, I don't see it. In my latest patch, the poll routine does:

    repoll:
        done = 0;
        empty = 0;

        while (max) {
            t = min(IPOIB_NUM_WC, max);
            n = ib_poll_cq(priv->cq, t, priv->ibwc);

            for (i = 0; i < n; ++i) {
                if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
                    ++done;
                    --max;
                    ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
                } else
                    ipoib_ib_handle_tx_wc(dev, priv->ibwc + i);
            }

            if (n != t) {
                empty = 1;
                break;
            }
        }

        dev->quota -= done;
        *budget -= done;

        if (empty) {
            netif_rx_complete(dev);
            if (unlikely(ib_req_notify_cq(priv->cq,
                                          IB_CQ_NEXT_COMP |
                                          IB_CQ_REPORT_MISSED_EVENTS)) &&
                netif_rx_reschedule(dev, 0))
                goto repoll;

            return 0;
        }

        return 1;

so every receive completion will count against the limit set by the variable max. The only way I could see the driver staying in the poll routine for a long time would be if it was only processing send completions, but even that doesn't actually seem bad: the driver is making progress handling completions.

    Shirley> It does help the performance from 1XXMb/s to 7XXMb/s, but
    Shirley> not as expected 3XXXMb/s.

Is that 3xxx Mb/sec the performance you see without the NAPI patch?

    Shirley> With the defer rotting packet process patch, I can see
    Shirley> packets out of order problem in TCP layer. Is it
    Shirley> possible there is a race somewhere causing two napi polls
    Shirley> in the same time? mthca seems to use irq auto affinity,
    Shirley> but ehca uses round-robin interrupt.

I don't see how two NAPI polls could run at once, and I would expect worse effects from them stepping on each other than just out-of-order packets. However, the fact that ehca does round-robin interrupt handling might lead to out-of-order packets just because different CPUs are all feeding packets into the network stack.

 - R.
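[Editorial sketch, not part of the original thread.] The budget argument in the message above can be checked with a small userspace model of the quoted poll routine. This is only a sketch under stated assumptions: struct sim_cq, struct sim_result, sim_poll_cq(), sim_req_notify() and sim_poll() are hypothetical stand-ins rather than kernel or IPoIB APIs, NUM_WC is an arbitrary stand-in for IPOIB_NUM_WC, and netif_rx_reschedule() is assumed to always succeed.

```c
#include <assert.h>

#define NUM_WC 4 /* stand-in for IPOIB_NUM_WC; the value is an assumption */

/* Hypothetical model of a completion queue as plain counters. */
struct sim_cq {
    int pending_rx; /* receive completions currently in the CQ */
    int pending_tx; /* send completions currently in the CQ */
    int late_rx;    /* receives that land just before the CQ is re-armed */
};

struct sim_result {
    int rx_done;   /* receive completions handled in this poll call */
    int tx_done;   /* send completions handled in this poll call */
    int repolls;   /* how many times the "goto repoll" path was taken */
    int more_work; /* return value of the poll routine (1 = poll again) */
};

/* Stand-in for ib_poll_cq(): hands out up to t completions, receives first. */
static int sim_poll_cq(struct sim_cq *cq, int t, int *is_rx)
{
    int n = 0;

    while (n < t && cq->pending_rx > 0) { is_rx[n++] = 1; --cq->pending_rx; }
    while (n < t && cq->pending_tx > 0) { is_rx[n++] = 0; --cq->pending_tx; }
    return n;
}

/* Stand-in for ib_req_notify_cq() with the missed-events hint: the late
 * completions become visible, and a nonzero return means "something is
 * already queued that will not raise a new event". */
static int sim_req_notify(struct sim_cq *cq)
{
    cq->pending_rx += cq->late_rx;
    cq->late_rx = 0;
    return (cq->pending_rx + cq->pending_tx) > 0;
}

/* Same control flow as the quoted poll routine, with driver calls stubbed. */
struct sim_result sim_poll(struct sim_cq *cq, int budget)
{
    struct sim_result r = {0, 0, 0, 0};
    int is_rx[NUM_WC];
    int max = budget;
    int empty, t, n, i;

repoll:
    empty = 0;
    while (max) {
        t = max < NUM_WC ? max : NUM_WC;
        n = sim_poll_cq(cq, t, is_rx);
        for (i = 0; i < n; ++i) {
            if (is_rx[i]) {
                ++r.rx_done;
                --max;       /* only receives count against the budget */
            } else {
                ++r.tx_done; /* sends are handled but not charged */
            }
        }
        assert(max >= 0);
        if (n != t) {
            empty = 1;
            break;
        }
    }

    if (empty) {
        /* netif_rx_reschedule() is assumed to succeed in this model. */
        if (sim_req_notify(cq)) {
            ++r.repolls;
            goto repoll;
        }
        r.more_work = 0;
        return r;
    }
    r.more_work = 1;
    return r;
}
```

In this model a call such as sim_poll(&cq, budget) never handles more than budget receive completions, with or without a repoll, which is the bound described in the message; send completions are drained without being charged.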
From rdreier at cisco.com Tue Nov 14 15:23:19 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 14 Nov 2006 15:23:19 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To: (Shirley Ma's message of "Tue, 14 Nov 2006 14:35:38 -0800")
References:
Message-ID:

    Shirley> From the code walkthrough, if deferring the rotting packet
    Shirley> process by return (missed_event &&
    Shirley> netif_rx_reschedule(dev, 0)); then the same dev->poll can
    Shirley> be added to the per-CPU poll list twice: once from
    Shirley> netif_rx_reschedule, once from NAPI returning 1. That
    Shirley> might explain packets out of order: when one poll
    Shirley> finishes and resets the LINK SCHED bit and the next interrupt
    Shirley> runs on another CPU.

I don't think so. It's completely normal for dev->poll() to return 1 when there's more work to be done, so the networking core will just move the device to the tail of the poll list. So I don't see why it would make a difference if we actually do any work after netif_rx_reschedule() or not.

On the other hand I still don't see why it helps to drop out of the poll routine immediately even though we know there is more work to be done, and the networking stack has told us it could handle more packets.

 - R.

From johnt1johnt2 at gmail.com Tue Nov 14 22:26:50 2006
From: johnt1johnt2 at gmail.com (john t)
Date: Wed, 15 Nov 2006 11:56:50 +0530
Subject: [openib-general] RC v/s UD
Message-ID:

Hi,

I ran the "ibv_rc_pingpong" and "ibv_ud_pingpong" utilities and found that RC gives a BW of 4302 Mbit/sec and UD gives a BW of 2133 Mbit/sec on my setup (2 hosts connected to a switch). So it seems UD is less efficient than RC (it gives almost half the BW of RC). Is this expected? Isn't it true that UD involves less overhead than RC? Are there ways to get maximum BW (close to RC) from UD?

Regards,
John T.

From mst at mellanox.co.il Wed Nov 15 00:21:19 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Nov 2006 10:21:19 +0200
Subject: [openib-general] Fwd: [ANNOUNCE] GIT 1.4.4
Message-ID: <20061115082119.GA17595@mellanox.co.il>

FYI

A usual number of gitweb enhancements. git-blame is probably the most interesting.

--
MST
-------------- next part --------------
An embedded message was scrubbed...
From: "Junio C Hamano"
Subject: [ANNOUNCE] GIT 1.4.4
Date: Wed, 15 Nov 2006 09:43:17 +0200
Size: 29947
URL:

From mst at mellanox.co.il Wed Nov 15 01:20:44 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 15 Nov 2006 11:20:44 +0200
Subject: [openib-general] ifconfig down stuck
Message-ID: <20061115092044.GA18747@mellanox.co.il>

Running nightly tests on the latest IB code from Linus' git tree (backported to 2.6.9), ifconfig down got stuck at some point.

I see a ton of

unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61
unregister_netdevice: waiting for ib0 to become free. Usage count = -61

in dmesg. The sysrq trace is below.

I also noticed that ib_ucm:ib_ucm_cleanup seems to be stuck in mutex_lock - apparently the ib_ucm module was being removed. Not sure whether these are related problems.

SysRq : Show State sibling task PC pid father child younger older init S 000000000000000b 0 1 0 2 (NOTLB) 00000100c7fb1d78 0000000000000002 00000100c7fb1e58 ffffffff8018f019 0000000000000000 00000100c7fb1e58 000000d033715120 0000000300000246 0000010133ba77f0 00000000000007ce Call Trace:{dput+56} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__pollwait+0} {sys_select+820} {system_call+126} migration/0 S 000001000105b7e0 0 2 1 3 (L-TLB) 0000010133b59ec8 0000000000000046 000001000105b7e0 0000001900000076 0000010132979030 0000000000000076 0000010001044a40 0000000000000001 0000010133bab7f0 0000000000000185 Call Trace:{migration_thread+323} {migration_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} ksoftirqd/0 S 0000000000000000 0 3 1 4 2 (L-TLB) 0000010133b5df08 0000000000000046 000001012e2dd030 000000190000007a 0000010133695030 000000000000007a 0000010001044a40 0000000000000000 0000010133bab030 00000000000000e9 Call Trace:{ksoftirqd+0} {ksoftirqd+60} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} migration/1 S 000001000105b7e0 0 4 1 5 3 (L-TLB) 0000010133b5fec8 0000000000000046 000001000105b7e0 0000001900000074 00000101326077f0 0000000000000074 000001000104ca40 0000000100000001 0000010133bac7f0 0000000000000128 Call Trace:{migration_thread+323} {migration_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} ksoftirqd/1 S 0000000000000001 0 5 1 6 4 (L-TLB) 00000100c7f21f08 0000000000000046 000001013362d7f0 0000001900000079 0000010133695030 0000000000000079 000001000104ca40 0000000100000000 0000010133bac030 00000000000000fa Call Trace:{tasklet_action+103} {ksoftirqd+0} {ksoftirqd+60} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} migration/2 S 000001000105b7e0 0 6 1 7 5 (L-TLB) 00000100c7f25ec8 0000000000000046 000001000105b7e0 0000001900000074 0000010132979030 0000000000000074 0000010001054a40 0000000200000001 0000010133b727f0 00000000000001a2 Call Trace:{migration_thread+323} {migration_thread+0} 
{kthread+200} {child_rip+8} {kthread+0} {child_rip+0} ksoftirqd/2 S 0000000000000002 0 7 1 8 6 (L-TLB) 00000100c7f57f08 0000000000000046 0000010133b1b030 0000001900000079 0000010050f5e7f0 0000000000000079 0000010001054a40 0000000200000000 0000010133b72030 0000000000000143 Call Trace:{tasklet_action+103} {ksoftirqd+0} {ksoftirqd+60} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} migration/3 S 00000100010437e0 0 8 1 9 7 (L-TLB) 00000100c7f59ec8 0000000000000046 00000100010437e0 0000001900000074 00000101326077f0 0000000000000074 000001000105ca40 0000000300000001 00000100c7f2f7f0 0000000000000191 Call Trace:{migration_thread+323} {migration_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} ksoftirqd/3 S 0000000000000000 0 9 1 10 8 (L-TLB) 0000010005d8df08 0000000000000046 000001013296d030 0000001900000077 0000010132b1e7f0 0000000000000077 000001000105ca40 0000000300000000 00000100c7f2f030 0000000000000153 Call Trace:{ksoftirqd+0} {ksoftirqd+60} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} events/0 S ffffffff80162f6a 0 10 1 14 11 9 (L-TLB) 0000010133833e68 0000000000000046 ffffffff803d4300 0000000000000246 0000000000000246 0000000000000246 000000010321acaf 0000000000000246 000001013380e7f0 000000000000166a Call Trace:{cache_reap+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {worker_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} events/1 S ffffffff80162f6a 0 11 1 391 12 10 (L-TLB) 0000010133835e68 0000000000000046 0000010133ba7030 000001000104eec0 0000000000000005 0000000000000246 000000010321b13f 0000000100000246 000001013380e030 0000000000000d63 Call Trace:{cache_reap+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {worker_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} events/2 S ffffffff80162f6a 0 12 1 2069 13 11 (L-TLB) 0000010133839e68 0000000000000046 0000010133baa7f0 0000010001056ec0 000000000000000a 
0000000000000246 000000010321ab60 0000000200000246 000001013380f7f0 00000000000012e3 Call Trace:{cache_reap+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {worker_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} events/3 S ffffffff80162f6a 0 13 1 392 48 12 (L-TLB) 000001013383be68 0000000000000046 0000010133baa030 0000000000000000 0000000000000000 0000000000000246 000000010321ae1f 0000000300000246 000001013380f030 0000000000000a70 Call Trace:{cache_reap+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {worker_thread+0} {kthread+200} {child_rip+8} {kthread+0} {child_rip+0} khelper S 0000010021105bd8 0 14 10 15 (L-TLB) 000001013383de68 0000000000000046 00000100010437e0 000000190000006a 0000010026ae3030 000000000000006a 0000010001044a40 0000000000000001 00000101338117f0 000000000000016b Call Trace:{__call_usermodehelper+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kacpid S 0000010005dac040 0 15 10 44 14 (L-TLB) 00000100c7d71e68 0000000000000046 0000000000000000 0000000000000072 0000000000000000 0000000000000001 ffffffff80467880 0000000000000000 0000010133811030 00000000000007fa Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kblockd/0 S ffffffff8024e2d9 0 44 10 45 15 (L-TLB) 00000100c7db1e68 0000000000000046 00000060ffffffff 0000000000000000 000000ac00000246 00000101336babc8 0000010005f41780 0000000005ec40c0 00000100c7d4c7f0 0000000000000266 Call Trace:{blk_unplug_work+0} {worker_thread+226} 
{default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kblockd/1 S 0000010005ec42a0 0 45 10 46 44 (L-TLB) 00000100c7db3e68 0000000000000046 00000060ffffffff 0000001900000074 00000101336d4030 0000000000000074 000001000104ca40 0000000105ec40c0 00000100c7d4c030 0000000000000819 Call Trace:{blk_unplug_work+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kblockd/2 S ffffffff8024e2d9 0 46 10 47 45 (L-TLB) 00000100c7db7e68 0000000000000046 00000060ffffffff 0000000000000000 000000c900000246 00000101336babc8 0000010005f41780 0000000205ec40c0 00000100c7d5d7f0 00000000000006a9 Call Trace:{blk_unplug_work+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0}<0>unregister_netdevice: waiting for ib0 to become free. 
Usage count = -61 {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kblockd/3 S ffffffff8024e2d9 0 47 10 75 46 (L-TLB) 00000100c7db9e68 0000000000000046 00000060ffffffff 0000000000000000 000000a500000246 00000101336babc8 0000010005f41780 0000000305ec40c0 00000100c7d5d030 000000000000025a Call Trace:{blk_unplug_work+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} khubd S 0000000000000000 0 48 1 77 13 (L-TLB) 00000100c7ddbe78 0000000000000046 000001013230c030 000000000000007d 0000000000000000 0000000000000000 ffffffff80467880 0000000000000246 00000100c7da07f0 0000000000000959 Call Trace:{free_pages_bulk+692} {hub_thread+2921} {autoremove_wake_function+0} {autoremove_wake_function+0} {child_rip+8} {hub_thread+0} {child_rip+0} pdflush S ffffffff8014b4f0 0 75 10 76 47 (L-TLB) 00000100c7df1ec8 0000000000000046 0000000000000000 000000000000006a 00000100c7dc57f0 00000000c7d75240 ffffffff80467880 00000000010437e0 00000100c7dc57f0 00000000000004e0 Call Trace:{set_user_nice+259} {keventd_create_kthread+0} {pdflush+191} {pdflush+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} pdflush S ffffffff8014b4f0 0 76 10 78 75 (L-TLB) 0000010005e03ec8 0000000000000046 0000010005d2bc88 000000000003ce19 000000010321b576 00000100c7fb1de8 00000000fffffffc 00000001c7fb1dd8 00000100c7dc5030 00000000000002ac Call Trace:{keventd_create_kthread+0} {pdflush+191} {pdflush+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} aio/0 S 00000100c7d9eec0 0 78 10 79 76 (L-TLB) 0000010005e27e68 0000000000000046 0000000000000000 000000000000006b 0000000000000000 0000000000000008 ffffffff80467880 0000000000000000 00000100c7dc7030 00000000000005d2 Call Trace:{keventd_create_kthread+0} {worker_thread+0} 
{worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kswapd0 S 0000010005e05f08 0 77 1 225 48 (L-TLB) 0000010005e05eb8 0000000000000046 0000010005e05e08 000000000000007d 0000000000000000 0000000000000000 ffffffff80467880 00000000801346b7 00000100c7dc77f0 0000000000000d5c Call Trace:{kswapd+231} {autoremove_wake_function+0} {autoremove_wake_function+0} {child_rip+8} {kswapd+0} {child_rip+0} aio/1 S 00000100c7d9ef40 0 79 10 80 78 (L-TLB) 0000010005e2be68 0000000000000046 0000000000000000 0000000000000000 0000000000000000 0000000000000008 000000000000231e 0000000100000000 00000100c7df97f0 0000000000000cf8 Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} aio/2 S 00000100c7d9efc0 0 80 10 81 79 (L-TLB) 0000010005e2de68 0000000000000046 0000000000000000 0000000000000000 0000000000000000 0000000000000008 000000000000231e 0000000200000000 00000100c7df9030 0000000000000d5a Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} aio/3 S 00000100c7d9f040 0 81 10 1987 80 (L-TLB) 0000010005e2fe68 0000000000000046 0000000000000000 0000000000000000 0000000000000000 0000000000000007 0000000000001f36 0000000300000000 00000100c7dfa7f0 0000000000000c32 Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {__wake_up_common+67} {default_wake_function+0} 
{keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kseriod S 0000000000000000 0 225 1 359 77 (L-TLB) 00000100c7dddec8 0000000000000046 00000100c7ddde18 000001000000007a ffffffff80132215 0000000000000000 ffffffff80467880 00000000c7daf218 00000100c7da0030 0000000000000e74 Call Trace:{deactivate_task+37} {serio_thread+469} {autoremove_wake_function+0} {do_exit+3137} {autoremove_wake_function+0} {child_rip+8} {selinux_d_instantiate+0} {selinux_d_instantiate+0} {selinux_d_instantiate+0} {serio_thread+0} {child_rip+0} scsi_eh_0 S 0000000000573710 0 359 1 400 225 (L-TLB) 0000010005d4ddf8 0000000000000046 00000101336ba800 0000010133b1bbd2 0000000000000018 0000000000000008 000000000000231e 0000000100000246 0000010133b1b7f0 000000000000294f Call Trace:{__down_interruptible+203} {default_wake_function+0} {__down_failed_interruptible+53} {:scsi_mod:.text.lock.scsi_error+45} {do_exit+3137} {child_rip+8} {:scsi_mod:scsi_error_handler+0} {child_rip+0} ata/0 S ffffffff80309800 0 391 11 (L-TLB) 0000010132919e68 0000000000000046 0000000000000000 0000000000000000 0000000000000000 0000000000000009 0000000000002706 0000000000000000 00000101336957f0 0000000000001664 Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ata/1 S 000001013372a900 0 392 13 393 (L-TLB) 0000010132841e68 0000000000000046 0000000100000001 0000001900000075 0000010133ba77f0 0000000000000075 000001000104ca40 0000000100000100 0000010133b457f0 0000000000000e3d Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} 
{keventd_create_kthread+0} {kthread+0} {child_rip+0} ata/2 S ffffffff80309800 0 393 13 394 392 (L-TLB) 0000010132935e68 0000000000000046 0000000100000001 0000000000000002 0000010133b459d0 0000000000000008 000000000000231e 0000000200000001 0000010005f617f0 00000000000010d9 Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ata/3 S 000001013372aa00 0 394 13 393 (L-TLB) 0000010132937e68 0000000000000046 000001000104b7e0 0000001900000086 00000100c7f2f030 0000000000000086 000001000105ca40 0000000333679240 0000010005f61030 0000000000000d9d Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} scsi_eh_1 S 0000010133782900 0 400 1 413 359 (L-TLB) 000001013291bdf8 0000000000000046 0000010133b13000 000001013378abd2 000001013378a7f0 0000000000000006 0000000000001acc 0000000200000246 000001013378a7f0 0000000000000e6a Call Trace:{__down_interruptible+203} {default_wake_function+0} {__down_failed_interruptible+53} {:scsi_mod:.text.lock.scsi_error+45} {do_exit+3137} {child_rip+8} {:scsi_mod:scsi_error_handler+0} {child_rip+0} kjournald D ffffffff80309800 0 413 1 1573 400 (L-TLB) 0000010005fa7e78 ffffffff8030a14d 0000010133baa7f0 0000001900000073 00000101336d4030 0000000000000073 0000010001044a40 000000022d0ec240 0000010005f9b030 000000000000021d Call Trace:{thread_return+88} {del_timer+107} {:jbd:kjournald+250} {autoremove_wake_function+0} {autoremove_wake_function+0} {:jbd:commit_timeout+0} {child_rip+8} {:jbd:kjournald+0} {child_rip+0} udevd S 0000000000000006 0 1573 1 2087 413 (NOTLB) 0000010131e5bd78 0000000000000002 
0000010133baa7f0 00000000000000d2 000000d00000f810 0000000000000246 0000000000000246 000000020000e780 0000010132235030 0000000000001a10 Call Trace:{schedule_timeout+224} {datagram_poll+39} {do_select+939} {__pollwait+0} {sys_select+820} {system_call+126} kauditd S ffffffff8014b4f0 0 1987 10 15262 81 (L-TLB) 00000101315f3ea8 0000000000000046 00000004bc4af1e4 000001013380e7f0 00000100010437e0 0000000000000000 0000000000000000 00000000bc465a6c 00000101329797f0 00000000000009f8 Call Trace:{keventd_create_kthread+0} {kauditd_thread+0} {kauditd_thread+383} {default_wake_function+0} {keventd_create_kthread+0} {default_wake_function+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kmirrord S ffffffff8014b4f0 0 2069 12 15263 (L-TLB) 0000010130999e68 0000000000000046 ffffffff801476af 00000101317b37a8 000000000000000a 0000000000000000 0000000000000008 0000000230998000 0000010005d5e7f0 0000000000001a3b Call Trace:{worker_thread+0} {keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} kjournald S 0000000000000000 0 2087 1 2690 1573 (L-TLB) 00000101304d3e78 0000000000000046 0000010133baa7f0 0000010130663af8 0000010133b66a98 0000000000000005 0000000000000000 00000002000003e8 000001013362d030 0000000000002484 Call Trace:{:jbd:kjournald+506} {autoremove_wake_function+0} {autoremove_wake_function+0} {:jbd:commit_timeout+0} {child_rip+8} {:jbd:kjournald+0} {child_rip+0} dhclient S ffffffff80309800 0 2690 1 2905 2087 (NOTLB) 0000010131151d78 0000000000000002 0000010132839080 ffffffff802a72c8 000001002afdf800 0000000000000246 000000d000000000 0000000000000246 000001013230c7f0 000000000000123c Call Trace:{sock_recvmsg+284} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__pollwait+0} 
{sys_select+820} {dnotify_parent+34} {system_call+126} syslogd D 000001012fd1fef8 0 2905 1 2909 2690 (NOTLB) 000001012fd1fd98 0000000000000006 00000100010537e0 0000000000000002 000001012fd1fd18 ffffffff801321e3 00000100010537e0 0000000000000003 00000101336d4030 0000000000000190 Call Trace:{activate_task+124} {:jbd:log_wait_commit+202} {autoremove_wake_function+0} {autoremove_wake_function+0} {:jbd:journal_stop+536} {:jbd:journal_force_commit+32} {:ext3:ext3_force_commit+36} {__writeback_single_inode+311} {sync_inode+52} {:ext3:ext3_sync_file+191} {sys_fsync+155} {system_call+126} klogd S 000000000000006e 0 2909 1 2920 2905 (NOTLB) 0000010130821be8 0000000000000006 0000010133ba7030 ffffffff80132155 0000000000000080 0000000000000080 00000000000000d0 0000000100000008 000001013229b030 000000000000067f Call Trace:{recalc_task_prio+337} {schedule_timeout+224} {prepare_to_wait_exclusive+21} {unix_wait_for_peer+163} {autoremove_wake_function+0} {autoremove_wake_function+0} {unix_dgram_sendmsg+957} {sock_aio_write+306} {do_sync_write+173} {do_syslog+482} {autoremove_wake_function+0} {autoremove_wake_function+0} {autoremove_wake_function+0} {dnotify_parent+34} {vfs_write+226} {sys_write+69} {system_call+126} irqbalance S 0000000000000000 0 2920 1 2932 2909 (NOTLB) 000001012fe43ee8 0000000000000006 0000000000001000 0000002a9556d000 000001012fd63558 0000002a95600000 0000010130191558 000000019556d000 0000010132af2030 000000000000fe35 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {sys_nanosleep+192} {system_call+126} portmap S 7fffffffffffffff 0 2932 1 2952 2920 (NOTLB) 000001012f89fe88 0000000000000006 000001012f89fe48 0000001c2abb4070 000001012f89fec8 ffffffff00000010 000000d02f89fe48 0000000100000246 000001013230c030 0000000000005875 Call Trace:{schedule_timeout+224} {add_wait_queue+18} {tcp_poll+44} {sys_poll+604} {__pollwait+0} {system_call+126} rpc.statd S 0000000000000007 0 2952 1 3553 2932 (NOTLB) 000001012f839d78 0000000000000006 
0000000000000071 ffffffff802a7143 0000000000000000 000001012f839e58 000000d033788950 0000000300000246 000001013237b030 000000000000974f Call Trace:{sock_sendmsg+271} {schedule_timeout+224} {tcp_poll+44} {do_select+939} {__pollwait+0} {sys_select+820} {system_call+126} rpc.idmapd S 0000000000000000 0 3553 1 3592 2952 (NOTLB) 000001012f4f3e78 0000000000000006 0000010133ba7030 0000000102e0fd5c 0000000000000000 000001012f4f3e88 0000000000000005 0000000100000005 0000010132607030 00000000000007a0 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {sys_epoll_wait+403} {sys_rt_sigaction+97} {default_wake_function+0} {system_call+126} rpciod S 000001012f6ce000 0 3592 1 3593 3553 (L-TLB) 000001012f519ec8 0000000000000046 000001002f962c88 0000000000000001 ffffffffa0247060 ffffffffa01cef37 0000010001044a40 000000003395b800 00000101336d67f0 00000000000004cf Call Trace:{:sunrpc:__rpc_execute+867} {:sunrpc:rpciod+483} {autoremove_wake_function+0} {autoremove_wake_function+0} {child_rip+8} {:sunrpc:rpciod+0} {child_rip+0} lockd S 7fffffffffffffff 0 3593 1 3623 3592 (L-TLB) 000001012f599e18 0000000000000046 0000000000000000 0000000000000246 0000000000000246 ffffffff802ac378 0000000000003700 000000002fcb4780 0000010132b1e030 0000000000004e89 Call Trace:{skb_dequeue+80} {schedule_timeout+224} {:sunrpc:svc_reserve+75} {add_wait_queue+18} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:lockd:lockd+0} {:lockd:lockd+370} {child_rip+8} {:lockd:lockd+0} {:lockd:lockd+0} {child_rip+0} ypbind S 7fffffffffffffff 0 3623 1 3624 3593 (NOTLB) 000001012f037e88 0000000000000002 0000000000000008 000001012f1dec80 000001012fd7bc80 ffffffff802db1b3 000000d000000000 0000000000000246 00000101321b07f0 000000000000c2bc Call Trace:{tcp_rcv_state_process+3789} {schedule_timeout+224} {add_wait_queue+18} {tcp_poll+44} {sys_poll+604} {__pollwait+0} {system_call+126} ypbind S 0000003dd8b0d1c0 0 3624 1 3626 3623 (NOTLB) 000001012f059e48 0000000000000002 
0000000000000000 0000000000000000 0000000000000246 0000000000000003 000001012f059de8 0000000080133e50 0000010132af27f0 000000000000127a Call Trace:{__up_read+16} {do_futex+706} {do_sync_write+173} {schedule_timeout+224} {dequeue_signal+58} {sys_rt_sigtimedwait+487} {sys_futex+203} {sys_write+96} {system_call+126} ypbind D 0000000000000002 0 3626 1 3810 3624 (NOTLB) 000001012f05baa8 0000000000000002 00000101328337f0 0000000000000206 0000000000000001 00000000324da000 0000001000000001 0000000000000246 00000101328337f0 0000000000001828 Call Trace:{__down+147} {default_wake_function+0} {__down_failed+53} {.text.lock.dev+85} {netlink_dump_start+307} {rtnetlink_rcv+861} {netlink_data_ready+22} {netlink_sendskb+113} {netlink_sendmsg+694} {sock_sendmsg+271} {__generic_file_write_nolock+158} {vsnprintf+1406} {autoremove_wake_function+0} {sockfd_lookup+16} {sys_sendto+195} {sys_getsockname+125} {netlink_insert+321} {fd_install+42} {sock_map_fd+59} {system_call+126} smartd S 0000000000000000 0 3810 1 3820 3626 (NOTLB) 000001012f55bee8 0000000000000006 0000010133bbea00 0000010133b96400 000001012b8d01c0 ffffffff80250ebb 0000000000000000 0000000100000246 00000101327b0030 0000000000003027 Call Trace:{blkdev_ioctl+1674} {:sd_mod:scsi_disk_put+81} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {sys_nanosleep+192} {system_call+126} acpid S 0000000000000005 0 3820 1 3898 3810 (NOTLB) 00000101303c9d78 0000000000000006 000001012f221508 0000000000000246 0000010132868580 00000101303c9d30 0000000000000000 000000000000000e 0000010005f9b7f0 000000000001f502 Call Trace:{schedule_timeout+224} {do_select+939} {__pollwait+0} {sys_select+820} {dnotify_parent+34} {system_call+126} sshd S 0000000000000004 0 3898 1 10192 3921 3820 (NOTLB) 000001012e927d78 0000000000000002 0000010133baa7f0 000001000000e780 0000000000000000 00000000000000d2 000000d00000f810 0000000200000246 000001013237b7f0 00000000000066f8 Call Trace:{schedule_timeout+224} {tcp_poll+44} {do_select+939} {__pollwait+0} 
{sys_select+820} {dput+56} {system_call+126} xinetd S 0000000000000009 0 3921 1 3941 3898 (NOTLB) 000001012ea39d78 0000000000000002 0000010133ba7030 0000001900000076 000001003a1e17f0 0000000000000076 000000d001054a40 0000000100000246 000001013229b7f0 00000000000017df Call Trace:{schedule_timeout+224} {tcp_poll+44} {do_select+939} {__pollwait+0} {sys_select+820} {system_call+126} rpc.rquotad S 000001012eb5fcc0 0 3941 1 3958 3921 (NOTLB) 000001012eac7e88 0000000000000002 0000010131d63568 0000001900000074 0000010132979030 0000000000000074 0000010001044a40 0000000000000246 0000010131c56030 000000000001cd61 Call Trace:{schedule_timeout+224} {add_wait_queue+18} {tcp_poll+44} {sys_poll+604} {__pollwait+0} {system_call+126} nfsd S 000001000104b7e0 0 3958 1 3959 3941 (L-TLB) 000001012eb0bde8 0000000000000046 0000010133baa030 0000001900000073 000001013208f7f0 0000000000000073 000001000105ca40 000000032fcb4b00 000001012ef59030 0000000000001b50 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+93} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000001000104b7e0 0 3959 1 3960 3958 (L-TLB) 000001012eb0fde8 0000000000000046 000001012ef59030 0000001900000073 0000010132774030 0000000000000073 000001000105ca40 00000003e2c10000 000001013208f7f0 000000000000106b Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000001000104b7e0 0 3960 1 3961 3959 (L-TLB) 000001012e441de8 0000000000000046 000001013208f7f0 0000001900000073 000001012ef597f0 0000000000000073 000001000105ca40 000000030001e831 0000010132774030 0000000000000e1f Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} 
{default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000001000104b7e0 0 3961 1 3962 3960 (L-TLB) 000001012e437de8 0000000000000046 0000010132774030 0000001900000073 0000010133b17030 0000000000000073 000001000105ca40 0000000300013ebd 000001012ef597f0 0000000000000d68 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000001000104b7e0 0 3962 1 3963 3961 (L-TLB) 000001012eba3de8 0000000000000046 000001012ef597f0 0000001900000073 00000101322ef030 0000000000000073 000001000105ca40 0000000333b17030 0000010133b17030 0000000000000d3e Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000001000104b7e0 0 3963 1 3964 3962 (L-TLB) 000001012e495de8 0000000000000046 0000010133b17030 0000001900000073 00000101325b67f0 0000000000000073 000001000105ca40 00000003322ef030 00000101322ef030 0000000000000d5a Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000001000104b7e0 0 3964 1 3965 3963 (L-TLB) 000001012e467de8 0000000000000046 00000101322ef030 0000001900000073 00000101321b0030 0000000000000073 000001000105ca40 00000003325b67f0 00000101325b67f0 0000000000000d36 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} 
{schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} nfsd S 000000000036ee80 0 3965 1 3969 3964 (L-TLB) 000001012eb91de8 0000000000000046 00000101325b67f0 000000190000007d 00000101322357f0 000000000000007d 000001000105ca40 00000003328c29a8 00000101321b0030 0000000000000d19 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {:sunrpc:svc_recv+786} {default_wake_function+0} {default_wake_function+0} {:nfsd:nfsd+0} {:nfsd:nfsd+381} {schedule_tail+55} {child_rip+8} {:nfsd:nfsd+0} {:nfsd:nfsd+0} {child_rip+0} rpc.mountd S 000001012eb5f7c0 0 3969 1 3999 3965 (NOTLB) 000001012ebfbd78 0000000000000006 00000000000000bf 0000001900000074 000001013296d030 0000000000000074 0000010001044a40 0000000000000246 0000010133b177f0 0000000000021f84 Call Trace:{schedule_timeout+224} {tcp_poll+44} {do_select+939} {__pollwait+0} {sys_select+820} {strncpy_from_user+91} {system_call+126} sendmail S 0000000000000005 0 3999 1 4007 3969 (NOTLB) 000001012e0b7d78 0000000000000006 0000000000000022 00000100b0621015 00000000ffffffff 0000000000000002 000000d030620ffe 0000000100000246 000001013378a030 00000000000020ae Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__pollwait+0} {invalidate_inode_buffers+12} {sys_select+820} {dput+56} {system_call+126} sendmail S 0000000000000000 0 4007 1 4047 3999 (NOTLB) 000001012e1dbf68 0000000000000006 0000000103346adf 0000000000000246 0000000000000246 ffffffff8013fcb4 0000000000000004 000000030036f0a4 0000010132833030 0000000000009ae2 Call Trace:{__mod_timer+293} {do_setitimer+332} {sys_pause+23} {system_call+126} gpm S 00000100010437e0 0 4047 1 4117 4007 (NOTLB) 000001012dda3d78 0000000000000002 00000000000000bf 0000001900000086 0000010133b72030 0000000000000086 0000010001054a40 000000028015ade7 00000100c7dfa030 000000000000db72 Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__pollwait+0} {sys_select+820} {system_call+126} crond S 
0000000000000000 0 4117 1 4140 4047 (NOTLB) 000001012e235ee8 0000000000000002 000001012e235e58 000001012e235ef8 000001012e235ef8 ffffffff80181d23 000000000000080a 0000000300043b57 00000101336d47f0 0000000000000e4f Call Trace:{cp_new_stat+233} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {sys_nanosleep+192} {system_call+126} xfs S 0000000000000004 0 4140 1 4159 4117 (NOTLB) 000001012de31d78 0000000000000002 ffffffff803d4300 0000000000000246 00000101326077f0 0000000000000074 0000010001054a40 0000000000000046 000001012dc5b030 000000000000526c Call Trace:{__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__pollwait+0} {sys_select+820} {system_call+126} atd S 0000000000000000 0 4159 1 4175 4140 (NOTLB) 000001012decbee8 0000000000000006 000000002efdb354 ffffffff8018b1b8 000001012decbf38 00000000a0073ad8 000001012decbf38 000000022decbf38 000001013296d7f0 0000000000002579 Call Trace:{filldir64+0} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {sys_nanosleep+192} {system_call+126} dbus-daemon-1 S 7fffffffffffffff 0 4175 1 4188 4159 (NOTLB) 000001012d857e88 0000000000000002 000001012d857e18 0000000000000246 000001012dc5b7f0 0000000000000000 000001012d857f50 000000022d857f50 000001012ef5a7f0 00000000000a3e3c Call Trace:{schedule_timeout+224} {add_wait_queue+18} {sys_poll+604} {__pollwait+0} {sys_read+69} {system_call+126} cups-config-d S 7fffffffffffffff 0 4188 1 4199 4175 (NOTLB) 000001012d8c9e88 0000000000000006 000001012e3cf1f8 0000000000000246 000001012d8c9e18 0000000000000000 000001012d8c9f50 000000022d8c9f50 00000101336d6030 000000000001cd2c Call Trace:{schedule_timeout+224} {add_wait_queue+18} {sys_poll+604} {__pollwait+0} {system_call+126} hald S 00000000000007d1 0 4199 1 4217 4188 (NOTLB) 000001012d9dbe88 0000000000000006 0000010133ba7030 ffffffff8018bfe0 0040ea5b00008001 0000010132adb688 000000d000000000 0000000100000246 00000101328387f0 0000000000001db1 Call Trace:{sys_poll+777} {__mod_timer+293} 
{schedule_timeout+367} {process_timeout+0} {datagram_poll+0} {sys_poll+604} {__pollwait+0} {system_call+126} python S 0000000000000000 0 4217 1 5216 4365 4199 (NOTLB) 000001012da69e08 0000000000000006 0000000000000202 ffffffff80169144 0000010133678040 000001012e333018 000000000060a2c8 000000022dad3108 000001013208f030 00000000000049ed Call Trace:{do_wp_page+1181} {pipe_wait+128} {autoremove_wake_function+0} {cp_new_stat+233} {autoremove_wake_function+0} {__vma_link+66} {pipe_readv+526} {pipe_read+26} {vfs_read+207} {sys_read+69} {system_call+126} python S 0000000000000000 0 4365 1 4962 4368 4217 (NOTLB) 000001012c62dd78 0000000000000002 000001012c62de58 000001012c62de58 0000000000000000 ffffffff8018599c 00000000fffffffe 0000000080186cff 0000010131c567f0 000000000000150a Call Trace:{path_release+12} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__user_walk+94} {__pollwait+0} {sys_select+820} {system_call+126} agetty S 0000000000000001 0 4368 1 4369 4365 (NOTLB) 0000010132b85d88 0000000000000002 0000000000000075 0000000000000000 0000010137ecc5e8 ffffffff8015ade7 0000000000000001 0000000205d69b18 00000101329267f0 000000000000fb18 Call Trace:{filemap_nopage+384} {uart_start+38} {schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {default_wake_function+0} {tty_read+230} {vfs_read+207} {sys_read+69} {system_call+126} mingetty S 000001013377c0c0 0 4369 1 4370 4368 (NOTLB) 000001012c6b3d88 0000000000000006 0000000d00000075 0000001900000076 00000101322ef7f0 0000000000000076 0000010001044a40 0000000005d69b18 00000101323cd7f0 0000000000090cc0 Call Trace:{complement_pos+12} {complement_pos+12} {schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {__wake_up+54} {default_wake_function+0} {tty_read+230} {vfs_read+207} {sys_read+69} {system_call+126} mingetty S 000001012d4f9280 0 4370 1 4371 4369 (NOTLB) 000001012c1ebd88 0000000000000002 0000000d00000075 000000190000006f 
000001012e2dd030 000000000000006f 0000010001044a40 0000000005d69b18 00000101326977f0 000000000003d013 Call Trace:{schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {__wake_up+54} {default_wake_function+0} {tty_read+230} {vfs_read+207} {sys_read+69} {system_call+126} mingetty S 000001012bf227c0 0 4371 1 4372 4370 (NOTLB) 000001012ee7dd88 0000000000000002 0000000d00000075 0000001900000076 000001012cacc7f0 0000000000000076 0000010001044a40 0000000005d69b18 00000101322ef7f0 0000000000019b8b Call Trace:{schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {__wake_up+54} {default_wake_function+0} {tty_read+230} {vfs_read+207} {sys_read+69} {system_call+126} mingetty S 000001013222f0c0 0 4372 1 4373 4371 (NOTLB) 000001012d941d88 0000000000000002 0000000d00000075 0000001900000076 000001012cacc7f0 0000000000000076 0000010001054a40 0000000205d69b18 000001012e5eb030 000000000001c064 Call Trace:{schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {__wake_up+54} {default_wake_function+0} {tty_read+230} {vfs_read+207} {sys_read+69} {system_call+126} mingetty S 000001013222fcc0 0 4373 1 4374 4372 (NOTLB) 000001012d905d88 0000000000000002 0000000d00000075 0000001900000072 000001012e31a030 0000000000000072 0000010001054a40 0000000205d69b18 000001012cacc7f0 0000000000030f7c Call Trace:{schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {__wake_up+54} {default_wake_function+0} {tty_read+230} {vfs_read+207} {sys_read+69} {system_call+126} mingetty S 0000000000000001 0 4374 1 30324 4373 (NOTLB) 000001012dac9d88 0000000000000002 0000000d00000075 000001012dac9dd8 0000010137ec00ff ffffffff8015ade7 0000000000000001 0000000105d69b18 000001012cacc030 00000000000331d5 Call Trace:{filemap_nopage+384} {schedule_timeout+224} {add_wait_queue+18} {read_chan+1059} {default_wake_function+0} {__wake_up+54} {default_wake_function+0} {tty_read+230} {vfs_read+207} 
{sys_read+69} {system_call+126} python S ffffffff80309800 0 4962 4365 10190 (NOTLB) 000001012bbe3d78 0000000000000002 000001012bbe3e58 000001012bbe3e58 0000000000000000 ffffffff8018599c 00000000fffffffe 0000000080186cff 000001012e2dd030 0000000000004440 Call Trace:{path_release+12} {__mod_timer+293} {schedule_timeout+367} {process_timeout+0} {do_select+939} {__user_walk+94} {__pollwait+0} {sys_select+820} {system_call+126} cupsd D 0000000000000002 0 30324 1 4314 4374 (NOTLB) 00000100ada0daa8 0000000000000002 0000010132926030 0000000000000202 0000000000000001 0000000080161936 0000001000000001 0000000000000246 0000010132926030 0000000000001c1f Call Trace:{__down+147} {default_wake_function+0} {__down_failed+53} {.text.lock.dev+85} {netlink_dump_start+307} {rtnetlink_rcv+861} {netlink_data_ready+22} {netlink_sendskb+113} {netlink_sendmsg+694} {sock_sendmsg+271} {del_timer+107} {vsnprintf+1406} {autoremove_wake_function+0} {sockfd_lookup+16} {sys_sendto+195} {sys_getsockname+125} {netlink_insert+321} {fd_install+42} {sock_map_fd+59} {system_call+126} ib_cm/0 S ffffffffa00f5b5b 0 15262 10 15264 1987 (L-TLB) 000001005fc95e68 0000000000000046 ffffffff803d4300 0000001900000069 000001013380e7f0 0000000000000246 0000000018664492 0000000000000206 00000100243cf7f0 0000000000000132 Call Trace:{:ib_cm:cm_work_handler+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ib_cm/1 S ffffffffa00f5b5b 0 15263 12 15265 2069 (L-TLB) 0000010028807e68 0000000000000046 0000010133ba7030 0000000000000212 0000010028807dd8 0000000000000246 00000000186644be 0000000100000206 000001009e8aa7f0 00000000000001a6 Call Trace:{:ib_cm:cm_work_handler+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} 
{keventd_create_kthread+0} {kthread+0} {child_rip+0} ib_cm/2 S ffffffffa00f5b5b 0 15264 10 15289 15262 (L-TLB) 00000100a5cc9e68 0000000000000046 0000010133baa7f0 0000000000000206 00000000ffffff98 0000000000000246 0000000018664493 0000000200000206 00000100af56f7f0 00000000000001ee Call Trace:{:ib_cm:cm_work_handler+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ib_cm/3 S ffffffffa00f5b5b 0 15265 12 15286 15263 (L-TLB) 0000010041b45e68 0000000000000046 0000010133baa030 0000000000000297 0000010041b45dd8 0000000000000246 000000001866909f 0000000300000206 0000010048d027f0 000000000000033e Call Trace:{:ib_cm:cm_work_handler+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} mthcacatas S ffffffff80309800 0 15286 12 15329 15265 (L-TLB) 0000010021105ce8 0000000000000046 0000010133baa7f0 ffffffffa02b8910 0000010069648000 0000000000000006 000000010321bb00 0000000269648000 0000010026ae3030 00000000000000db Call Trace:{__mod_timer+293} {:ib_mthca:catas_reset+0} {schedule_timeout+367} {process_timeout+0} {netdev_run_todo+515} {:ib_ipoib:ipoib_remove_one+100} {:ib_core:ib_unregister_device+90} {:ib_mthca:__mthca_remove_one+48} {:ib_mthca:__mthca_restart_one+23} {:ib_mthca:catas_reset+198} {worker_thread+419} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ib_mad1 S ffffffffa00e31db 0 15289 10 15290 15264 (L-TLB) 00000100be61de68 0000000000000046 00000100be61ddc8 ffffffffa00f03c0 00000100be61de48 0000000000000206 0000000000000202 0000000200000206 000001012e2dd7f0 
00000000000002d3 Call Trace:{:ib_mad:timeout_sends+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ib_mad2 S ffffffffa00e2911 0 15290 10 15471 15289 (L-TLB) 000001003c963e68 0000000000000046 00000100c7d914e0 000001009ed377c0 0000000000000212 000001003c963e08 00000100c7d91400 00000000a00e2e7a 00000100a0f607f0 00000000000000b5 Call Trace:{:ib_mad:ib_mad_completion_handler+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ipoib S ffffffff80309800 0 15329 12 15466 15286 (L-TLB) 0000010034f87e68 0000000000000046 ffffffff803d4300 0000000000000000 0000ffff1b4012ff ffffffff00000000 00000000000080fe 00000000abc90200 00000100a9988030 000000000000010a Call Trace:{:ib_ipoib:ipoib_reap_ah+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} ib_addr_wq S ffffffff80309800 0 15466 12 15329 (L-TLB) 000001002edcfe68 0000000000000046 ffffffff803d4300 0000001900000073 00000101336d4030 0000000000000073 0000010001044a40 0000000000000000 0000010129cad7f0 00000000000000d3 Call Trace:{:ib_addr:process_req+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} iw_cm_wq S 0000010032d0ec40 0 15471 10 15481 15290 (L-TLB) 00000100c6473e68 0000000000000046 00000100010537e0 0000001900000074 0000010133ba77f0 0000000000000074 0000010001044a40 0000000000000000 00000100326b5030 00000000000017e3 Call 
Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} rdma_cm_wq S 0000000000000002 0 15481 10 15486 15471 (L-TLB) 0000010060f0fe68 0000000000000046 0000000000000001 0000001900000075 0000010133ba77f0 0000000000000075 0000010001044a40 0000000080467658 000001002e97c7f0 0000000000000d04 Call Trace:{keventd_create_kthread+0} {worker_thread+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} sdp S ffffffffa0355248 0 15486 10 15481 (L-TLB) 0000010060e4be68 0000000000000046 0000000000000000 00000100326b57f0 ffffffff80135752 0000010060e4bdc0 0000010060e4bdc0 00000002bd30e000 00000100326b57f0 00000000000004ad Call Trace:{autoremove_wake_function+0} {:ib_sdp:sdp_destroy_work+0} {worker_thread+226} {default_wake_function+0} {default_wake_function+0} {keventd_create_kthread+0} {worker_thread+0} {keventd_create_kthread+0} {kthread+200} {child_rip+8} {keventd_create_kthread+0} {kthread+0} {child_rip+0} modprobe D 0000000000000000 0 4314 1 9803 30324 (NOTLB) 00000100268f3de8 0000000000000002 000001009e8aa030 0000010050f4e040 000001009e8aa030 000001000104b7e0 000001000105b7e0 000000018030a0f5 000001009e8aa030 00000000000012a9 Call Trace:{__down+147} {default_wake_function+0} {__down_failed+53} {:ib_core:.text.lock.device+65} {:ib_ucm:ib_ucm_cleanup+13} {sys_delete_module+479} {__up_write+20} {sys_munmap+94} {system_call+126} ifconfig D 0000000000000000 0 5216 4217 (NOTLB) 00000100196c5d48 0000000000000002 fffffffffffffff4 fffffffffffffff4 000001001f4eea98 000001001ae4c618 000000012f7cf8b8 00000000196c5d48 000001012e5eb7f0 00000000002913b4 Call Trace:{do_page_fault+575} {__down+147} 
{default_wake_function+0} {__down_failed+53} {.text.lock.dev+85} {devinet_ioctl+1671} {inet_ioctl+124} {sock_ioctl+699} {sys_ioctl+853} {system_call+126} ifconfig D 0000000000000000 0 9803 1 10123 4314 (NOTLB) 0000010037121d48 0000000000000002 fffffffffffffff4 fffffffffffffff4 00000100b0206ad8 00000100606fd618 000000012f7cf8b8 0000000037121d48 0000010112e2d030 0000000000169063 Call Trace:{do_page_fault+575} {__down+147} {default_wake_function+0} {__down_failed+53} {.text.lock.dev+85} {devinet_ioctl+1671} {inet_ioctl+124} {sock_ioctl+699} {sys_ioctl+853} {system_call+126} ifconfig D 0000000000000000 0 10123 1 9803 (NOTLB) 0000010014d2fd48 0000000000000002 fffffffffffffff4 fffffffffffffff4 00000100798c1048 0000010037c4c618 000000012f7cf8b8 0000000014d2fd48 000001008ffc17f0 000000000011d9f1 Call Trace:{do_page_fault+575} {__down+147} {default_wake_function+0} {__down_failed+53} {.text.lock.dev+85} {devinet_ioctl+1671} {inet_ioctl+124} {sock_ioctl+699} {sys_ioctl+853} {system_call+126} ps X 000001008b197f18 0 10190 4962 (L-TLB) 000001008b197ef8 0000000000000046 000001013291ec80 0000010112e2d7f0 0000000000000046 ffffffff80142adf 0000000000000011 0000000000040001 0000010112e2d7f0 00000000016c9ead Call Trace:{do_notify_parent+382} {filp_close+103} {__cond_resched+23} {do_exit+3137} {vfs_write+248} {sys_exit_group+0} {system_call+126} sshd S ffffffff80309800 0 10192 3898 10194 (NOTLB) 000001003ec8bd78 0000000000000006 000001012e8573b0 0000000000000246 0000000000000246 ffffffff802d1bd9 000000d032979030 0000000000000246 00000100af56f030 0000000000001ddc Call Trace:{tcp_sendmsg+3755} {tty_ldisc_try+60} {schedule_timeout+224} {tty_ldisc_deref+103} {do_select+939} {__pollwait+0} {sys_select+820} {dnotify_parent+34} {system_call+126} bash R running task 0 10194 10192 (NOTLB) -- MST From mst at mellanox.co.il Wed Nov 15 01:29:46 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Wed, 15 Nov 2006 11:29:46 +0200 Subject: [openib-general] RC v/s UD In-Reply-To: References: Message-ID: <20061115092946.GA19365@mellanox.co.il> With RC, you can send larger messages and hardware will perform the fragmentation. BTW, ibv_rc_pingpong/ibv_ud_pingpong are pingpong examples - they do not attempt to measure maximum streaming bandwidth. Look under the perftests directory for some benchmarks: e.g. send_bw can measure bandwidth with sends. Note that with RC, RDMA and SEND bandwidth can also mean different things. Quoting r. john t : Subject: RC v/s UD Hi, I ran "ibv_rc_pingpong" and "ibv_ud_pingpong" utilities and found that RC gives a BW of 4302 Mbit/sec and UD gives a BW of 2133 Mbit/sec on my setup (2 hosts connected to a switch). So it seems UD is less efficient than RC (it gives almost half the BW of RC). Is this expected? Isn't it true that UD involves less overhead than RC? Are there ways to get maximum BW (close to RC) from UD? Regards, John T. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- MST From mst at mellanox.co.il Wed Nov 15 01:43:48 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Nov 2006 11:43:48 +0200 Subject: [openib-general] [PATCH] IB/ucm: fix deadlock in cleanup Message-ID: <20061115094348.GA19619@mellanox.co.il> ib_ucm_cleanup_events holds file_mutex while calling ib_destroy_cm_id. It seems this can deadlock since ib_destroy_cm_id flushes event handlers, and ib_ucm_event_handler needs file_mutex, too. Signed-off-by: Michael S. Tsirkin --- I'll be testing the following over the next night - but it seems the right thing to do regardless of whether it fixes the issues I reported earlier. Sean, does this make sense to you? If yes, please ack for 2.6.19.
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index ad4f4d5..0128288 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -161,12 +161,14 @@ static void ib_ucm_cleanup_events(struct struct ib_ucm_event, ctx_list); list_del(&uevent->file_list); list_del(&uevent->ctx_list); + mutex_unlock(&ctx->file->file_mutex); /* clear incoming connections. */ if (ib_ucm_new_cm_id(uevent->resp.event)) ib_destroy_cm_id(uevent->cm_id); kfree(uevent); + mutex_lock(&ctx->file->file_mutex); } mutex_unlock(&ctx->file->file_mutex); } -- MST From mst at mellanox.co.il Wed Nov 15 02:18:30 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Wed, 15 Nov 2006 12:18:30 +0200 Subject: [openib-general] ifconfig down stuck In-Reply-To: <20061115092044.GA18747@mellanox.co.il> References: <20061115092044.GA18747@mellanox.co.il> Message-ID: <20061115101830.GA21384@mellanox.co.il> OK, what seems to have happened was that we had a catastrophic error on mthca, which triggered a hotplug remove. That, in turn, is waiting for net device ref count to get to 0. But is it normal that the ref count is negative? Quoting r. Michael S. Tsirkin : Subject: [openib-general] ifconfig down stuck running nightly tests on latest IB code from Linus' git tree (backported to 2.6.9), ifconfig down got stuck at some point. I see a ton of unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. Usage count = -61 unregister_netdevice: waiting for ib0 to become free. 
Usage count = -61 in dmesg -- MST From arne.redlich at xiranet.com Wed Nov 15 03:43:00 2006 From: arne.redlich at xiranet.com (Arne Redlich) Date: Wed, 15 Nov 2006 12:43:00 +0100 Subject: [openib-general] [PATCH] IB/SRP - increase supported CDB size Message-ID: <87ac2t0wcb.fsf@confield.dd.xiranet.com> This patch raises the Scsi_Host's max_cmd_len from 12 (the default) to 16. Otherwise scsi_dispatch_cmd() won't pass down certain commands such as READ CAPACITY 16, which is required for supporting disks > 2TB. Signed-off-by: Arne Redlich diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4b09147..01776c9 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1716,7 +1716,8 @@ static ssize_t srp_create_target(struct if (!target_host) return -ENOMEM; - target_host->max_lun = SRP_MAX_LUN; + target_host->max_lun = SRP_MAX_LUN; + target_host->max_cmd_len = SRP_MAX_CMD_LEN; target = host_to_target(target_host); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index d4e35ef..8503af0 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -57,6 +57,7 @@ enum { SRP_MAX_LUN = 512, SRP_DEF_SG_TABLESIZE = 12, + SRP_MAX_CMD_LEN = 16, SRP_RQ_SHIFT = 6, SRP_RQ_SIZE = 1 << SRP_RQ_SHIFT, From dotanb at dev.mellanox.co.il Wed Nov 15 03:45:43 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Wed, 15 Nov 2006 13:45:43 +0200 (IST) Subject: [openib-general] RC v/s UD In-Reply-To: <20061115092946.GA19365@mellanox.co.il> References: <20061115092946.GA19365@mellanox.co.il> Message-ID: <10308.194.90.237.34.1163591143.squirrel@dev.mellanox.co.il> One more thing: Did you use the default message size of the tests? In ibv_ud_pingpong the default message size is 2K. In ibv_rc_pingpong the default message size is 4K. So 2 posts and 2 completions were handled in UD for every 1 post and 1 completion in RC ...
Dotan > With RC, you can send larger messages and hardware will perform the > fragmentation. > BTW, ibv_rc_pingpong/ibv_ud_pingpong are pingpong examples - they do not > attempt to measure maximum streaming bandwidth. > > Look under the perftests directory for some benchmarks: > e.g. send_bw can measure bandwidth with sends. > > Note that with RC, RDMA and SEND bandwidth can also mean different things. > > > Quoting r. john t : > Subject: RC v/s UD > > Hi, > > I ran "ibv_rc_pingpong" and "ibv_ud_pingpong" utilities and found that RC > gives > a BW of 4302 Mbit/sec and UD gives a BW of 2133 Mbit/sec on my setup (2 > hosts > connected to a switch). So it seems UD is less efficient than RC (it gives > almost > half the BW of RC). Is this expected? Isn't it true that UD > involves > less overhead than RC? > > Are there ways to get maximum BW (close to RC) from UD? > > Regards, > John T. > > -- > MST > From johnt1johnt2 at gmail.com Wed Nov 15 05:10:25 2006 From: johnt1johnt2 at gmail.com (john t) Date: Wed, 15 Nov 2006 18:40:25 +0530 Subject: [openib-general] RC v/s UD In-Reply-To: <10308.194.90.237.34.1163591143.squirrel@dev.mellanox.co.il> References: <20061115092946.GA19365@mellanox.co.il> <10308.194.90.237.34.1163591143.squirrel@dev.mellanox.co.il> Message-ID: One more thing: Did you use the default message size of the tests?
> In ibv_ud_pingpong the default message size is 2K > In ibv_rc_pingpong the default message size is 4K > > so, 2 posts and 2 completions were handled in UD for every 1 post and 1 > completion in RC ... I used the default message size. ibv_rc_pingpong with the message size set to 2K gives the same reading as ibv_ud_pingpong, and gives better results as the message size increases. So does it mean that a larger number of posts and completions hurts performance? Is there a way to minimize the number of posts/completions in UD? send_bw shows that UD and RC give almost the same performance with n = 1000 iterations, but for a few iterations (say 2) UD is good. Basically I am doing some experiments with broadcast, and my readings show that for large data sizes the performance is not good. Given that the switch is able to route at a high rate, I think the reason for the low performance boils down to UD not being able to handle messages larger than 2K. Is it possible to have something like RDMA for UD, and hence for broadcast (although the IB spec does not support it), or to have the hardware do the fragmentation for UD if the specified size is more than 2K? Regards, John T. From monis at voltaire.com Wed Nov 15 08:11:58 2006 From: monis at voltaire.com (Moni Shoua) Date: Wed, 15 Nov 2006 18:11:58 +0200 Subject: [openib-general] Add module params to mthca to control the HCA profile In-Reply-To: <20061115094348.GA19619@mellanox.co.il> References: <20061115094348.GA19619@mellanox.co.il> Message-ID: <455B3C4E.9030605@voltaire.com> Hi, A few months ago, Leonid Arsh submitted a patch to mthca that makes it possible to control some of the HCA profile values. This patch was discussed here (see references below) but wasn't accepted and somehow got lost, so I'd like to re-submit it.
http://openib.org/pipermail/openib-general/2006-May/021821.html http://openib.org/pipermail/openib-general/2006-May/022424.html From dotanb at dev.mellanox.co.il Wed Nov 15 08:16:38 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Wed, 15 Nov 2006 18:16:38 +0200 (IST) Subject: [openib-general] RC v/s UD In-Reply-To: References: <20061115092946.GA19365@mellanox.co.il> <10308.194.90.237.34.1163591143.squirrel@dev.mellanox.co.il> Message-ID: <13560.194.90.237.34.1163607398.squirrel@dev.mellanox.co.il> if you want to check performance i suggest you to check the ib_*_lat ib_*_bw tests and not the pingpong tests which were written as an examples .. Dotan > One more thing: Did you use the default message size of the tests? >> In ibv_ud_pingpong the default message size is 2K >> In ibv_rc_pingpong the default message size is 4K >> >> so, 2 posts and 2 completions where handled in UD for every 1 post and 1 >> completion in RC ... > > > > I used default message size. ibv_rc_pingpong with message size set to 2K > gives same reading as ibv_ud_pingpong and with increasing message size > gives better results. > > So does it mean that more number of posts and completions hurt the > perforrance?? Is there a way to minimize number of posts/completions in UD > ?? > > send_bw shows that UD and RC give almost same performance when n = 1000 > iterations but for few iterations (say 2) UD is good. > > Basically I am doing some experiments with Broadcast and my readings show > that for large data sizes the performance is not good. Given that switch > is > able to route at high rate, I think the reason for low performance boils > down to UD being not able to handle more then 2K message size. Is it > possible to have something like RDMA for UD and hence for > broadcast (although IB spec does not suppot it) or have hardware do the > fragmentation for UD if size specified is more then 2K ?? > > Regards, > John T. 
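The 2-to-1 ratio of posts and completions mentioned above follows from simple arithmetic. Below is a minimal sketch in C, assuming each UD send is capped at a fixed payload size (2K in this thread) and that the application splits larger buffers itself. ud_posts_needed is a hypothetical helper for illustration only; it is not part of libibverbs or the pingpong examples.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of send work requests (and hence
 * completions) needed to move `len` bytes when a single send may
 * carry at most `max_payload` bytes.
 */
static size_t ud_posts_needed(size_t len, size_t max_payload)
{
    if (max_payload == 0)
        return 0;   /* invalid cap: nothing can be sent */

    /* Ceiling division written so that it cannot overflow. */
    return len / max_payload + (len % max_payload != 0);
}
```

With a 2K cap, a 4K transfer needs ud_posts_needed(4096, 2048) = 2 posts, while an RC sender can hand the same 4K to the hardware in a single post.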
> From monis at voltaire.com Wed Nov 15 08:34:53 2006 From: monis at voltaire.com (Moni Shoua) Date: Wed, 15 Nov 2006 18:34:53 +0200 Subject: [openib-general] Add module params to mthca to control the HCA profile In-Reply-To: <455B3C4E.9030605@voltaire.com> References: <20061115094348.GA19619@mellanox.co.il> <455B3C4E.9030605@voltaire.com> Message-ID: <455B41AD.40207@voltaire.com> Moni Shoua wrote: >Hi, >A few months ago, Leonid Arsh submitted a patch to mthca that enables to >control some of the HCA profile values. >This patch was discussed here (see references below) but wasn't accepted >and somehow got lost and I'd like to re-submit it. > >http://openib.org/pipermail/openib-general/2006-May/021821.html >http://openib.org/pipermail/openib-general/2006-May/022424.html > > >_______________________________________________ >openib-general mailing list >openib-general at openib.org >http://openib.org/mailman/listinfo/openib-general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > > Sorry, submitted to the wrong place From monis at voltaire.com Wed Nov 15 08:37:04 2006 From: monis at voltaire.com (Moni Shoua) Date: Wed, 15 Nov 2006 18:37:04 +0200 Subject: [openib-general] Add module params to mthca to control the HCA profile Message-ID: <455B4230.6070101@voltaire.com> Hi, A few months ago, Leonid Arsh submitted a patch to mthca that enables to control some of the HCA profile values. This patch was discussed here (see references below) but wasn't accepted and somehow got lost and I'd like to re-submit it. 
http://openib.org/pipermail/openib-general/2006-May/021821.html http://openib.org/pipermail/openib-general/2006-May/022424.html MoniS From monis at voltaire.com Wed Nov 15 08:39:39 2006 From: monis at voltaire.com (Moni Shoua) Date: Wed, 15 Nov 2006 18:39:39 +0200 Subject: [openib-general] [PATCH] IB/mthca: HCA profile module parameters In-Reply-To: <455B4230.6070101@voltaire.com> References: <455B4230.6070101@voltaire.com> Message-ID: <455B42CB.7030008@voltaire.com> From: Leonid Arsh Adds module parameters that enable settting some of the HCA profile values. Signed-off-by: Leonid Arsh Signed-off-by: Moni Shoua --- mthca_main.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 101 insertions(+), 3 deletions(-) --- mthca_main.c.orig 2006-11-14 22:07:58.000000000 -0500 +++ mthca_main.c 2006-11-15 09:42:30.151093815 -0500 @@ -80,9 +80,6 @@ module_param(tune_pci, int, 0444); MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero"); -static const char mthca_version[] __devinitdata = - DRV_NAME ": Mellanox InfiniBand HCA driver v" - DRV_VERSION " (" DRV_RELDATE ")\n"; static struct mthca_profile default_profile = { .num_qp = 1 << 16, @@ -96,6 +93,103 @@ .uarc_size = 1 << 18, /* Arbel only */ }; +module_param_named(num_qp, default_profile.num_qp, int, 0444); +MODULE_PARM_DESC(num_qp, "maximum number of available QPs per HCA"); + +module_param_named(rdb_per_qp, default_profile.rdb_per_qp, int, 0444); +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); + +module_param_named(num_cq, default_profile.num_cq, int, 0444); +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); + +module_param_named(num_mcg, default_profile.num_mcg, int, 0444); +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); + +module_param_named(num_mpt, default_profile.num_mpt, int, 0444); +MODULE_PARM_DESC(num_mpt, + "maximum number of memory protection pable entries per HCA"); + 
+module_param_named(num_mtt, default_profile.num_mtt, int, 0444); +MODULE_PARM_DESC(num_mtt, + "maximum number of memory translation table segments per HCA"); +/* Tavor only */ +module_param_named(num_udav, default_profile.num_udav, int, 0444); +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA"); + +/* Tavor only */ +module_param_named(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, int, 0444); +MODULE_PARM_DESC(fmr_reserved_mtts, + "number of memory translation table segments reserved for FMR"); + +static const char mthca_version[] __devinitdata = + DRV_NAME ": Mellanox InfiniBand HCA driver v" + DRV_VERSION " (" DRV_RELDATE ")\n"; + +#define is_power_of_2(x) (x>0 &&(x & (x - 1))) +#define to_up_power_of_2(x) (x = roundup_pow_of_two(x)) +static int __devinit mthca_validate_profile(struct mthca_dev *mdev, + struct mthca_profile *profile) +{ + if (!is_power_of_2(default_profile.num_qp)){ + to_up_power_of_2(default_profile.num_qp); + mthca_warn(mdev, "num_qp rounded to power of 2 (%d).\n", + default_profile.num_qp); + } + + if (!is_power_of_2(default_profile.rdb_per_qp)){ + to_up_power_of_2(default_profile.rdb_per_qp); + mthca_warn(mdev, "rdb_per_qp rounded to power of 2 (%d)\n", + default_profile.rdb_per_qp); + } + + if (!is_power_of_2(default_profile.num_cq)){ + to_up_power_of_2(default_profile.num_cq); + mthca_warn(mdev, "num_cq rounded to power of 2 (%d)\n", + default_profile.num_cq); + } + + if (!is_power_of_2(default_profile.num_mcg)){ + to_up_power_of_2(default_profile.num_mcg); + mthca_warn(mdev, "num_mcg rounded to power of 2 (%d)\n", + default_profile.num_mcg); + } + if (!is_power_of_2(default_profile.num_mpt)){ + to_up_power_of_2(default_profile.num_mpt); + mthca_warn(mdev, "num_mpt rounded to power of 2 (%d)\n", + default_profile.num_mpt); + } + + if (!is_power_of_2(default_profile.num_mtt)){ + to_up_power_of_2(default_profile.num_mtt); + mthca_warn(mdev, "num_mtt rounded to power of 2 (%d)\n", + default_profile.num_mtt); + } 
+ + if (mthca_is_memfree(mdev)) { + if (!is_power_of_2(default_profile.num_udav)){ + to_up_power_of_2(default_profile.num_udav); + mthca_warn(mdev, "num_udav rounded to power of 2 (%d)\n", + default_profile.num_udav); + } + + if (!is_power_of_2(default_profile.fmr_reserved_mtts)){ + to_up_power_of_2(default_profile.fmr_reserved_mtts); + mthca_warn(mdev, "fmr_reserved_mtts rounded to power of 2 (%d)\n", + default_profile.fmr_reserved_mtts); + } + if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) { + mthca_err(mdev, + "Invalid fmr_reserved_mtts parameter value (%d). " + "Must be lower then num_mtt (%d)\n", + default_profile.fmr_reserved_mtts, + default_profile.num_mtt ); + return -EINVAL; + } + } + + return 0; +} + static int __devinit mthca_tune_pci(struct mthca_dev *mdev) { int cap; @@ -1095,6 +1189,10 @@ if (err) goto err_cmd; + err = mthca_validate_profile(mdev, &default_profile); + if (err) + goto err_cmd; + err = mthca_init_hca(mdev); if (err) goto err_cmd; From arthur.jones at qlogic.com Wed Nov 15 09:21:06 2006 From: arthur.jones at qlogic.com (Arthur Jones) Date: Wed, 15 Nov 2006 09:21:06 -0800 Subject: [openib-general] [PATCH] IB/mthca: HCA profile module parameters In-Reply-To: <455B42CB.7030008@voltaire.com> References: <455B4230.6070101@voltaire.com> <455B42CB.7030008@voltaire.com> Message-ID: <20061115172106.GA18018@bauxite.pathscale.com> hi moni, ... On Wed, Nov 15, 2006 at 06:39:39PM +0200, Moni Shoua wrote: > From: Leonid Arsh > [...] > +#define is_power_of_2(x) (x>0 &&(x & (x - 1))) this is named funny (should be is_not_power_of_2?)... arthur From xma at us.ibm.com Wed Nov 15 10:13:15 2006 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 15 Nov 2006 10:13:15 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: Message-ID: Roland Dreier wrote on 11/14/2006 03:18:23 PM: > Shirley> The rotting packet situation consistently happens for > Shirley> ehca driver. 
The napi could poll forever with your > Shirley> original patch. That's the reason I defer the rotting > Shirley> packet process in next napi poll. > > Hmm, I don't see it. In my latest patch, the poll routine does: > > repoll: > done = 0; > empty = 0; > > while (max) { > t = min(IPOIB_NUM_WC, max); > n = ib_poll_cq(priv->cq, t, priv->ibwc); > > for (i = 0; i < n; ++i) { > if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) { > ++done; > --max; > ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); > } else > ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); > } > > if (n != t) { > empty = 1; > break; > } > } > > dev->quota -= done; > *budget -= done; > > if (empty) { > netif_rx_complete(dev); > if (unlikely(ib_req_notify_cq(priv->cq, > IB_CQ_NEXT_COMP | > IB_CQ_REPORT_MISSED_EVENTS)) && > netif_rx_reschedule(dev, 0)) > goto repoll; > > return 0; > } > > return 1; > > so every receive completion will count against the limit set by the > variable max. The only way I could see the driver staying in the poll > routine for a long time would be if it was only processing send > completions, but even that doesn't actually seem bad: the driver is > making progress handling completions. What I have found in ehca driver, n! = t, does't mean it's empty. If poll again, there are still some packets in cq. IB_CQ_REPORT_mISSED_EVENTS most of the time reports 1. It relies on netif_rx_reschedule() returns 0 to exit napi poll. That might be the reason in poll routine for a long time? I will rerun my test to use n! = 0 to see any difference here. > > Shirley> It does help the performance from 1XXMb/s to 7XXMb/s, but > Shirley> not as expected 3XXXMb/s. > > Is that 3xxx Mb/sec the performance you see without the NAPI patch? Without NAPI patch, in my test environment ehca can gain around 2800Mb to 3000Mb/s throughput. > Shirley> With the defer rotting packet process patch, I can see > Shirley> packets out of order problem in TCP layer. 
Is it > Shirley> possible there is a race somewhere causing two napi polls > Shirley> in the same time? mthca seems to use irq auto affinity, > Shirley> but ehca uses round-robin interrupt. > > I don't see how two NAPI polls could run at once, and I would expect > worse effects from them stepping on each other than just out-of-order > packets. However, the fact that ehca does round-robin interrupt > handling might lead to out-of-order packets just because different > CPUs are all feeding packets into the network stack. > > - R. Normally for NAPI there should be only one running at a time. And NAPI process packet all the way to TCP layer by processing packet one by one (netif_receive_skb()). So it shouldn't lead to out-of-packets even for round-robin interrupt handling in NAPI. I am still investing this. Thanks Shirley -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Nov 15 10:32:46 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Nov 2006 10:32:46 -0800 Subject: [openib-general] Add module params to mthca to control the HCA profile In-Reply-To: <455B3C4E.9030605@voltaire.com> (Moni Shoua's message of "Wed, 15 Nov 2006 18:11:58 +0200") References: <20061115094348.GA19619@mellanox.co.il> <455B3C4E.9030605@voltaire.com> Message-ID: > This patch was discussed here (see references below) but wasn't > accepted and somehow got lost and I'd like to re-submit it. OK... so please resubmit the patch... From rdreier at cisco.com Wed Nov 15 10:36:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Nov 2006 10:36:31 -0800 Subject: [openib-general] Add module params to mthca to control the HCA profile In-Reply-To: (Roland Dreier's message of "Wed, 15 Nov 2006 10:32:46 -0800") References: <20061115094348.GA19619@mellanox.co.il> <455B3C4E.9030605@voltaire.com> Message-ID: > OK... so please resubmit the patch... Sorry, I see the patch now. I will review soon. 
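
For reference while the patch is under review: as Arthur Jones noted, the is_power_of_2() macro in the submitted patch is true for values that are NOT powers of two. The following is a minimal userspace sketch of the intended check, not the driver code. The names is_pow2 and check_profile_value are hypothetical, and the rounding loop merely stands in for the kernel's roundup_pow_of_two().

```c
#include <limits.h>
#include <stdbool.h>

/* True only for 1, 2, 4, 8, ...; zero and negative values are rejected. */
static bool is_pow2(int x)
{
    return x > 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical helper: validate one profile value in place.
 * Returns  0 if *val was already a power of two,
 *          1 if *val was rounded up to the next power of two,
 *         -1 if *val is non-positive or too large to round up.
 */
static int check_profile_value(int *val)
{
    int p = 1;

    if (*val <= 0)
        return -1;
    if (is_pow2(*val))
        return 0;

    while (p < *val) {
        if (p > INT_MAX / 2)
            return -1;  /* the next shift would overflow int */
        p <<= 1;
    }
    *val = p;
    return 1;
}
```

A caller would run each parameter through the helper once, warn when it returns 1, and fail the probe when it returns -1.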
From xma at us.ibm.com Wed Nov 15 10:34:36 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 15 Nov 2006 10:34:36 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To:
Message-ID:

>I will rerun my test to use n! = 0 to see any difference here.

It should be n == 0 to indicate empty.

Thanks
Shirley Ma

From jeff.young at isilon.com Wed Nov 15 11:23:14 2006
From: jeff.young at isilon.com (Jeff Young)
Date: Wed, 15 Nov 2006 11:23:14 -0800
Subject: [openib-general] mthca question
Message-ID:

Howdy,

I see in mthca_main.c the line which states:

#warning The mthca driver is no longer kept up to date in svn.
#warning For the latest code, track the upstream kernel.

What does this mean? What is the upstream kernel? Where do
I download the latest sources from?

Thanks,
Jeff

Jeff Young | Software Development Engineer
Isilon Systems
P +1-651-698-2109
F +1-651-698-3286
www.isilon.com
How breakthroughs begin.(tm)

From rdreier at cisco.com Wed Nov 15 11:29:10 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 15 Nov 2006 11:29:10 -0800
Subject: [openib-general] mthca question
In-Reply-To: (Jeff Young's message of "Wed, 15 Nov 2006 11:23:14 -0800")
References:
Message-ID:

> #warning The mthca driver is no longer kept up to date in svn.
> #warning For the latest code, track the upstream kernel.
>
> What does this mean? What is the upstream kernel? Where do
> I download the latest sources from?

this means that the definitive source for the mthca driver is the
standard Linux kernel. The upstream kernel just means Linus's kernel
tree, which you can download from kernel.org or any of the many mirrors.

- R.

From swise at opengridcomputing.com Wed Nov 15 13:42:27 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 15 Nov 2006 15:42:27 -0600
Subject: [openib-general] mthca question
In-Reply-To:
References:
Message-ID: <1163626947.13803.48.camel@stevo-desktop>

We should put this type of warning in all the infiniband/core modules
that have moved to the kernel...

On Wed, 2006-11-15 at 11:29 -0800, Roland Dreier wrote:
> > #warning The mthca driver is no longer kept up to date in svn.
> > #warning For the latest code, track the upstream kernel.
> >
> > What does this mean? What is the upstream kernel? Where do
> > I download the latest sources from?
>
> this means that the definitive source for the mthca driver is the
> standard Linux kernel. The upstream kernel just means Linus's kernel
> tree, which you can download from kernel.org or any of the many mirrors.
>
> - R.

From rjwalsh at pathscale.com Wed Nov 15 15:11:28 2006
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Wed, 15 Nov 2006 15:11:28 -0800
Subject: [openib-general] Question about multicast GIDs
Message-ID: <455B9EA0.6070106@pathscale.com>

Hi all,

Is there a registration authority for multicast GIDs? Or at least a
safe way of assigning a range of GIDs to a vendor?

Regards,
Robert.

From rdreier at cisco.com Wed Nov 15 15:32:37 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 15 Nov 2006 15:32:37 -0800
Subject: [openib-general] ANNOUNCE: libibverbs and libmthca moving to git
Message-ID:

I've converted the libibverbs and libmthca svn history into git (with
some minor cleanups, mostly improving the changelog entries). I've
also created git tags for all of the releases noted in the history.

The git trees

    git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git
    git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git

should now be considered the authoritative sources for libibverbs and
libmthca, and I will not be updating the svn trees. Should this cause
major problems, I am prepared to abandon the experiment and go back to
svn, but I am optimistic that we will be able to take advantage of the
major improvements that git offers.

The following commands will download copies of the repositories into
the local directories libibverbs and libmthca, respectively:

    git clone git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git
    git clone git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git

Amusingly, although this fetches a complete clone of the repository
with the full development histories (including both 1.0 and 1.1
branches for libibverbs), the git trees use less disk space than an
equivalent svn checkout of just the tip of the tree.

As I said above, the libibverbs repository has both the unstable
development branch (named "master") and the stable 1.0 branch (named
"stable"). From your libibverbs directory,

    gitk --all

will visualize the full history of all branches.

    git checkout stable

will switch to the stable (1.0) branch, and

    git checkout master

will switch back to the unstable branch.

To update your repository with new upstream changes, just do

    git pull

from the repository.

From rdreier at cisco.com Wed Nov 15 15:46:47 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 15 Nov 2006 15:46:47 -0800
Subject: [openib-general] [PATCH] IB/mthca: HCA profile module parameters
In-Reply-To: <455B42CB.7030008@voltaire.com> (Moni Shoua's message of "Wed, 15 Nov 2006 18:39:39 +0200")
References: <455B4230.6070101@voltaire.com> <455B42CB.7030008@voltaire.com>
Message-ID:

The patch is line-wrapped and bizarrely corrupted and won't apply,
e.g.:

 > + mthca_warn(mdev, "num_qp rounded to power of 2 (%d).\n",
 > + default_profile.num_qp); + }

This is completely unnecessary:

 > +#define to_up_power_of_2(x) (x = roundup_pow_of_two(x))

...just open code this.

And this seems strange:

 > +#define is_power_of_2(x) (x>0 &&(x & (x - 1)))

so there's no warning if someone passes in a negative value?? and
it's backwards too, (x & (x - 1)) is 0 precisely for the powers of 2.
Was this patch tested at all?

Anyway, all this

 > + if (!is_power_of_2(default_profile.num_qp)){
 > + to_up_power_of_2(default_profile.num_qp);
 > + mthca_warn(mdev, "num_qp rounded to power of 2 (%d).\n",
 > + default_profile.num_qp); + }

seems very repetitive. Can't it be wrapped up in a function so we
just do something like

    mthca_check_profile_value(&default_profile.num_qp);
    mthca_check_profile_value(&default_profile.rdb_per_qp);
    mthca_check_profile_value(&default_profile.num_cq);

etc.

 - R.

From rdreier at cisco.com Wed Nov 15 15:48:23 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 15 Nov 2006 15:48:23 -0800
Subject: [openib-general] ifconfig down stuck
In-Reply-To: <20061115101830.GA21384@mellanox.co.il> (Michael S. Tsirkin's message of "Wed, 15 Nov 2006 12:18:30 +0200")
References: <20061115092044.GA18747@mellanox.co.il> <20061115101830.GA21384@mellanox.co.il>
Message-ID:

 > That, in turn, is waiting for net device ref count to get to 0.
 > But is it normal that the ref count is negative?

No, that's not normal, it's the real bug. Something is doing an extra
dev_put() on the device.

- r. From rdreier at cisco.com Wed Nov 15 15:49:45 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Nov 2006 15:49:45 -0800 Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) In-Reply-To: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> (Ramachandra K.'s message of "Tue, 14 Nov 2006 20:20:33 +0530") References: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> Message-ID: > If you think these patches are good enough, could you please create a > branch in your git tree based on for-2.6.20 for this code ? Yes, I will create a vex branch for this in my tree. However, moving this further upstream will depend on getting a real review of the code, and some sort of protocol document will probably be required for anyone to wade through this... - R. From rdreier at cisco.com Wed Nov 15 16:22:33 2006 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 15 Nov 2006 16:22:33 -0800 Subject: [openib-general] Question about multicast GIDs References: <455B9EA0.6070106@pathscale.com> Message-ID: > Is there are registration authority for multicast GIDs? Or at least a > safe way of assigning a range of GIDs to a vendor? I don't think so. Perhaps RFC 3307 would be of some use... - R. From sashak at voltaire.com Wed Nov 15 16:40:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Nov 2006 02:40:32 +0200 Subject: [openib-general] [RFC PATCH] opensm: dump/restore SA database Message-ID: <20061116004032.GR31078@sashak.voltaire.com> This adds OpenSM SA database DB dumping and restoring functionality. In verbose mode OpenSM will dump SA DB (existing multicast groups, services and InformInfo) into dump file which named "opensm-sa.dump" and located under standard OpenSM dump directory (/var/log by default). If option -S is specified and SA DB dump file name is provided OpenSM will try to restore SA database from this file. 
And if succeed will don't ask for clients reregistration at subnet bring-up. Signed-off-by: Sasha Khapyorsky --- osm/include/opensm/osm_sa.h | 43 +++- osm/include/opensm/osm_subnet.h | 8 + osm/opensm/main.c | 20 +- osm/opensm/osm_lid_mgr.c | 1 + osm/opensm/osm_opensm.c | 3 + osm/opensm/osm_sa.c | 630 +++++++++++++++++++++++++++++++++++ osm/opensm/osm_sa_mcmember_record.c | 14 +- osm/opensm/osm_state_mgr.c | 8 + osm/opensm/osm_subnet.c | 11 + 9 files changed, 729 insertions(+), 9 deletions(-) diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h index 0d450ad..70d9bb0 100644 --- a/osm/include/opensm/osm_sa.h +++ b/osm/include/opensm/osm_sa.h @@ -54,8 +54,8 @@ #include #include #include #include -#include #include +#include #include #include #include @@ -445,6 +445,47 @@ osm_sa_bind( * SEE ALSO *********/ +struct _osm_opensm_t; +/****f* OpenSM: SA/osm_sa_db_file_dump +* NAME +* osm_sa_db_file_dump +* +* DESCRIPTION +* Dumps the SA DB to the dump file. +* +* SYNOPSIS +*/ +int osm_sa_db_file_dump(struct _osm_opensm_t *p_osm); +/* +* PARAMETERS +* p_osm +* [in] Pointer to an osm_opensm_t object. +* +* RETURN VALUES +* None +* +*********/ + +/****f* OpenSM: SA/osm_sa_db_file_load +* NAME +* osm_sa_db_file_load +* +* DESCRIPTION +* Loads SA DB from the file. +* +* SYNOPSIS +*/ +int osm_sa_db_file_load(struct _osm_opensm_t *p_osm); +/* +* PARAMETERS +* p_osm +* [in] Pointer to an osm_opensm_t object. +* +* RETURN VALUES +* 0 on success, other value on failure. 
+* +*********/ + END_C_DECLS #endif /* _OSM_SA_H_ */ diff --git a/osm/include/opensm/osm_subnet.h b/osm/include/opensm/osm_subnet.h index d9895ec..2b22ea5 100644 --- a/osm/include/opensm/osm_subnet.h +++ b/osm/include/opensm/osm_subnet.h @@ -279,6 +279,7 @@ typedef struct _osm_subn_opt char * lid_matrix_dump_file; char * ucast_dump_file; char * updn_guid_file; + char * sa_db_file; boolean_t exit_on_fatal; boolean_t honor_guid2lid_file; osm_qos_options_t qos_options; @@ -287,6 +288,7 @@ typedef struct _osm_subn_opt osm_qos_options_t qos_swe_options; osm_qos_options_t qos_rtr_options; boolean_t enable_quirks; + boolean_t no_clients_rereg; } osm_subn_opt_t; /* * FIELDS @@ -443,6 +445,9 @@ typedef struct _osm_subn_opt * updn_guid_file * Pointer to name of the UPDN guid file given by User * +* sa_db_file +* Name of the SA database file. +* * exit_on_fatal * If TRUE (default) - SM will exit on fatal subnet initialization issues. * If FALSE - SM will not exit. @@ -474,6 +479,9 @@ typedef struct _osm_subn_opt * Enable high risk new features and not fully qualified * hardware specific work arounds * +* no_clients_rereg +* When TRUE disables clients reregistration request. 
+* * SEE ALSO * Subnet object *********/ diff --git a/osm/opensm/main.c b/osm/opensm/main.c index 752b546..6c83018 100644 --- a/osm/opensm/main.c +++ b/osm/opensm/main.c @@ -48,18 +48,18 @@ #if HAVE_CONFIG_H # include #endif /* HAVE_CONFIG_H */ -#include "stdio.h" +#include #include #include #include +#include +#include +#include #include #ifdef OSM_VENDOR_INTF_OPENIB #include #endif #include -#include -#include -#include #include /******************************************************************** @@ -186,6 +186,10 @@ show_usage(void) "--ucast_file \n" " This option specifies name of the unicast dump file\n" " from where switch forwarding tables will be loaded.\n\n"); + printf( "-S\n" + "--sadb_file \n" + " This option specifies name of the SA DB dump file\n" + " from where SA database will be loaded.\n\n"); printf ("-a\n" "--add_guid_file \n" " Set the root nodes for the Up/Down routing algorithm\n" @@ -537,7 +541,7 @@ #endif boolean_t cache_options = FALSE; char *ignore_guids_file_name = NULL; uint32_t val; - const char * const short_option = "i:f:ed:g:l:L:s:t:a:R:M:U:P:NQvVhorcyx"; + const char * const short_option = "i:f:ed:g:l:L:s:t:a:R:M:U:S:P:NQvVhorcyx"; /* In the array below, the 2nd parameter specified the number @@ -573,6 +577,7 @@ #endif { "routing_engine",1, NULL, 'R'}, { "lid_matrix_file",1, NULL, 'M'}, { "ucast_file" ,1, NULL, 'U'}, + { "sadb_file" ,1, NULL, 'S'}, { "add_guid_file", 1, NULL, 'a'}, { "cache-options", 0, NULL, 'c'}, { "stay_on_fatal", 0, NULL, 'y'}, @@ -812,6 +817,11 @@ #endif printf(" Ucast dump file is \'%s\'\n", optarg); break; + case 'S': + opt.sa_db_file = optarg; + printf(" SA DB file is \'%s\'\n", optarg); + break; + case 'a': /* Specifies port guids file diff --git a/osm/opensm/osm_lid_mgr.c b/osm/opensm/osm_lid_mgr.c index da0b7a0..f2601dc 100644 --- a/osm/opensm/osm_lid_mgr.c +++ b/osm/opensm/osm_lid_mgr.c @@ -1276,6 +1276,7 @@ __osm_lid_mgr_set_physp_pi( if ( ( p_mgr->p_subn->first_time_master_sweep == TRUE || new_port == 
TRUE ) && + !p_mgr->p_subn->opt.no_clients_rereg && ( (p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) != 0 ) ) ib_port_info_set_client_rereg( p_pi, 1 ); else diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c index 7c0ad7b..2ce7b36 100644 --- a/osm/opensm/osm_opensm.c +++ b/osm/opensm/osm_opensm.c @@ -165,6 +165,9 @@ osm_opensm_destroy( /* shut down the dispatcher - so no new messages cross */ cl_disp_shutdown( &p_osm->disp ); + /* dump SA DB */ + osm_sa_db_file_dump(p_osm); + /* do the destruction in reverse order as init */ if (p_osm->routing_engine.delete) p_osm->routing_engine.delete(p_osm->routing_engine.context); diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c index 1446e15..5d81454 100644 --- a/osm/opensm/osm_sa.c +++ b/osm/opensm/osm_sa.c @@ -51,6 +51,10 @@ # include #endif /* HAVE_CONFIG_H */ #include +#include +#include +#include +#include #include #include #include @@ -62,6 +66,10 @@ #include #include #include #include +#include +#include +#include +#include #define OSM_SA_INITIAL_TID_VALUE 0xabc @@ -529,3 +537,625 @@ osm_sa_bind( OSM_LOG_EXIT( p_sa->p_log ); return( status ); } + +/********************************************************************** + **********************************************************************/ +/* + * SA DB Dumper + * + */ + +struct opensm_dump_context { + osm_opensm_t *p_osm; + FILE *file; +}; + +static int +opensm_dump_to_file(osm_opensm_t *p_osm, const char *file_name, + void (*dump_func)(osm_opensm_t *p_osm, FILE *file)) +{ + char path[1024]; + FILE *file; + + snprintf(path, sizeof(path), "%s/%s", + p_osm->subn.opt.dump_files_dir, file_name); + + file = fopen(path, "w"); + if (!file) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "opensm_dump_to_file: ERR 0000: " + "cannot open file \'%s\': %s\n", + file_name, strerror(errno)); + return -1; + } + + chmod(path, S_IRUSR|S_IWUSR); + + dump_func(p_osm, file); + + fclose(file); + return 0; +} + +static void +mcast_mgr_dump_one_port(cl_map_item_t 
*p_map_item, void *cxt) +{ + FILE *file = ((struct opensm_dump_context *)cxt)->file; + osm_mcm_port_t *p_mcm_port = (osm_mcm_port_t *)p_map_item; + + fprintf(file, "mcm_port: " + "port_gid=0x%016" PRIx64 ":0x%016" PRIx64 " " + "scope_state=0x%02x proxy_join=0x%x" "\n\n", + cl_ntoh64(p_mcm_port->port_gid.unicast.prefix), + cl_ntoh64(p_mcm_port->port_gid.unicast.interface_id), + p_mcm_port->scope_state, + p_mcm_port->proxy_join); +} + +static void +sa_dump_one_mgrp(cl_map_item_t *p_map_item, void *cxt) +{ + struct opensm_dump_context dump_context; + osm_opensm_t *p_osm = ((struct opensm_dump_context *)cxt)->p_osm; + FILE *file = ((struct opensm_dump_context *)cxt)->file; + osm_mgrp_t *p_mgrp = (osm_mgrp_t *)p_map_item; + + fprintf(file, "MC Group 0x%04x %s:" + " mgid=0x%016" PRIx64 ":0x%016" PRIx64 + " port_gid=0x%016" PRIx64 ":0x%016" PRIx64 + " qkey=0x%08x mlid=0x%04x mtu=0x%02x tclass=0x%02x" + " pkey=0x%04x rate=0x%02x pkt_life=0x%02x sl_flow_hop=0x%08x" + " scope_state=0x%02x proxy_join=0x%x" "\n\n", + cl_ntoh16(p_mgrp->mlid), + p_mgrp->well_known ? 
" (well known)" : "", + cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix), + cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id), + cl_ntoh64(p_mgrp->mcmember_rec.port_gid.unicast.prefix), + cl_ntoh64(p_mgrp->mcmember_rec.port_gid.unicast.interface_id), + cl_ntoh32(p_mgrp->mcmember_rec.qkey), + cl_ntoh16(p_mgrp->mcmember_rec.mlid), + p_mgrp->mcmember_rec.mtu, + p_mgrp->mcmember_rec.tclass, + cl_ntoh16(p_mgrp->mcmember_rec.pkey), + p_mgrp->mcmember_rec.rate, + p_mgrp->mcmember_rec.pkt_life, + cl_ntoh32(p_mgrp->mcmember_rec.sl_flow_hop), + p_mgrp->mcmember_rec.scope_state, + p_mgrp->mcmember_rec.proxy_join + ); + + dump_context.p_osm = p_osm; + dump_context.file = file; + + cl_qmap_apply_func(&p_mgrp->mcm_port_tbl, + mcast_mgr_dump_one_port, &dump_context); +} + +static void +sa_dump_one_inform(cl_list_item_t *p_list_item, void *cxt) +{ + FILE *file = ((struct opensm_dump_context *)cxt)->file; + osm_infr_t *p_infr = (osm_infr_t *)p_list_item; + ib_inform_info_record_t *p_iir = &p_infr->inform_record; + + fprintf(file, "InformInfo Record:" + " subscriber_gid=0x%016" PRIx64 ":0x%016" PRIx64 + " subscriber_enum=0x%x" + " InformInfo:" + " gid=0x%016" PRIx64 ":0x%016" PRIx64 + " lid_range_begin=0x%x" + " lid_range_end=0x%x" + " is_generic=0x%x" + " subscribe=0x%x" + " trap_type=0x%x" + " trap_num=0x%x" + " qpn_resp_time_val=0x%x" + " node_type=0x%06x" + " rep_addr: lid=0x%04x path_bits=0x%02x static_rate=0x%02x" + " remote_qp=0x%08x remote_qkey=0x%08x pkey=0x%04x sl=0x%02x" + "\n\n", + cl_ntoh64(p_iir->subscriber_gid.unicast.prefix), + cl_ntoh64(p_iir->subscriber_gid.unicast.interface_id), + cl_ntoh16(p_iir->subscriber_enum), + cl_ntoh64(p_iir->inform_info.gid.unicast.prefix), + cl_ntoh64(p_iir->inform_info.gid.unicast.interface_id), + cl_ntoh16(p_iir->inform_info.lid_range_begin), + cl_ntoh16(p_iir->inform_info.lid_range_end), + p_iir->inform_info.is_generic, + p_iir->inform_info.subscribe, + cl_ntoh16(p_iir->inform_info.trap_type), + 
cl_ntoh16(p_iir->inform_info.g_or_v.generic.trap_num), + cl_ntoh32(p_iir->inform_info.g_or_v.generic.qpn_resp_time_val), + cl_ntoh32(ib_inform_info_get_node_type(&p_iir->inform_info)), + cl_ntoh16(p_infr->report_addr.dest_lid), + p_infr->report_addr.path_bits, + p_infr->report_addr.static_rate, + cl_ntoh32(p_infr->report_addr.addr_type.gsi.remote_qp), + cl_ntoh32(p_infr->report_addr.addr_type.gsi.remote_qkey), + cl_ntoh16(p_infr->report_addr.addr_type.gsi.pkey), + p_infr->report_addr.addr_type.gsi.service_level); +} + +static void +sa_dump_one_service(cl_list_item_t *p_list_item, void *cxt) +{ + FILE *file = ((struct opensm_dump_context *)cxt)->file; + osm_svcr_t *p_svcr = (osm_svcr_t *)p_list_item; + ib_service_record_t *p_sr = &p_svcr->service_record; + + fprintf(file, "Service Record: id=0x%016" PRIx64 + " gid=0x%016" PRIx64 ":0x%016" PRIx64 + " pkey=0x%x" + " lease=0x%x" + " key=0x%02x%02x%02x%02x%02x%02x%02x%02x" + ":0x%02x%02x%02x%02x%02x%02x%02x%02x" + " name=\'%s\'" + " data8=0x%02x%02x%02x%02x%02x%02x%02x%02x" + ":0x%02x%02x%02x%02x%02x%02x%02x%02x" + " data16=0x%04x%04x%04x%04x:0x%04x%04x%04x%04x" + " data32=0x%08x%08x:0x%08x%08x" + " data64=0x%016" PRIx64 ":0x%016" PRIx64 + " modified_time=0x%x lease_period=0x%x\n\n", + cl_ntoh64( p_sr->service_id ), + cl_ntoh64( p_sr->service_gid.unicast.prefix ), + cl_ntoh64( p_sr->service_gid.unicast.interface_id ), + cl_ntoh16( p_sr->service_pkey ), + cl_ntoh32( p_sr->service_lease ), + p_sr->service_key[0], p_sr->service_key[1], + p_sr->service_key[2], p_sr->service_key[3], + p_sr->service_key[4], p_sr->service_key[5], + p_sr->service_key[6], p_sr->service_key[7], + p_sr->service_key[8], p_sr->service_key[9], + p_sr->service_key[10], p_sr->service_key[11], + p_sr->service_key[12], p_sr->service_key[13], + p_sr->service_key[14], p_sr->service_key[15], + p_sr->service_name, + p_sr->service_data8[0], p_sr->service_data8[1], + p_sr->service_data8[2], p_sr->service_data8[3], + p_sr->service_data8[4], 
p_sr->service_data8[5], + p_sr->service_data8[6], p_sr->service_data8[7], + p_sr->service_data8[8], p_sr->service_data8[9], + p_sr->service_data8[10], p_sr->service_data8[11], + p_sr->service_data8[12], p_sr->service_data8[13], + p_sr->service_data8[14], p_sr->service_data8[15], + cl_ntoh16(p_sr->service_data16[0]), + cl_ntoh16(p_sr->service_data16[1]), + cl_ntoh16(p_sr->service_data16[2]), + cl_ntoh16(p_sr->service_data16[3]), + cl_ntoh16(p_sr->service_data16[4]), + cl_ntoh16(p_sr->service_data16[5]), + cl_ntoh16(p_sr->service_data16[6]), + cl_ntoh16(p_sr->service_data16[7]), + cl_ntoh32(p_sr->service_data32[0]), + cl_ntoh32(p_sr->service_data32[1]), + cl_ntoh32(p_sr->service_data32[2]), + cl_ntoh32(p_sr->service_data32[3]), + cl_ntoh64(p_sr->service_data64[0]), + cl_ntoh64(p_sr->service_data64[1]), + p_svcr->modified_time, p_svcr->lease_period); +} + +static void +sa_dump_all_sa(osm_opensm_t *p_osm, FILE *file) +{ + struct opensm_dump_context dump_context; + + dump_context.p_osm = p_osm; + dump_context.file = file; + osm_log(&p_osm->log, OSM_LOG_DEBUG, "sa_dump_all_sa: Dump multicat:\n"); + cl_plock_acquire(&p_osm->lock); + cl_qmap_apply_func(&p_osm->subn.mgrp_mlid_tbl, + sa_dump_one_mgrp, &dump_context); + osm_log(&p_osm->log, OSM_LOG_DEBUG, "sa_dump_all_sa: Dump inform:\n"); + cl_qlist_apply_func(&p_osm->subn.sa_infr_list, + sa_dump_one_inform, &dump_context); + osm_log(&p_osm->log, OSM_LOG_DEBUG, "sa_dump_all_sa: Dump services:\n"); + cl_qlist_apply_func(&p_osm->subn.sa_sr_list, + sa_dump_one_service, &dump_context); + cl_plock_release(&p_osm->lock); +} + +int osm_sa_db_file_dump(osm_opensm_t *p_osm) +{ + return opensm_dump_to_file(p_osm, "opensm-sa.dump", sa_dump_all_sa); +} + +/* + * SA DB Loader + * + */ + +osm_mgrp_t *load_mcgroup(osm_opensm_t *p_osm, ib_net16_t mlid, + ib_member_rec_t *p_mcm_rec, unsigned well_known) +{ + ib_net64_t comp_mask; + cl_map_item_t *p_next; + osm_mgrp_t *p_mgrp; + + cl_plock_excl_acquire(&p_osm->lock); + + if ((p_next = 
cl_qmap_get(&p_osm->subn.mgrp_mlid_tbl, mlid)) != + cl_qmap_end(&p_osm->subn.mgrp_mlid_tbl)) { + p_mgrp = (osm_mgrp_t *)p_next; + if (!memcmp(&p_mgrp->mcmember_rec.mgid, &p_mcm_rec->mgid, + sizeof(ib_gid_t))) { + osm_log(&p_osm->log, OSM_LOG_DEBUG, + "load_mcgroup: mgrp %04x is already here.", + cl_ntoh16(mlid)); + goto _out; + } + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "load_mcgroup: mlid %04x is already used by another " + "MC group. Will request clients reregistration.\n", + cl_ntoh16(mlid)); + p_mgrp = NULL; + goto _out; + } + + comp_mask = IB_MCR_COMPMASK_MTU | IB_MCR_COMPMASK_MTU_SEL + | IB_MCR_COMPMASK_RATE | IB_MCR_COMPMASK_RATE_SEL; + if (osm_mcmr_rcv_find_or_create_new_mgrp(&p_osm->sa.mcmr_rcv, + comp_mask, p_mcm_rec, + &p_mgrp) != IB_SUCCESS || + !p_mgrp || p_mgrp->mlid != mlid) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "load_mcgroup: cannot create MC group with mlid " + "0x%04x and mgid 0x%016" PRIx64 ":0x%016" PRIx64 "\n", + cl_ntoh16(mlid), + cl_ntoh64(p_mcm_rec->mgid.unicast.prefix), + cl_ntoh64(p_mcm_rec->mgid.unicast.interface_id)); + p_mgrp=NULL; + } + else if (well_known) + p_mgrp->well_known = TRUE; + + _out: + cl_plock_release(&p_osm->lock); + + return p_mgrp; +} + +static int load_svcr(osm_opensm_t *p_osm, ib_service_record_t *sr, + uint32_t modified_time, uint32_t lease_period) +{ + osm_svcr_t *p_svcr; + int ret = 0; + + cl_plock_excl_acquire(&p_osm->lock); + + if(osm_svcr_get_by_rid(&p_osm->subn, &p_osm->log, sr)) { + osm_log(&p_osm->log, OSM_LOG_VERBOSE, + "load_svcr ServiceRecord already exists.\n"); + goto _out; + } + + if (!(p_svcr = osm_svcr_new(sr))) { + osm_log(&p_osm->log, OSM_LOG_ERROR, + "load_svcr: cannot allocate new service struct\n"); + ret = -1; + goto _out; + } + + p_svcr->modified_time = modified_time; + p_svcr->lease_period = lease_period; + + osm_log(&p_osm->log, OSM_LOG_DEBUG, + "load_svcr: adding ServiceRecord...\n"); + + osm_svcr_insert_to_db(&p_osm->subn, &p_osm->log, p_svcr); + + if (lease_period != 0xffffffff) + 
cl_timer_trim(&p_osm->sa.sr_rcv.sr_timer, 1000);
+
+ _out:
+	cl_plock_release(&p_osm->lock);
+
+	return ret;
+}
+
+static int load_infr(osm_opensm_t *p_osm, ib_inform_info_record_t *iir,
+		     osm_mad_addr_t *addr)
+{
+	osm_infr_t infr, *p_infr;
+	int ret = 0;
+
+	infr.h_bind = p_osm->sa.mad_ctrl.h_bind;
+	infr.p_infr_rcv = &p_osm->sa.infr_rcv;
+	/* another possible way to restore mad_addr partially is
+	   to extract qpn from InformInfo and to find lid by gid */
+	infr.report_addr = *addr;
+	infr.inform_record = *iir;
+
+	cl_plock_excl_acquire(&p_osm->lock);
+	if (osm_infr_get_by_rec(&p_osm->subn, &p_osm->log, &infr)) {
+		osm_log(&p_osm->log, OSM_LOG_VERBOSE,
+			"load_infr: InformInfo Record already exists.\n");
+		goto _out;
+	}
+
+	if (!(p_infr = osm_infr_new(&infr))) {
+		osm_log(&p_osm->log, OSM_LOG_ERROR,
+			"load_infr: cannot allocate new infr struct\n");
+		ret = -1;
+		goto _out;
+	}
+
+	osm_log(&p_osm->log, OSM_LOG_DEBUG,
+		"load_infr: adding InformInfo Record...\n");
+
+	osm_infr_insert_to_db(&p_osm->subn, &p_osm->log, p_infr);
+
+ _out:
+	cl_plock_release(&p_osm->lock);
+
+	return ret;
+}
+
+
+/* The limit is cast back to the target type so that integer promotion of
+ * the 8- and 16-bit variants cannot turn it into -1 and defeat the check. */
+#define UNPACK_FUNC(name,x) \
+int unpack_##name##x(char *p, uint##x##_t *val_ptr) \
+{ \
+	char *q; \
+	unsigned long long num; \
+	num = strtoull(p, &q, 16); \
+	if (num > (uint##x##_t)~((uint##x##_t)0x0) \
+	    || q == p || (!isspace((unsigned char)*q) && *q != ':')) { \
+		*val_ptr = 0; \
+		return -1; \
+	} \
+	*val_ptr = cl_hton##x((uint##x##_t)num); \
+	return q - p; \
+}
+
+#define cl_hton8(x) (x)
+
+UNPACK_FUNC(net,8);
+UNPACK_FUNC(net,16);
+UNPACK_FUNC(net,32);
+UNPACK_FUNC(net,64);
+
+static int unpack_string(char *p, uint8_t *buf, unsigned len)
+{
+	char *q = p;
+	char delim = ' ';
+	if (*q == '\'' || *q == '\"')
+		delim = *q++;
+	while (--len && *q && *q != delim)
+		*buf++ = *q++;
+	*buf = '\0';
+	if (*q == delim && delim != ' ')
+		q++;
+	return q - p;
+}
+
+static int unpack_string64(char *p, uint8_t *buf)
+{
+	return unpack_string(p, buf, 64);
+}
+
+#define PARSE_AHEAD(p, x, name, val_ptr) { int 
_ret; \
+	p = strstr(p, name); \
+	if (!p) { \
+		osm_log(&p_osm->log, OSM_LOG_ERROR, \
+			"PARSE ERROR: %s:%u: cannot find \"%s\" string\n", \
+			file_name, lineno, (name)); \
+		ret = -2; \
+		goto _error; \
+	} \
+	p += strlen(name); \
+	_ret = unpack_##x(p, (val_ptr)); \
+	if (_ret < 0) { \
+		osm_log(&p_osm->log, OSM_LOG_ERROR, \
+			"PARSE ERROR: %s:%u: cannot parse "#x" value " \
+			"after \"%s\"\n", file_name, lineno, (name)); \
+		ret = _ret; \
+		goto _error; \
+	} \
+	p += _ret; \
+}
+
+int osm_sa_db_file_load(osm_opensm_t *p_osm)
+{
+	char line[1024];
+	char *file_name;
+	FILE *file;
+	int ret = 0;
+	osm_mgrp_t *p_mgrp = NULL;
+	unsigned rereg_clients = 0;
+	unsigned lineno;
+
+	file_name = p_osm->subn.opt.sa_db_file;
+	if (!file_name) {
+		osm_log(&p_osm->log, OSM_LOG_VERBOSE,
+			"osm_sa_db_file_load: sa db file name is not "
+			"specified. Skip restore\n");
+		return 0;
+	}
+
+	file = fopen(file_name, "r");
+	if (!file) {
+		osm_log(&p_osm->log, OSM_LOG_ERROR|OSM_LOG_SYS,
+			"osm_sa_db_file_load: ERR 0000: "
+			"cannot open sa db file \'%s\'. 
" + "Skip restoring\n", file_name); + return -1; + } + + lineno = 0; + + while (fgets(line, sizeof(line) - 1, file) != NULL) { + char *p; + uint8_t val; + + lineno++; + + p = line; + while (isspace(*p)) + p++; + + if (*p == '#') + continue; + + if (!strncmp(p, "MC Group", 8)) { + ib_member_rec_t mcm_rec; + ib_net16_t mlid; + unsigned well_known = 0; + + p_mgrp = NULL; + memset(&mcm_rec, 0, sizeof(mcm_rec)); + + PARSE_AHEAD(p, net16, " 0x", &mlid); + if(strstr(p, "well known")) + well_known = 1; + PARSE_AHEAD(p, net64, " mgid=0x", + &mcm_rec.mgid.unicast.prefix); + PARSE_AHEAD(p, net64, ":0x", + &mcm_rec.mgid.unicast.interface_id); + PARSE_AHEAD(p, net64, " port_gid=0x", + &mcm_rec.port_gid.unicast.prefix); + PARSE_AHEAD(p, net64, ":0x", + &mcm_rec.port_gid.unicast.interface_id); + PARSE_AHEAD(p, net32, " qkey=0x", &mcm_rec.qkey); + PARSE_AHEAD(p, net16, " mlid=0x", &mcm_rec.mlid); + PARSE_AHEAD(p, net8, " mtu=0x", &mcm_rec.mtu); + PARSE_AHEAD(p, net8, " tclass=0x", &mcm_rec.tclass); + PARSE_AHEAD(p, net16, " pkey=0x", &mcm_rec.pkey); + PARSE_AHEAD(p, net8, " rate=0x", &mcm_rec.rate); + PARSE_AHEAD(p, net8, " pkt_life=0x", &mcm_rec.pkt_life); + PARSE_AHEAD(p, net32, " sl_flow_hop=0x", + &mcm_rec.sl_flow_hop); + PARSE_AHEAD(p, net8, " scope_state=0x", + &mcm_rec.scope_state); + PARSE_AHEAD(p, net8, " proxy_join=0x", &val); + mcm_rec.proxy_join = val; + + p_mgrp = load_mcgroup(p_osm, mlid, &mcm_rec, + well_known); + if (!p_mgrp) + rereg_clients = 1; + } + else if (p_mgrp && !strncmp(p, "mcm_port", 8)) { + ib_gid_t port_gid; + ib_net64_t guid; + uint8_t scope_state; + boolean_t proxy_join; + + PARSE_AHEAD(p, net64, " port_gid=0x", + &port_gid.unicast.prefix); + PARSE_AHEAD(p, net64, ":0x", + &port_gid.unicast.interface_id); + PARSE_AHEAD(p, net8, " scope_state=0x", &scope_state); + PARSE_AHEAD(p, net8, " proxy_join=0x", &val); + proxy_join = val; + + guid = port_gid.unicast.interface_id; + if (cl_qmap_get(&p_mgrp->mcm_port_tbl, + port_gid.unicast.interface_id) == + 
cl_qmap_end(&p_mgrp->mcm_port_tbl)) + osm_mgrp_add_port(p_mgrp, &port_gid, + scope_state, proxy_join); + } + else if (!strncmp(p, "Service Record:", 15)) { + ib_service_record_t s_rec; + uint32_t modified_time, lease_period; + + p_mgrp = NULL; + memset(&s_rec, 0, sizeof(s_rec)); + + PARSE_AHEAD(p, net64, " id=0x", &s_rec.service_id); + PARSE_AHEAD(p, net64, " gid=0x", + &s_rec.service_gid.unicast.prefix); + PARSE_AHEAD(p, net64, ":0x", + &s_rec.service_gid.unicast.interface_id); + PARSE_AHEAD(p, net16, " pkey=0x", &s_rec.service_pkey); + PARSE_AHEAD(p, net32, " lease=0x", &s_rec.service_lease); + PARSE_AHEAD(p, net64, " key=0x", + (ib_net64_t *)(&s_rec.service_key[0])); + PARSE_AHEAD(p, net64, ":0x", + (ib_net64_t *)(&s_rec.service_key[8])); + PARSE_AHEAD(p, string64, " name=", s_rec.service_name); + PARSE_AHEAD(p, net64, " data8=0x", + (ib_net64_t *)(&s_rec.service_data8[0])); + PARSE_AHEAD(p, net64, ":0x", + (ib_net64_t *)(&s_rec.service_data8[8])); + PARSE_AHEAD(p, net64, " data16=0x", + (ib_net64_t *)(&s_rec.service_data16[0])); + PARSE_AHEAD(p, net64, ":0x", + (ib_net64_t *)(&s_rec.service_data16[4])); + PARSE_AHEAD(p, net64, " data32=0x", + (ib_net64_t *)(&s_rec.service_data32[0])); + PARSE_AHEAD(p, net64, ":0x", + (ib_net64_t *)(&s_rec.service_data32[2])); + PARSE_AHEAD(p, net64, " data64=0x", &s_rec.service_data64[0]); + PARSE_AHEAD(p, net64, ":0x", &s_rec.service_data64[1]); + PARSE_AHEAD(p, net32, " modified_time=0x", + &modified_time); + PARSE_AHEAD(p, net32, " lease_period=0x", + &lease_period); + + if (load_svcr(p_osm, &s_rec, cl_ntoh32(modified_time), + cl_ntoh32(lease_period))) + rereg_clients = 1; + } + else if (!strncmp(p, "InformInfo Record:", 18)) { + ib_inform_info_record_t i_rec; + osm_mad_addr_t rep_addr; + + p_mgrp = NULL; + memset(&i_rec, 0, sizeof(i_rec)); + memset(&rep_addr, 0, sizeof(rep_addr)); + + PARSE_AHEAD(p, net64, " subscriber_gid=0x", + &i_rec.subscriber_gid.unicast.prefix); + PARSE_AHEAD(p, net64, ":0x", + 
&i_rec.subscriber_gid.unicast.interface_id); + PARSE_AHEAD(p, net16, " subscriber_enum=0x", + &i_rec.subscriber_enum); + PARSE_AHEAD(p, net64, " gid=0x", + &i_rec.inform_info.gid.unicast.prefix); + PARSE_AHEAD(p, net64, ":0x", + &i_rec.inform_info.gid.unicast.interface_id); + PARSE_AHEAD(p, net16, " lid_range_begin=0x", + &i_rec.inform_info.lid_range_begin); + PARSE_AHEAD(p, net16, " lid_range_end=0x", + &i_rec.inform_info.lid_range_end); + PARSE_AHEAD(p, net8, " is_generic=0x", + &i_rec.inform_info.is_generic); + PARSE_AHEAD(p, net8, " subscribe=0x", + &i_rec.inform_info.subscribe); + PARSE_AHEAD(p, net16, " trap_type=0x", + &i_rec.inform_info.trap_type); + PARSE_AHEAD(p, net16, " trap_num=0x", + &i_rec.inform_info.g_or_v.generic.trap_num); + PARSE_AHEAD(p, net32, " qpn_resp_time_val=0x", + &i_rec.inform_info.g_or_v.generic.qpn_resp_time_val); + PARSE_AHEAD(p, net32, " node_type=0x", + (uint32_t *)&i_rec.inform_info.g_or_v.generic.reserved2); + + PARSE_AHEAD(p, net16, " rep_addr: lid=0x", + &rep_addr.dest_lid); + PARSE_AHEAD(p, net8, " path_bits=0x", + &rep_addr.path_bits); + PARSE_AHEAD(p, net8, " static_rate=0x", + &rep_addr.static_rate); + PARSE_AHEAD(p, net32, " remote_qp=0x", + &rep_addr.addr_type.gsi.remote_qp); + PARSE_AHEAD(p, net32, " remote_qkey=0x", + &rep_addr.addr_type.gsi.remote_qkey); + PARSE_AHEAD(p, net16, " pkey=0x", + &rep_addr.addr_type.gsi.pkey); + PARSE_AHEAD(p, net8, " sl=0x", + &rep_addr.addr_type.gsi.service_level); + + if (load_infr(p_osm, &i_rec, &rep_addr)) + rereg_clients = 1; + } + } + + if (!rereg_clients) + p_osm->subn.opt.no_clients_rereg = TRUE; + + _error: + fclose(file); + return ret; +} diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c index bd52c77..f7f879b 100644 --- a/osm/opensm/osm_sa_mcmember_record.c +++ b/osm/opensm/osm_sa_mcmember_record.c @@ -291,7 +291,7 @@ available mlids. 
**********************************************************************/ static ib_net16_t __get_new_mlid( - IN osm_mcmr_recv_t* const p_rcv) + IN osm_mcmr_recv_t* const p_rcv, ib_net16_t requested_mlid) { osm_subn_t *p_subn = p_rcv->p_subn; osm_mgrp_t *p_mgrp; @@ -301,7 +301,15 @@ __get_new_mlid( uint16_t max_num_mlids; OSM_LOG_ENTER(p_rcv->p_log, __get_new_mlid); - + + if (requested_mlid && cl_ntoh16(requested_mlid) >= IB_LID_MCAST_START_HO && + cl_ntoh16(requested_mlid) < p_subn->max_multicast_lid_ho && + cl_qmap_get(&p_subn->mgrp_mlid_tbl, requested_mlid) == + cl_qmap_end(&p_subn->mgrp_mlid_tbl) ) { + mlid = cl_ntoh16(requested_mlid); + goto Exit; + } + /* If MCGroups table empty, first return the min mlid */ p_mgrp = (osm_mgrp_t*)cl_qmap_head( &p_subn->mgrp_mlid_tbl ); if (p_mgrp == (osm_mgrp_t*)cl_qmap_end( &p_subn->mgrp_mlid_tbl )) @@ -1249,7 +1257,7 @@ osm_mcmr_rcv_create_new_mgrp( we allocate a new mlid number before we might use it for MGID ... */ - mlid = __get_new_mlid(p_rcv); + mlid = __get_new_mlid(p_rcv, mcm_rec.mlid); if ( mlid == 0 ) { osm_log( p_rcv->p_log, OSM_LOG_ERROR, diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c index 993b7eb..b830a9c 100644 --- a/osm/opensm/osm_state_mgr.c +++ b/osm/opensm/osm_state_mgr.c @@ -2220,6 +2220,11 @@ osm_state_mgr_process( /* the returned signal is always DONE */ signal = osm_qos_setup(p_mgr->p_subn->p_osm); + /* try to restore SA DB (this should be before lid_mgr + because we may want to disable clients reregistration + when SA DB is restored) */ + osm_sa_db_file_load(p_mgr->p_subn->p_osm); + break; default: @@ -2806,6 +2811,9 @@ osm_state_mgr_process( osm_topology_file_create( p_mgr ); __osm_state_mgr_report( p_mgr ); __osm_state_mgr_up_msg( p_mgr ); + + if( osm_log_is_active(p_mgr->p_log, OSM_LOG_VERBOSE) ) + osm_sa_db_file_dump(p_mgr->p_subn->p_osm); } p_mgr->state = OSM_SM_STATE_PROCESS_REQUEST; signal = OSM_SIGNAL_IDLE_TIME_PROCESS; diff --git a/osm/opensm/osm_subnet.c 
b/osm/opensm/osm_subnet.c index e53aad4..56963c4 100644 --- a/osm/opensm/osm_subnet.c +++ b/osm/opensm/osm_subnet.c @@ -495,8 +495,10 @@ osm_subn_set_default_opt( p_opt->lid_matrix_dump_file = NULL; p_opt->ucast_dump_file = NULL; p_opt->updn_guid_file = NULL; + p_opt->sa_db_file = NULL; p_opt->exit_on_fatal = TRUE; p_opt->enable_quirks = FALSE; + p_opt->no_clients_rereg = FALSE; subn_set_default_qos_options(&p_opt->qos_options); subn_set_default_qos_options(&p_opt->qos_ca_options); subn_set_default_qos_options(&p_opt->qos_sw0_options); @@ -972,6 +974,10 @@ osm_subn_parse_conf_file( "updn_guid_file", p_key, p_val, &p_opts->updn_guid_file); + __osm_subn_opts_unpack_charp( + "sa_db_file", + p_key, p_val, &p_opts->sa_db_file); + __osm_subn_opts_unpack_boolean( "exit_on_fatal", p_key, p_val, &p_opts->exit_on_fatal); @@ -1157,6 +1163,11 @@ osm_subn_write_conf_file( "# One guid in each line\n" "updn_guid_file %s\n\n", p_opts->updn_guid_file); + if (p_opts->sa_db_file) + fprintf( opts_file, + "# SA database file name\n" + "sa_db_file %s\n\n", + p_opts->sa_db_file); fprintf( opts_file, -- 1.4.3.2.g4bf7 From sashak at voltaire.com Wed Nov 15 16:59:25 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Nov 2006 02:59:25 +0200 Subject: [openib-general] ANNOUNCE: libibverbs and libmthca moving to git In-Reply-To: References: Message-ID: <20061116005925.GS31078@sashak.voltaire.com> Hi Roland, On 15:32 Wed 15 Nov , Roland Dreier wrote: > I've converted the libibverbs and libmthca svn history into git (with > some minor cleanups, mostly improving the changelog entries). I've > also created git tags for all of the releases noted in the history. > The git trees > > git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git > git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git > > should now be considered the authoritative sources for libibverbs and > libmthca, and I will not be updating the svn trees. 
Are you planning to mirror the libibverbs and libmthca trees on the new OFA
server too (git.kernel.org was overloaded last time)?

Sasha

From rjwalsh at pathscale.com Wed Nov 15 17:07:41 2006
From: rjwalsh at pathscale.com (Robert Walsh)
Date: Wed, 15 Nov 2006 17:07:41 -0800
Subject: [openib-general] Question about multicast GIDs
In-Reply-To:
References: <455B9EA0.6070106@pathscale.com>
Message-ID: <455BB9DD.8080500@pathscale.com>

Roland Dreier wrote:
> > Is there a registration authority for multicast GIDs? Or at least a
> > safe way of assigning a range of GIDs to a vendor?
>
> I don't think so. Perhaps RFC 3307 would be of some use...

Ah - looks exactly like what I was looking for. Thanks.

From rdreier at cisco.com Wed Nov 15 17:14:12 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 15 Nov 2006 17:14:12 -0800
Subject: [openib-general] ANNOUNCE: libibverbs and libmthca moving to git
In-Reply-To: <20061116005925.GS31078@sashak.voltaire.com> (Sasha Khapyorsky's message of "Thu, 16 Nov 2006 02:59:25 +0200")
References: <20061116005925.GS31078@sashak.voltaire.com>
Message-ID:

> Are you planning to mirror the libibverbs and libmthca trees on the new
> OFA server too (git.kernel.org was overloaded last time)?

I wasn't planning to try and set that up, but git makes it pretty trivial
for someone else to mirror trees wherever they want.

 - R.

From venkatesh.babu at 3leafnetworks.com Wed Nov 15 19:22:45 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Wed, 15 Nov 2006 19:22:45 -0800
Subject: [openib-general] OpenSM log growing too big
In-Reply-To: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com>
References: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com>
Message-ID: <455BD985.3010900@3leafnetworks.com>

I have the OFED 1.0 stack and am running OpenSM on a server connected to an
IB subnet with a couple of nodes. Usually the log file is small, but
occasionally it grows too big and fills up the whole hard disk.
[root@vortex3l-88 ~]# ls -l /var/log/opensm*
-rw-r--r-- 1 root root 33879121502 Nov 15 14:54 /var/log/opensm.log

Most of the opensm.log file is filled with the following messages. Of the
240,168,770 lines in the log file, 239,782,972 are from
__osm_trap_rcv_process_request.

Nov 14 13:59:35 273746 [42803960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 127 times consecutively
Nov 14 13:59:35 273908 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733372Nov 14 13:59:35 273966 [41401960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 128 times consecutively
Nov 14 13:59:35 274176 [41E02960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733373Nov 14 13:59:35 274234 [41E02960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 129 times consecutively
Nov 14 13:59:35 274380 [43204960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733374Nov 14 13:59:35 274436 [43204960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 130 times consecutively
Nov 14 13:59:35 274662 [42803960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733375Nov 14 13:59:35 274720 [42803960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 131 times consecutively
Nov 14 13:59:35 274970 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733376Nov 14 13:59:35 275026 [41401960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 132 times consecutively

From swise at opengridcomputing.com Wed Nov 15 19:58:26 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 15 Nov 2006 21:58:26 -0600
Subject: [openib-general] [PATCH 00/13] Chelsio T3 RDMA Driver
Message-ID: <20061116035826.22635.61230.stgit@dell3.ogc.int> Roland / All: The following series implements the Chelsio T3 iWARP/RDMA Driver to be considered for inclusion in 2.6.20. It depends on the Chelsio T3 Ethernet Driver which is also under review now for 2.6.20 (http://marc.theaimsgroup.com/?l=linux-netdev&m=116363600816597&w=2). The patches are against 2.6.19-rc5. This patch series can also be pulled from: http://www.opengridcomputing.com/downloads/iw_cxgb3_patches.tar.bz2 The Chelsio T3 Ethernet Driver patch can be pulled from: http://service.chelsio.com/kernel.org/cxgb3.patch.bz2 Thanks, Steve. From swise at opengridcomputing.com Wed Nov 15 19:58:32 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:58:32 -0600 Subject: [openib-general] [PATCH 01/13] Linux RDMA Core Changes In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035831.22635.95377.stgit@dell3.ogc.int> Support provider-specific data in ib_uverbs_cmd_req_notify_cq(). The Chelsio iwarp provider library needs to pass information to the kernel verb for re-arming the CQ. 
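The interface change described above can be sketched in plain C. This is only a minimal userspace illustration, not code from the patch: every demo_* name is hypothetical, and the single "rptr" value merely stands in for whatever provider-specific data a library might pass (the description does not say what the Chelsio library actually sends). In-kernel callers hand the provider a NULL pointer, while the user command path wraps the bytes that follow the generic command.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel types; none of these names
 * come from the patch itself. */
enum demo_cq_notify { DEMO_CQ_SOLICITED, DEMO_CQ_NEXT_COMP };

struct demo_udata {
	const void *inbuf;	/* provider-specific bytes after the generic command */
	size_t inlen;		/* number of such bytes, possibly 0 */
};

struct demo_cq;

struct demo_device {
	/* The verb now carries an optional provider-data argument. */
	int (*req_notify_cq)(struct demo_cq *cq, enum demo_cq_notify notify,
			     struct demo_udata *udata);
};

struct demo_cq {
	struct demo_device *device;
	unsigned int rptr;	/* assumed provider state, for illustration only */
	int armed;
};

/* A provider that uses the extra data when present and ignores it otherwise. */
int demo_provider_arm_cq(struct demo_cq *cq, enum demo_cq_notify notify,
			 struct demo_udata *udata)
{
	unsigned int rptr;

	(void)notify;
	if (udata && udata->inbuf && udata->inlen >= sizeof(rptr)) {
		memcpy(&rptr, udata->inbuf, sizeof(rptr));
		cq->rptr = rptr;
	}
	cq->armed = 1;
	return 0;
}

/* In-kernel style wrapper: there is no user data, so pass NULL. */
int demo_req_notify_cq(struct demo_cq *cq, enum demo_cq_notify notify)
{
	assert(cq != NULL && cq->device != NULL);
	return cq->device->req_notify_cq(cq, notify, NULL);
}

/* User command path: forward whatever followed the generic command. */
int demo_uverbs_req_notify_cq(struct demo_cq *cq, int solicited_only,
			      const void *extra, size_t extra_len)
{
	struct demo_udata udata = { extra, extra_len };

	assert(cq != NULL && cq->device != NULL);
	return cq->device->req_notify_cq(cq, solicited_only ?
					 DEMO_CQ_SOLICITED : DEMO_CQ_NEXT_COMP,
					 &udata);
}
```

Keeping a NULL-passing wrapper means existing in-kernel callers need no changes; only a provider that cares about the extra data has to look at it.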
Signed-off-by: Steve Wise --- drivers/infiniband/core/uverbs_cmd.c | 9 +++++++-- drivers/infiniband/hw/amso1100/c2.h | 2 +- drivers/infiniband/hw/amso1100/c2_cq.c | 3 ++- drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 ++- drivers/infiniband/hw/ehca/ehca_reqs.c | 3 ++- drivers/infiniband/hw/ipath/ipath_cq.c | 4 +++- drivers/infiniband/hw/ipath/ipath_verbs.h | 3 ++- drivers/infiniband/hw/mthca/mthca_cq.c | 6 ++++-- drivers/infiniband/hw/mthca/mthca_dev.h | 4 ++-- include/rdma/ib_verbs.h | 5 +++-- 10 files changed, 28 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 743247e..5dd1de9 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i int out_len) { struct ib_uverbs_req_notify_cq cmd; + struct ib_udata udata; struct ib_cq *cq; if (copy_from_user(&cmd, buf, sizeof cmd)) @@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i if (!cq) return -EINVAL; - ib_req_notify_cq(cq, cmd.solicited_only ? - IB_CQ_SOLICITED : IB_CQ_NEXT_COMP); + INIT_UDATA(&udata, buf + sizeof cmd, 0, + in_len - sizeof cmd, 0); + + cq->device->req_notify_cq(cq, cmd.solicited_only ? 
+ IB_CQ_SOLICITED : IB_CQ_NEXT_COMP, + &udata); put_cq_read(cq); diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h index 1b17dcd..716f9dc 100644 --- a/drivers/infiniband/hw/amso1100/c2.h +++ b/drivers/infiniband/hw/amso1100/c2.h @@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index); extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index); extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata); /* CM */ extern int c2_llp_connect(struct iw_cm_id *cm_id, diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c index 05c9154..7ce8bca 100644 --- a/drivers/infiniband/hw/amso1100/c2_cq.c +++ b/drivers/infiniband/hw/amso1100/c2_cq.c @@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n return npolled; } -int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct c2_mq_shared __iomem *shared; struct c2_cq *cq; diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h index 3720e30..566b30c 100644 --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h @@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n int ehca_peek_cq(struct ib_cq *cq, int wc_cnt); -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify); +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata); struct ib_qp *ehca_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *init_attr, diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index b46bda1..3ed6992 100644 --- 
a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -634,7 +634,8 @@ poll_cq_exit0: return ret; } -int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) +int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify, + struct ib_udata *udata) { struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq); diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c index 87462e0..27ba4db 100644 --- a/drivers/infiniband/hw/ipath/ipath_cq.c +++ b/drivers/infiniband/hw/ipath/ipath_cq.c @@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq) * ipath_req_notify_cq - change the notification type for a completion queue * @ibcq: the completion queue * @notify: the type of notification to request + * @udata: user data * * Returns 0 for success. * * This may be called from interrupt context. Also called by * ib_req_notify_cq() in the generic verbs code. */ -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct ipath_cq *cq = to_icq(ibcq); unsigned long flags; diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h index 8039f6e..0d39960 100644 --- a/drivers/infiniband/hw/ipath/ipath_verbs.h +++ b/drivers/infiniband/hw/ipath/ipath_verbs.h @@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_ int ipath_destroy_cq(struct ib_cq *ibcq); -int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify); +int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata); int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 149b369..ec7bb79 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -723,7 +723,8 @@ 
repoll: return err == 0 || err == -EAGAIN ? npolled : err; } -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify) +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, + struct ib_udata *udata) { __be32 doorbell[2]; @@ -740,7 +741,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq, return 0; } -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify) +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) { struct mthca_cq *cq = to_mcq(ibcq); __be32 doorbell[2]; diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h index fe5cecf..6b9ccf6 100644 --- a/drivers/infiniband/hw/mthca/mthca_dev.h +++ b/drivers/infiniband/hw/mthca/mthca_dev.h @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev int mthca_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify); +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata); int mthca_init_cq(struct mthca_dev *dev, int nent, struct mthca_ucontext *ctx, u32 pdn, struct mthca_cq *cq); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 8eacc35..e3e1a2c 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -941,7 +941,8 @@ struct ib_device { struct ib_wc *wc); int (*peek_cq)(struct ib_cq *cq, int wc_cnt); int (*req_notify_cq)(struct ib_cq *cq, - enum ib_cq_notify cq_notify); + enum ib_cq_notify cq_notify, + struct ib_udata *udata); int (*req_ncomp_notif)(struct ib_cq *cq, int wc_cnt); struct ib_mr * (*get_dma_mr)(struct ib_pd *pd, @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_ static inline int ib_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify) { - return 
cq->device->req_notify_cq(cq, cq_notify); + return cq->device->req_notify_cq(cq, cq_notify, NULL); } /** From swise at opengridcomputing.com Wed Nov 15 19:58:37 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:58:37 -0600 Subject: [openib-general] [PATCH 02/13] Device Discovery and ULLD Linkage In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035837.22635.13571.stgit@dell3.ogc.int> Code to discover all the T3 devices and register them with the T3 RDMA Core and the Linux RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch.c | 222 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch.h | 134 ++++++++++++++++++++++ 2 files changed, 356 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c new file mode 100644 index 0000000..f45f005 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.c @@ -0,0 +1,222 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include + +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" +#include "iwch_user.h" +#include "iwch.h" +#include "iwch_cm.h" + +#define DRV_VERSION "1.1" + +MODULE_AUTHOR("Boyd Faulkner, Steve Wise"); +MODULE_DESCRIPTION("Chelsio T3 RDMA Driver"); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_VERSION(DRV_VERSION); + +cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; + +static void open_rnic_dev(struct t3cdev *); +static void close_rnic_dev(struct t3cdev *); + +struct cxgb3_client t3c_client = { + .name = "iw_cxgb3", + .add = open_rnic_dev, + .remove = close_rnic_dev, + .handlers = t3c_handlers, + .redirect = iwch_ep_redirect +}; + +static LIST_HEAD(dev_list); +static DEFINE_MUTEX(dev_mutex); + +static inline void *vzmalloc(int size) +{ + void *p = vmalloc(size); + memset(p, 0, size); + return p; +} + +static int open_rnic_init(struct iwch_dev *rnicp) +{ + PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); + rnicp->pdid2ptr = vzmalloc(sizeof(void*) * T3_MAX_NUM_PD); + if (!rnicp->pdid2ptr) + goto pdid_err; + rnicp->cqid2ptr = vzmalloc(sizeof(void*) * T3_MAX_NUM_CQ); + if (!rnicp->cqid2ptr) + goto cqid_err; + rnicp->qpid2ptr = vzmalloc(sizeof(void*) * T3_MAX_NUM_QP); + if (!rnicp->qpid2ptr) + goto qpid_err; + rnicp->mmid2ptr = 
vzmalloc(sizeof(void*) * + cxio_num_stags(&rnicp->rdev)); + if (!rnicp->mmid2ptr) + goto stag_err; + + spin_lock_init(&rnicp->lock); + + rnicp->attr.vendor_id = 0x168; + rnicp->attr.vendor_part_id = 7; + rnicp->attr.max_qps = T3_MAX_NUM_QP - 32; + rnicp->attr.max_wrs = (1UL << 24) - 1; + rnicp->attr.max_sge_per_wr = T3_MAX_SGE; + rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE; + rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1; + rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1; + rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev); + rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE; + rnicp->attr.max_pds = T3_MAX_NUM_PD - 1; + rnicp->attr.mem_pgsizes_bitmask = 0x7FFF; /* 4KB-128MB */ + rnicp->attr.can_resize_wq = 0; + rnicp->attr.max_rdma_reads_per_qp = 8; + rnicp->attr.max_rdma_read_resources = + rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps; + rnicp->attr.max_rdma_read_qp_depth = 8; /* IRD */ + rnicp->attr.max_rdma_read_depth = + rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps; + rnicp->attr.rq_overflow_handled = 0; + rnicp->attr.can_modify_ird = 0; + rnicp->attr.can_modify_ord = 0; + rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1; + rnicp->attr.stag0_value = 1; + rnicp->attr.zbva_support = 1; + rnicp->attr.local_invalidate_fence = 1; + rnicp->attr.cq_overflow_detection = 1; + return 0; + +stag_err: + vfree(rnicp->qpid2ptr); +qpid_err: + vfree(rnicp->cqid2ptr); +cqid_err: + vfree(rnicp->pdid2ptr); +pdid_err: + return -ENOMEM; +} + +static void open_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *rnicp; + static int vers_printed; + + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + if (!vers_printed++) + printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n", + DRV_VERSION); + rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp)); + if (!rnicp) { + printk(KERN_ERR MOD "Cannot allocate ib device\n"); + return; + } + rnicp->rdev.ulp = rnicp; + rnicp->rdev.t3cdev_p = tdev; + + if (cxio_rdev_open(&rnicp->rdev)) { + 
printk(KERN_ERR MOD "Unable to open CXIO rdev\n"); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + if (open_rnic_init(rnicp)) { + printk(KERN_ERR MOD "Unable to initialize device\n"); + cxio_rdev_close(&rnicp->rdev); + ib_dealloc_device(&rnicp->ibdev); + return; + } + + mutex_lock(&dev_mutex); + list_add_tail(&rnicp->entry, &dev_list); + mutex_unlock(&dev_mutex); + + if (iwch_register_device(rnicp)) { + printk(KERN_ERR MOD "Unable to register device\n"); + close_rnic_dev(tdev); + } + printk(KERN_INFO MOD "Initialized device %s\n", + pci_name(rnicp->rdev.rnic_info.pdev)); + return; +} + +static void close_rnic_dev(struct t3cdev *tdev) +{ + struct iwch_dev *dev, *tmp; + PDBG("%s t3cdev %p\n", __FUNCTION__, tdev); + mutex_lock(&dev_mutex); + list_for_each_entry_safe(dev, tmp, &dev_list, entry) { + if (dev->rdev.t3cdev_p == tdev) { + list_del(&dev->entry); + iwch_unregister_device(dev); + cxio_rdev_close(&dev->rdev); + vfree(dev->pdid2ptr); + vfree(dev->cqid2ptr); + vfree(dev->mmid2ptr); + vfree(dev->qpid2ptr); + ib_dealloc_device(&dev->ibdev); + break; + } + } + mutex_unlock(&dev_mutex); +} + +extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb); + +static int __init iwch_init_module(void) +{ + int err; + + err = cxio_hal_init(); + if (err) + return err; + err = iwch_cm_init(); + if (err) + return err; + cxio_register_ev_cb(iwch_ev_dispatch); + cxgb3_register_client(&t3c_client); + return 0; +} + +static void __exit iwch_exit_module(void) +{ + cxgb3_unregister_client(&t3c_client); + cxio_unregister_ev_cb(iwch_ev_dispatch); + iwch_cm_term(); + cxio_hal_exit(); +} + +module_init(iwch_init_module); +module_exit(iwch_exit_module); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h new file mode 100644 index 0000000..fe0a557 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_H__ +#define __IWCH_H__ + +#include +#include +#include + +#include + +#include "cxio_hal.h" +#include "cxgb3_offload.h" + +struct iwch_pd; +struct iwch_cq; +struct iwch_qp; +struct iwch_mr; + +struct iwch_rnic_attributes { + u32 vendor_id; + u32 vendor_part_id; + u32 max_qps; + u32 max_wrs; /* Max for any SQ/RQ */ + u32 max_sge_per_wr; + u32 max_sge_per_rdma_write_wr; /* for RDMA Write WR */ + u32 max_cqs; + u32 max_cqes_per_cq; + u32 max_mem_regs; + u32 max_phys_buf_entries; /* for phys buf list */ + u32 max_pds; + + /* + * The memory page sizes supported by this RNIC. 
+ * Bit position i in bitmap indicates page of + * size (4k)^i. Phys block list mode unsupported. + */ + u32 mem_pgsizes_bitmask; + u8 can_resize_wq; + + /* + * The maximum number of RDMA Reads that can be outstanding + * per QP with this RNIC as the target. + */ + u32 max_rdma_reads_per_qp; + + /* + * The maximum number of resources used for RDMA Reads + * by this RNIC with this RNIC as the target. + */ + u32 max_rdma_read_resources; + + /* + * The max depth per QP for initiation of RDMA Read + * by this RNIC. + */ + u32 max_rdma_read_qp_depth; + + /* + * The maximum depth for initiation of RDMA Read + * operations by this RNIC on all QPs + */ + u32 max_rdma_read_depth; + u8 rq_overflow_handled; + u32 can_modify_ird; + u32 can_modify_ord; + u32 max_mem_windows; + u32 stag0_value; + u8 zbva_support; + u8 local_invalidate_fence; + u32 cq_overflow_detection; +}; + +struct iwch_dev { + struct ib_device ibdev; + struct cxio_rdev rdev; + u32 device_cap_flags; + struct iwch_rnic_attributes attr; + struct iwch_pd **pdid2ptr; + struct iwch_cq **cqid2ptr; + struct iwch_qp **qpid2ptr; + struct iwch_mr **mmid2ptr; + spinlock_t lock; + struct list_head entry; +}; + +static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct iwch_dev, ibdev); +} + +static inline int t3b_device(struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3B); +} + +static inline int t3a_device(struct iwch_dev *rhp) +{ + return (rhp->rdev.t3cdev_p->type == T3A); +} + +extern struct cxgb3_client t3c_client; +extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS]; +#endif From swise at opengridcomputing.com Wed Nov 15 19:58:42 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:58:42 -0600 Subject: [openib-general] [PATCH 03/13] Provider Methods and Data Structures In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: 
<20061116035842.22635.83591.stgit@dell3.ogc.int> Provider methods to support the Linux RDMA verbs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_provider.c | 1186 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_provider.h | 390 +++++++++ drivers/infiniband/hw/cxgb3/iwch_user.h | 68 ++ 3 files changed, 1644 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c new file mode 100644 index 0000000..11afe0c --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -0,0 +1,1186 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" +#include "iwch_user.h" + +static int iwch_modify_port(struct ib_device *ibdev, + u8 port, int port_modify_mask, + struct ib_port_modify *props) +{ + return -ENOSYS; +} + +static struct ib_ah *iwch_ah_create(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + return ERR_PTR(-ENOSYS); +} + +static int iwch_ah_destroy(struct ib_ah *ah) +{ + return -ENOSYS; +} + +static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) +{ + return -ENOSYS; +} + +static int iwch_process_mad(struct ib_device *ibdev, + int mad_flags, + u8 port_num, + struct ib_wc *in_wc, + struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + return -ENOSYS; +} + +static int iwch_dealloc_ucontext(struct ib_ucontext *context) +{ + struct iwch_dev *rhp = to_iwch_dev(context->device); + struct iwch_ucontext *ucontext = to_iwch_ucontext(context); + PDBG("%s context %p\n", __FUNCTION__, context); + cxio_release_ucontext(&rhp->rdev, &ucontext->uctx); + kfree(ucontext); + return 0; +} + +static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct iwch_ucontext *context; + struct iwch_dev *rhp = to_iwch_dev(ibdev); + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) + return ERR_PTR(-ENOMEM); + 
cxio_init_ucontext(&rhp->rdev, &context->uctx); + INIT_LIST_HEAD(&context->mmaps); + return &context->ibucontext; +} + +static int iwch_destroy_cq(struct ib_cq *ib_cq) +{ + struct iwch_cq *chp; + + PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq); + chp = to_iwch_cq(ib_cq); + + spin_lock_irq(&chp->rhp->lock); + chp->rhp->cqid2ptr[chp->cq.cqid] = NULL; + spin_unlock_irq(&chp->rhp->lock); + + atomic_dec(&chp->refcnt); + wait_event(chp->wait, !atomic_read(&chp->refcnt)); + + cxio_destroy_cq(&chp->rhp->rdev, &chp->cq); + kfree(chp); + return 0; +} + +static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + struct iwch_create_cq_resp uresp; + + PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries); + rhp = to_iwch_dev(ibdev); + chp = kzalloc(sizeof(*chp), GFP_KERNEL); + if (!chp) + return ERR_PTR(-ENOMEM); + + if (t3a_device(rhp)) { + + /* + * T3A: Add some fluff to handle extra CQEs inserted + * for various errors. + * Additional CQE possibilities: + * TERMINATE, + * incoming RDMA WRITE Failures + * incoming RDMA READ REQUEST FAILUREs + * NOTE: We cannot ensure the CQ won't overflow. 
+ */ + entries += 16; + } + entries = roundup_pow_of_two(entries); + chp->cq.size_log2 = long_log2(entries); + + if (cxio_create_cq(&rhp->rdev, &chp->cq)) { + kfree(chp); + return ERR_PTR(-ENOMEM); + } + chp->rhp = rhp; + chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1; + spin_lock_init(&chp->lock); + atomic_set(&chp->refcnt, 1); + init_waitqueue_head(&chp->wait); + + spin_lock_irq(&rhp->lock); + rhp->cqid2ptr[chp->cq.cqid] = chp; + spin_unlock_irq(&rhp->lock); + + if (context) { + struct iwch_mm_entry *mm; + + mm = kmalloc(sizeof *mm, GFP_KERNEL); + if (!mm) { + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-ENOMEM); + } + uresp.cqid = chp->cq.cqid; + uresp.size_log2 = chp->cq.size_log2; + uresp.physaddr = virt_to_phys(chp->cq.queue); + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm); + iwch_destroy_cq(&chp->ibcq); + return ERR_PTR(-EFAULT); + } + mm->addr = uresp.physaddr; + mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * + sizeof (struct t3_cqe)); + insert_mmap(to_iwch_ucontext(context), mm); + } + PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n", + chp->cq.cqid, chp, (1 << chp->cq.size_log2), + (u64)chp->cq.dma_addr); + return &chp->ibcq; +} + +static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata) +{ + struct iwch_cq *chp = to_iwch_cq(cq); + struct t3_cq oldcq, newcq; + int ret; + + PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe); + + /* We don't downsize... 
*/ + if (cqe <= cq->cqe) + return 0; + + /* create new t3_cq with new size */ + cqe = roundup_pow_of_two(cqe+1); + newcq.size_log2 = long_log2(cqe); + + /* Dont allow resize to less than the current wce count */ + if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) { + return -ENOMEM; + } + + /* Quiesce all QPs using this CQ */ + ret = iwch_quiesce_qps(chp); + if (ret) { + return ret; + } + + ret = cxio_create_cq(&chp->rhp->rdev, &newcq); + if (ret) { + kfree(chp); + return ret; + } + + /* copy CQEs */ + memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * + sizeof(struct t3_cqe)); + + /* old iwch_qp gets new t3_cq but keeps old cqid */ + oldcq = chp->cq; + chp->cq = newcq; + chp->cq.cqid = oldcq.cqid; + + /* resize new t3_cq to update the HW context */ + ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq); + if (ret) { + chp->cq = oldcq; + return ret; + } + chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1; + + /* destroy old t3_cq */ + oldcq.cqid = newcq.cqid; + ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq); + if (ret) { + printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", + __FUNCTION__, ret); + } + + /* add user hooks here */ + + /* resume qps */ + ret = iwch_resume_qps(chp); + return ret; +} + +static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + enum t3_cq_opcode cq_op; + int err; + int flags; + struct iwch_req_notify_cq ucmd; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + if (notify == IB_CQ_SOLICITED) + cq_op = CQ_ARM_SE; + else + cq_op = CQ_ARM_AN; + if (udata && t3b_device(rhp)) { + if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd)) + return -EFAULT; + spin_lock_irqsave(&chp->lock, flags); + chp->cq.rptr = ucmd.rptr; + } else + spin_lock_irqsave(&chp->lock, flags); + PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr); + err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0); + spin_unlock_irqrestore(&chp->lock, flags); + if (err) + printk(KERN_ERR MOD "Error %d rearming 
CQID 0x%x\n", err, + chp->cq.cqid); + return err; +} + +static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma) +{ + int len = vma->vm_end - vma->vm_start; + u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT; + struct cxio_rdev *rdev_p; + int ret = 0; + struct iwch_mm_entry *mm; + struct iwch_ucontext *ucontext; + + PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, + pgaddr, len); + + if (vma->vm_start & (PAGE_SIZE-1)) { + return -EINVAL; + } + + rdev_p = &(to_iwch_dev(context->device)->rdev); + ucontext = to_iwch_ucontext(context); + + mm = remove_mmap(ucontext, pgaddr, len); + if (!mm) + return -EINVAL; + kfree(mm); + + if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && + (pgaddr < (rdev_p->rnic_info.udbell_physbase + + rdev_p->rnic_info.udbell_len))) { + + /* + * Map T3 DB register. + */ + if (vma->vm_flags & VM_READ) { + return -EPERM; + } + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; + vma->vm_flags &= ~VM_MAYREAD; + ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } else { + + /* + * Map WQ or CQ contig dma memory... 
+ */ + ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, + len, vma->vm_page_prot); + } + + return ret; +} + +static int iwch_deallocate_pd(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + + php = to_iwch_pd(pd); + rhp = php->rhp; + PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid); + rhp->pdid2ptr[php->pdid] = NULL; + cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid); + kfree(php); + return 0; +} + +static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct iwch_pd *php; + u32 pdid; + struct iwch_dev *rhp; + + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + rhp = (struct iwch_dev *) ibdev; + pdid = cxio_hal_get_pdid(rhp->rdev.rscp); + if (!pdid) + return ERR_PTR(-EINVAL); + php = kzalloc(sizeof(*php), GFP_KERNEL); + if (!php) { + cxio_hal_put_pdid(rhp->rdev.rscp, pdid); + return ERR_PTR(-ENOMEM); + } + php->pdid = pdid; + php->rhp = rhp; + rhp->pdid2ptr[pdid] = php; + if (context) { + if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) { + iwch_deallocate_pd(&php->ibpd); + return ERR_PTR(-EFAULT); + } + } + PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php); + return &php->ibpd; +} + +static int iwch_dereg_mr(struct ib_mr *ib_mr) +{ + struct iwch_dev *rhp; + struct iwch_mr *mhp; + struct iwch_pd *php; + u32 mmid; + + PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr); + /* There can be no memory windows */ + if (atomic_read(&ib_mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(ib_mr); + rhp = mhp->rhp; + mmid = mhp->attr.stag >> 8; + cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, + mhp->attr.pbl_addr); + rhp->mmid2ptr[mmid] = NULL; + php = get_php(rhp, mhp->attr.pdid); + if (mhp->kva) + kfree((void *) (unsigned long) mhp->kva); + PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp); + kfree(mhp); + return 0; +} + +static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int 
acc, + u64 *iova_start) +{ + u64 *page_list; + int shift; + u64 total_size; + int npages; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + int ret; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + php = to_iwch_pd(pd); + rhp = php->rhp; + + acc = iwch_convert_access(acc); + + + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + /* First check that we have enough alignment */ + if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + if (num_phys_buf > 1 && + ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) { + ret = -EINVAL; + goto err; + } + + ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start, + &total_size, &npages, &shift, &page_list); + if (ret) + goto err; + + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + + /* NOTE: TPT perms are backwards from BIND WR perms! */ + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + ret = iwch_register_mem(rhp, php, mhp, shift, page_list); + kfree(page_list); + if (ret) { + goto err; + } + return &mhp->ibmr; +err: + kfree(mhp); + return ERR_PTR(ret); + +} + +static int iwch_reregister_phys_mem(struct ib_mr *mr, + int mr_rereg_mask, + struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, + int acc, u64 * iova_start) +{ + + struct iwch_mr mh, *mhp; + struct iwch_pd *php; + struct iwch_dev *rhp; + int new_acc; + u64 *page_list = NULL; + int shift = 0; + u64 total_size; + int npages; + int ret; + + PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd); + + /* There can be no memory windows */ + if (atomic_read(&mr->usecnt)) + return -EINVAL; + + mhp = to_iwch_mr(mr); + rhp = mhp->rhp; + php = to_iwch_pd(mr->pd); + + /* 
make sure we are on the same adapter */ + if (rhp != php->rhp) + return -EINVAL; + + new_acc = mhp->attr.perms; + + memcpy(&mh, mhp, sizeof *mhp); + + if (mr_rereg_mask & IB_MR_REREG_PD) + php = to_iwch_pd(pd); + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mh.attr.perms = iwch_convert_access(acc); + if (mr_rereg_mask & IB_MR_REREG_TRANS) + ret = build_phys_page_list(buffer_list, num_phys_buf, + iova_start, + &total_size, &npages, + &shift, &page_list); + + ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages); + kfree(page_list); + if (ret) { + return ret; + } + if (mr_rereg_mask & IB_MR_REREG_PD) + mhp->attr.pdid = php->pdid; + if (mr_rereg_mask & IB_MR_REREG_ACCESS) + mhp->attr.perms = acc; + if (mr_rereg_mask & IB_MR_REREG_TRANS) { + mhp->attr.zbva = 0; + mhp->attr.va_fbo = *iova_start; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) total_size; + mhp->attr.pbl_size = npages; + } + + return 0; +} + + +struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region, + int acc, struct ib_udata *udata) +{ + u64 *pages; + int shift, n, len; + int i, j, k; + int err = 0; + struct ib_umem_chunk *chunk; + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mr *mhp; + struct iwch_reg_user_mr_resp uresp; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + shift = ffs(region->page_size) - 1; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + + n = 0; + list_for_each_entry(chunk, &region->chunk_list, list) + n += chunk->nents; + + pages = kmalloc(n * sizeof(u64), GFP_KERNEL); + if (!pages) { + err = -ENOMEM; + goto err; + } + + acc = iwch_convert_access(acc); + + i = n = 0; + + list_for_each_entry(chunk, &region->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j) { + len = sg_dma_len(&chunk->page_list[j]) >> shift; + for (k = 0; k < len; ++k) { + pages[i++] = cpu_to_be64(sg_dma_address( + &chunk->page_list[j]) + + region->page_size * k); + } + } + + mhp->rhp = rhp; + 
mhp->attr.pdid = php->pdid; + mhp->attr.zbva = 0; + mhp->attr.perms = (acc & 0x1) << 3; + mhp->attr.perms |= (acc & 0x2) << 1; + mhp->attr.perms |= (acc & 0x4) >> 1; + mhp->attr.perms |= (acc & 0x8) >> 3; + mhp->attr.va_fbo = region->virt_base; + mhp->attr.page_size = shift - 12; + mhp->attr.len = (u32) region->length; + mhp->attr.pbl_size = i; + err = iwch_register_mem(rhp, php, mhp, shift, pages); + kfree(pages); + if (err) + goto err; + + if (udata && t3b_device(rhp)) { + uresp.pbl_addr = (mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3; + PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, + uresp.pbl_addr); + + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + iwch_dereg_mr(&mhp->ibmr); + err = -EFAULT; + goto err; + } + } + + return &mhp->ibmr; + +err: + kfree(mhp); + return ERR_PTR(err); +} + +struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc) +{ + struct ib_phys_buf bl; + u64 kva; + struct ib_mr *ibmr; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + + /* + * T3 only supports 32 bits of size. 
+ */ + bl.size = 0xffffffff; + bl.addr = 0; + kva = 0; + ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva); + return ibmr; +} + +struct ib_mw *iwch_alloc_mw(struct ib_pd *pd) +{ + struct iwch_dev *rhp; + struct iwch_pd *php; + struct iwch_mw *mhp; + u32 mmid; + u32 stag = 0; + int ret; + + php = to_iwch_pd(pd); + rhp = php->rhp; + mhp = kzalloc(sizeof(*mhp), GFP_KERNEL); + if (!mhp) + return ERR_PTR(-ENOMEM); + ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid); + if (ret) { + kfree(mhp); + return ERR_PTR(ret); + } + mhp->rhp = rhp; + mhp->attr.pdid = php->pdid; + mhp->attr.type = TPT_MW; + mhp->attr.stag = stag; + mmid = (stag) >> 8; + rhp->mmid2ptr[mmid] = (struct iwch_mr *) mhp; + PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag); + return &(mhp->ibmw); +} + +int iwch_dealloc_mw(struct ib_mw *mw) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_pd *php; + u32 mmid; + + mhp = to_iwch_mw(mw); + rhp = mhp->rhp; + mmid = (mw->rkey) >> 8; + php = get_php(rhp, mhp->attr.pdid); + cxio_deallocate_window(&rhp->rdev, mhp->attr.stag); + rhp->mmid2ptr[mmid] = NULL; + kfree(mhp); + PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp); + return 0; +} + +static int iwch_destroy_qp(struct ib_qp *ib_qp) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_qp_attributes attrs; + struct iwch_ucontext *ucontext; + + qhp = to_iwch_qp(ib_qp); + rhp = qhp->rhp; + + if (qhp->attr.state == IWCH_QP_STATE_RTS) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0); + } + wait_event(qhp->wait, !qhp->ep); + + spin_lock_irq(&rhp->lock); + rhp->qpid2ptr[qhp->wq.qpid] = NULL; + spin_unlock_irq(&rhp->lock); + + atomic_dec(&qhp->refcnt); + wait_event(qhp->wait, !atomic_read(&qhp->refcnt)); + + ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) + : NULL; + cxio_destroy_qp(&rhp->rdev, &qhp->wq, + ucontext ? 
&ucontext->uctx : &rhp->rdev.uctx); + + PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, + ib_qp, qhp->wq.qpid, qhp); + kfree(qhp); + return 0; +} + +static struct ib_qp *iwch_create_qp(struct ib_pd *pd, + struct ib_qp_init_attr *attrs, + struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + struct iwch_pd *php; + struct iwch_cq *schp; + struct iwch_cq *rchp; + struct iwch_create_qp_resp uresp; + int wqsize, sqsize, rqsize; + struct iwch_ucontext *ucontext; + + PDBG("%s ib_pd %p\n", __FUNCTION__, pd); + if (attrs->qp_type != IB_QPT_RC) + return ERR_PTR(-EINVAL); + php = to_iwch_pd(pd); + rhp = php->rhp; + schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid); + rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid); + if (!schp || !rchp) + return ERR_PTR(-EINVAL); + + /* The RQT size must be # of entries + 1 rounded up to a power of two */ + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr); + if (rqsize == attrs->cap.max_recv_wr) + rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1); + + /* T3 doesn't support RQT depth < 16 */ + if (rqsize < 16) + rqsize = 16; + + if (rqsize > T3_MAX_RQ_SIZE) + return ERR_PTR(-EINVAL); + + /* + * NOTE: The SQ and total WQ sizes don't need to be + * a power of two. However, all the code assumes + * they are. EG: Q_FREECNT() and friends. + */ + sqsize = roundup_pow_of_two(attrs->cap.max_send_wr); + wqsize = roundup_pow_of_two(rqsize + sqsize); + PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, + wqsize, sqsize, rqsize); + qhp = kzalloc(sizeof(*qhp), GFP_KERNEL); + if (!qhp) + return ERR_PTR(-ENOMEM); + qhp->wq.size_log2 = long_log2(wqsize); + qhp->wq.rq_size_log2 = long_log2(rqsize); + qhp->wq.sq_size_log2 = long_log2(sqsize); + ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL; + if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq, + ucontext ? 
&ucontext->uctx : &rhp->rdev.uctx)) { + kfree(qhp); + return ERR_PTR(-ENOMEM); + } + attrs->cap.max_recv_wr = rqsize - 1; + attrs->cap.max_send_wr = sqsize; + qhp->rhp = rhp; + qhp->attr.pd = php->pdid; + qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid; + qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid; + qhp->attr.sq_num_entries = attrs->cap.max_send_wr; + qhp->attr.rq_num_entries = attrs->cap.max_recv_wr; + qhp->attr.sq_max_sges = attrs->cap.max_send_sge; + qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge; + qhp->attr.rq_max_sges = attrs->cap.max_recv_sge; + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.next_state = IWCH_QP_STATE_IDLE; + + /* + * XXX - These don't get passed in from the openib user + * at create time. The CM sets them via a QP modify. + * Need to fix... I think the CM should + */ + qhp->attr.enable_rdma_read = 1; + qhp->attr.enable_rdma_write = 1; + qhp->attr.enable_bind = 1; + qhp->attr.max_ord = 1; + qhp->attr.max_ird = 1; + + spin_lock_init(&qhp->lock); + init_waitqueue_head(&qhp->wait); + atomic_set(&qhp->refcnt, 1); + + spin_lock_irq(&rhp->lock); + rhp->qpid2ptr[qhp->wq.qpid] = qhp; + spin_unlock_irq(&rhp->lock); + if (udata) { + + struct iwch_mm_entry *mm1, *mm2; + + mm1 = kmalloc(sizeof *mm1, GFP_KERNEL); + if (!mm1) { + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + mm2 = kmalloc(sizeof *mm2, GFP_KERNEL); + if (!mm2) { + kfree(mm1); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-ENOMEM); + } + + uresp.qpid = qhp->wq.qpid; + uresp.size_log2 = qhp->wq.size_log2; + uresp.sq_size_log2 = qhp->wq.sq_size_log2; + uresp.rq_size_log2 = qhp->wq.rq_size_log2; + uresp.physaddr = virt_to_phys(qhp->wq.queue); + uresp.doorbell = qhp->wq.udb; + if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) { + kfree(mm1); + kfree(mm2); + iwch_destroy_qp(&qhp->ibqp); + return ERR_PTR(-EFAULT); + } + mm1->addr = uresp.physaddr; + mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr)); + insert_mmap(ucontext, 
mm1); + mm2->addr = uresp.doorbell & PAGE_MASK; + mm2->len = PAGE_SIZE; + insert_mmap(ucontext, mm2); + } + qhp->ibqp.qp_num = qhp->wq.qpid; + init_timer(&(qhp->timer)); + PDBG("%s sq_num_entries %d, rq_num_entries %d " + "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n", + __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries, + qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2); + return (&qhp->ibqp); +} + +static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, + int attr_mask, struct ib_udata *udata) +{ + struct iwch_dev *rhp; + struct iwch_qp *qhp; + enum iwch_qp_attr_mask mask = 0; + struct iwch_qp_attributes attrs; + + PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp); + + /* iwarp does not support the RTR state */ + if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR)) + attr_mask &= ~IB_QP_STATE; + + /* Make sure we still have something left to do */ + if (!attr_mask) + return 0; + + memset(&attrs, 0, sizeof attrs); + qhp = to_iwch_qp(ibqp); + rhp = qhp->rhp; + + attrs.next_state = iwch_convert_state(attr->qp_state); + attrs.enable_rdma_read = (attr->qp_access_flags & + IB_ACCESS_REMOTE_READ) ? 1 : 0; + attrs.enable_rdma_write = (attr->qp_access_flags & + IB_ACCESS_REMOTE_WRITE) ? 1 : 0; + attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0; + + + mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0; + mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? 
+ (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0; + + return iwch_modify_qp(rhp, qhp, mask, &attrs, 0); +} + +void iwch_qp_add_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + atomic_inc(&(to_iwch_qp(qp)->refcnt)); +} + +void iwch_qp_rem_ref(struct ib_qp *qp) +{ + PDBG("%s ib_qp %p\n", __FUNCTION__, qp); + if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt))) + wake_up(&(to_iwch_qp(qp)->wait)); +} + +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn) +{ + PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn); + return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn); +} + + +static int iwch_query_pkey(struct ib_device *ibdev, + u8 port, u16 index, u16 * pkey) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + *pkey = 0; + return 0; +} + +static int iwch_query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct iwch_dev *dev; + + PDBG("%s ibdev %p, port %d, index %d, gid %p\n", + __FUNCTION__, ibdev, port, index, gid); + dev = to_iwch_dev(ibdev); + BUG_ON(port == 0 || port > 2); + memset(&(gid->raw[0]), 0, sizeof(gid->raw)); + memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6); + return 0; +} + +static int iwch_query_device(struct ib_device *ibdev, + struct ib_device_attr *props) +{ + + struct iwch_dev *dev; + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + + dev = to_iwch_dev(ibdev); + memset(props, 0, sizeof *props); + memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + props->device_cap_flags = dev->device_cap_flags; + props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor; + props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device; + props->max_mr_size = ~0ull; + props->max_qp = dev->attr.max_qps; + props->max_qp_wr = dev->attr.max_wrs; + props->max_sge = dev->attr.max_sge_per_wr; + props->max_sge_rd = 1; + props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp; + props->max_cq = dev->attr.max_cqs; + 
props->max_cqe = dev->attr.max_cqes_per_cq; + props->max_mr = dev->attr.max_mem_regs; + props->max_pd = dev->attr.max_pds; + props->local_ca_ack_delay = 0; + + return 0; +} + +static int iwch_query_port(struct ib_device *ibdev, + u8 port, struct ib_port_attr *props) +{ + PDBG("%s ibdev %p\n", __FUNCTION__, ibdev); + props->max_mtu = IB_MTU_4096; + props->lid = 0; + props->lmc = 0; + props->sm_lid = 0; + props->sm_sl = 0; + props->state = IB_PORT_ACTIVE; + props->phys_state = 0; + props->port_cap_flags = + IB_PORT_CM_SUP | + IB_PORT_SNMP_TUNNEL_SUP | + IB_PORT_REINIT_SUP | + IB_PORT_DEVICE_MGMT_SUP | + IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP; + props->gid_tbl_len = 1; + props->pkey_tbl_len = 1; + props->qkey_viol_cntr = 0; + props->active_width = 2; + props->active_speed = 2; + props->max_msg_sz = -1; + + return 0; +} + +static ssize_t show_rev(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type); +} + +static ssize_t show_fw_ver(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.fw_version); +} + +static ssize_t show_hca(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct iwch_dev, + ibdev.class_dev); + struct ethtool_drvinfo info; + struct net_device *lldev = dev->rdev.t3cdev_p->lldev; + + PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev); + lldev->ethtool_ops->get_drvinfo(lldev, &info); + return sprintf(buf, "%s\n", info.driver); +} + +static ssize_t show_board(struct class_device *cdev, char *buf) +{ + struct iwch_dev *dev = container_of(cdev, struct 
iwch_dev, + ibdev.class_dev); + PDBG("%s class dev 0x%p\n", __FUNCTION__, dev); + return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor, + dev->rdev.rnic_info.pdev->device); +} + +static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); + +static struct class_device_attribute *iwch_class_attributes[] = { + &class_device_attr_hw_rev, + &class_device_attr_fw_ver, + &class_device_attr_hca_type, + &class_device_attr_board_id +}; + +int iwch_register_device(struct iwch_dev *dev) +{ + int ret; + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX); + memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid)); + memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6); + dev->ibdev.owner = THIS_MODULE; + dev->device_cap_flags = + (IB_DEVICE_ZERO_STAG | + IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW); + + dev->ibdev.uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV); + dev->ibdev.node_type = RDMA_NODE_RNIC; + memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC)); + dev->ibdev.phys_port_cnt = 
dev->rdev.port_info.nports; + dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev); + dev->ibdev.query_device = iwch_query_device; + dev->ibdev.query_port = iwch_query_port; + dev->ibdev.modify_port = iwch_modify_port; + dev->ibdev.query_pkey = iwch_query_pkey; + dev->ibdev.query_gid = iwch_query_gid; + dev->ibdev.alloc_ucontext = iwch_alloc_ucontext; + dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext; + dev->ibdev.mmap = iwch_mmap; + dev->ibdev.alloc_pd = iwch_allocate_pd; + dev->ibdev.dealloc_pd = iwch_deallocate_pd; + dev->ibdev.create_ah = iwch_ah_create; + dev->ibdev.destroy_ah = iwch_ah_destroy; + dev->ibdev.create_qp = iwch_create_qp; + dev->ibdev.modify_qp = iwch_ib_modify_qp; + dev->ibdev.destroy_qp = iwch_destroy_qp; + dev->ibdev.create_cq = iwch_create_cq; + dev->ibdev.destroy_cq = iwch_destroy_cq; + dev->ibdev.resize_cq = iwch_resize_cq; + dev->ibdev.poll_cq = iwch_poll_cq; + dev->ibdev.get_dma_mr = iwch_get_dma_mr; + dev->ibdev.reg_phys_mr = iwch_register_phys_mem; + dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem; + dev->ibdev.reg_user_mr = iwch_reg_user_mr; + dev->ibdev.dereg_mr = iwch_dereg_mr; + dev->ibdev.alloc_mw = iwch_alloc_mw; + dev->ibdev.bind_mw = iwch_bind_mw; + dev->ibdev.dealloc_mw = iwch_dealloc_mw; + + dev->ibdev.attach_mcast = iwch_multicast_attach; + dev->ibdev.detach_mcast = iwch_multicast_detach; + dev->ibdev.process_mad = iwch_process_mad; + + dev->ibdev.req_notify_cq = iwch_arm_cq; + dev->ibdev.post_send = iwch_post_send; + dev->ibdev.post_recv = iwch_post_receive; + + + dev->ibdev.iwcm = + (struct iw_cm_verbs *) kmalloc(sizeof(struct iw_cm_verbs), + GFP_KERNEL); + dev->ibdev.iwcm->connect = iwch_connect; + dev->ibdev.iwcm->accept = iwch_accept_cr; + dev->ibdev.iwcm->reject = iwch_reject_cr; + dev->ibdev.iwcm->create_listen = iwch_create_listen; + dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen; + dev->ibdev.iwcm->add_ref = iwch_qp_add_ref; + 
dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref; + dev->ibdev.iwcm->get_qp = iwch_get_qp; + + ret = ib_register_device(&dev->ibdev); + if (ret) + goto bail1; + + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) { + ret = class_device_create_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + if (ret) { + goto bail2; + } + } + return 0; +bail2: + ib_unregister_device(&dev->ibdev); +bail1: + return ret; +} + +void iwch_unregister_device(struct iwch_dev *dev) +{ + int i; + + PDBG("%s iwch_dev %p\n", __FUNCTION__, dev); + for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) + class_device_remove_file(&dev->ibdev.class_dev, + iwch_class_attributes[i]); + ib_unregister_device(&dev->ibdev); + return; +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h new file mode 100644 index 0000000..34c23e6 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h @@ -0,0 +1,390 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_PROVIDER_H__ +#define __IWCH_PROVIDER_H__ + +#include +#include +#include +#include +#include "t3cdev.h" +#include "iwch.h" +#include "cxio_wr.h" +#include "cxio_hal.h" + +struct iwch_pd { + struct ib_pd ibpd; + u32 pdid; + struct iwch_dev *rhp; +}; + +static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct iwch_pd, ibpd); +} + +struct tpt_attributes { + u32 stag; + u32 state:1; + u32 type:2; + u32 rsvd:1; + enum tpt_mem_perm perms; + u32 remote_invaliate_disable:1; + u32 zbva:1; + u32 mw_bind_enable:1; + u32 page_size:5; + + u32 pdid; + u32 qpid; + u32 pbl_addr; + u32 len; + u64 va_fbo; + u32 pbl_size; +}; + +struct iwch_mr { + struct ib_mr ibmr; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +typedef struct iwch_mw iwch_mw_handle; + +static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct iwch_mr, ibmr); +} + +struct iwch_mw { + struct ib_mw ibmw; + struct iwch_dev *rhp; + u64 kva; + struct tpt_attributes attr; +}; + +static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw) +{ + return container_of(ibmw, struct iwch_mw, ibmw); +} + +struct iwch_cq { + struct ib_cq ibcq; + struct iwch_dev *rhp; + struct t3_cq cq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; +}; + +static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct iwch_cq, ibcq); +} + +enum IWCH_QP_FLAGS { + QP_QUIESCED = 0x01 
+}; + +struct iwch_mpa_attributes { + u8 recv_marker_enabled; + u8 xmit_marker_enabled; /* iWARP: enable inbound Read Resp. */ + u8 crc_enabled; + u8 version; /* 0 or 1 */ +}; + +struct iwch_qp_attributes { + u32 scq; + u32 rcq; + u32 sq_num_entries; + u32 rq_num_entries; + u32 sq_max_sges; + u32 sq_max_sges_rdma_write; + u32 rq_max_sges; + u32 state; + u8 enable_rdma_read; + u8 enable_rdma_write; /* enable inbound Read Resp. */ + u8 enable_bind; + u8 enable_mmid0_fastreg; /* Enable STAG0 + Fast-register */ + /* + * Next QP state. If specify the current state, only the + * QP attributes will be modified. + */ + u32 max_ord; + u32 max_ird; + u32 pd; /* IN */ + u32 next_state; + char terminate_buffer[52]; + u32 terminate_msg_len; + u8 is_terminate_local; + struct iwch_mpa_attributes mpa_attr; /* IN-OUT */ + struct iwch_ep *llp_stream_handle; + char *stream_msg_buf; /* Last stream msg. before Idle -> RTS */ + u32 stream_msg_buf_len; /* Only on Idle -> RTS */ +}; + +struct iwch_qp { + struct ib_qp ibqp; + struct iwch_dev *rhp; + struct iwch_ep *ep; + struct iwch_qp_attributes attr; + struct t3_wq wq; + spinlock_t lock; + atomic_t refcnt; + wait_queue_head_t wait; + enum IWCH_QP_FLAGS flags; + struct timer_list timer; +}; + +static inline int qp_quiesced(struct iwch_qp *qhp) +{ + return (qhp->flags & QP_QUIESCED); +} + +static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct iwch_qp, ibqp); +} + +void iwch_qp_add_ref(struct ib_qp *qp); +void iwch_qp_rem_ref(struct ib_qp *qp); +struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn); + +struct iwch_ucontext { + struct ib_ucontext ibucontext; + struct cxio_ucontext uctx; + struct list_head mmaps; +}; + +static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c) +{ + return container_of(c, struct iwch_ucontext, ibucontext); +} + +struct iwch_mm_entry { + struct list_head entry; + u64 addr; + unsigned len; +}; + +static inline struct iwch_mm_entry 
*remove_mmap(struct iwch_ucontext *ucontext, + u64 addr, unsigned len) +{ + struct list_head *pos, *nxt; + struct iwch_mm_entry *mm; + + mutex_lock(&ucontext->uctx.lock); + list_for_each_safe(pos, nxt, &ucontext->mmaps) { + + mm = list_entry(pos, struct iwch_mm_entry, entry); + if (mm->addr == addr && mm->len == len) { + list_del_init(&mm->entry); + mutex_unlock(&ucontext->uctx.lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, + mm->len); + return mm; + } + } + mutex_unlock(&ucontext->uctx.lock); + return NULL; +} + +static inline void insert_mmap(struct iwch_ucontext *ucontext, + struct iwch_mm_entry *mm) +{ + mutex_lock(&ucontext->uctx.lock); + PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len); + list_add_tail(&mm->entry, &ucontext->mmaps); + mutex_unlock(&ucontext->uctx.lock); +} + +enum iwch_qp_attr_mask { + IWCH_QP_ATTR_NEXT_STATE = 1 << 0, + IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7, + IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8, + IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9, + IWCH_QP_ATTR_MAX_ORD = 1 << 11, + IWCH_QP_ATTR_MAX_IRD = 1 << 12, + IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22, + IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23, + IWCH_QP_ATTR_MPA_ATTR = 1 << 24, + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25, + IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ | + IWCH_QP_ATTR_ENABLE_RDMA_WRITE | + IWCH_QP_ATTR_MAX_ORD | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_STREAM_MSG_BUFFER | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE) +}; + +int iwch_modify_qp(struct iwch_dev *rhp, + struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal); + +enum iwch_qp_state { + IWCH_QP_STATE_IDLE, + IWCH_QP_STATE_RTS, + IWCH_QP_STATE_ERROR, + IWCH_QP_STATE_TERMINATE, + IWCH_QP_STATE_CLOSING, + IWCH_QP_STATE_TOT +}; + +static inline int iwch_convert_state(enum ib_qp_state ib_state) +{ + switch (ib_state) { + case IB_QPS_RESET: + case IB_QPS_INIT: + return 
IWCH_QP_STATE_IDLE; + case IB_QPS_RTS: + return IWCH_QP_STATE_RTS; + case IB_QPS_SQD: + return IWCH_QP_STATE_CLOSING; + case IB_QPS_SQE: + return IWCH_QP_STATE_TERMINATE; + case IB_QPS_ERR: + return IWCH_QP_STATE_ERROR; + default: + return -1; + } +} + +enum iwch_mem_perms { + IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0, + IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1, + IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2, + IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3, + IWCH_MEM_ACCESS_ATOMICS = 1 << 4, + IWCH_MEM_ACCESS_BINDING = 1 << 5, + IWCH_MEM_ACCESS_LOCAL = + (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE), + IWCH_MEM_ACCESS_REMOTE = + (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ) + /* cannot go beyond 1 << 31 */ +} __attribute__ ((packed)); + +static inline u32 iwch_convert_access(int acc) +{ + return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0) + | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) | + (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) | + (acc & IB_ACCESS_MW_BIND ? 
IWCH_MEM_ACCESS_BINDING : 0) | + IWCH_MEM_ACCESS_LOCAL_READ; +} + +enum iwch_mmid_state { + IWCH_STAG_STATE_VALID, + IWCH_STAG_STATE_INVALID +}; + +enum iwch_qp_query_flags { + IWCH_QP_QUERY_CONTEXT_NONE = 0x0, /* No ctx; Only attrs */ + IWCH_QP_QUERY_CONTEXT_GET = 0x1, /* Get ctx + attrs */ + IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2, /* Not Supported */ + + /* + * Quiesce QP context; Consumer + * will NOT replay outstanding WR + */ + IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4, + IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8, + IWCH_QP_QUERY_TEST_USERWRITE = 0x32 /* Test special */ +}; + +static inline struct iwch_pd *get_php(struct iwch_dev *rhp, u32 pdid) +{ + if (pdid >= T3_MAX_NUM_PD) + return NULL; + return rhp->pdid2ptr[pdid]; +} + +static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid) +{ + if (cqid >= T3_MAX_NUM_CQ) + return NULL; + return rhp->cqid2ptr[cqid]; +} + +static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid) +{ + if (qpid >= T3_MAX_NUM_QP) + return NULL; + return rhp->qpid2ptr[qpid]; +} + +static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid) +{ + if (mmid >= rhp->attr.max_mem_regs) + return NULL; + return rhp->mmid2ptr[mmid]; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr); +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind); +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc); +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg); +int iwch_register_device(struct iwch_dev *dev); +void iwch_unregister_device(struct iwch_dev *dev); +int iwch_quiesce_qps(struct iwch_cq *chp); +int iwch_resume_qps(struct iwch_cq *chp); +void stop_read_rep_timer(struct iwch_qp *qhp); +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list); +int 
iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list, + int npages); +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + u64 **page_list); + + +#define IWCH_NODE_DESC "cxgb3 Chelsio Communications" + +#endif diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h new file mode 100644 index 0000000..4e4b9c9 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_user.h @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __IWCH_USER_H__ +#define __IWCH_USER_H__ + +#define IWCH_UVERBS_ABI_VERSION 1 + +/* + * Make sure that all structs defined in this file remain laid out so + * that they pack the same way on 32-bit and 64-bit architectures (to + * avoid incompatibility between 32-bit userspace and 64-bit kernels). + * In particular do not use pointer types -- pass pointers in __u64 + * instead. + */ + +struct iwch_create_cq_resp { + __u64 physaddr; + __u32 cqid; + __u32 size_log2; +}; + +struct iwch_create_qp_resp { + __u64 physaddr; + __u64 doorbell; + __u32 qpid; + __u32 size_log2; + __u32 sq_size_log2; + __u32 rq_size_log2; +}; + +struct iwch_reg_user_mr_resp { + __u32 pbl_addr; +}; + +struct iwch_req_notify_cq { + __u32 rptr; +}; +#endif From swise at opengridcomputing.com Wed Nov 15 19:58:47 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:58:47 -0600 Subject: [openib-general] [PATCH 04/13] Connection Manager In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035847.22635.87333.stgit@dell3.ogc.int> This code implements the iWARP CM provider methods for the Chelsio driver. The Chelsio ULLD is used to setup and teardown TCP connections, and the T3 RDMA Core is used to move the connections in and out of RDMA mode. 
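For readers following the description above, here is a minimal userspace sketch of the endpoint state tracking that the CM code relies on. It is not the driver code: the names ep_sketch, ep_init, ep_state_comp_exch and ep_state_read are hypothetical, the enum only mirrors the state names listed in the patch's states[] table, and a C11 atomic compare-and-exchange stands in for the spinlock-protected state_comp_exch() in the patch. The transitions shown in any caller are illustrative and do not claim to be the driver's actual connect sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical mirror of the state names in the patch's states[] table. */
enum ep_state {
	EP_IDLE, EP_LISTEN, EP_CONNECTING, EP_MPA_WAIT_REQ, EP_MPA_REQ_SENT,
	EP_MPA_REQ_RCVD, EP_MPA_REP_SENT, EP_FPDU_MODE, EP_ABORTING,
	EP_CLOSING, EP_MORIBUND, EP_DEAD
};

/* Assumption: a bare state word; the real endpoint carries much more. */
struct ep_sketch {
	_Atomic int state;
};

/* Start every endpoint in the idle state. */
static void ep_init(struct ep_sketch *ep)
{
	assert(ep != NULL);
	atomic_init(&ep->state, EP_IDLE);
}

/*
 * Move the endpoint to 'exch' only if it is currently in 'comp'.
 * Returns 1 if the transition happened and 0 otherwise, so that two
 * racing callers cannot both win the same transition.
 */
static int ep_state_comp_exch(struct ep_sketch *ep,
			      enum ep_state comp, enum ep_state exch)
{
	int expected = comp;

	return atomic_compare_exchange_strong(&ep->state, &expected, exch);
}

/* Read the current state. */
static enum ep_state ep_state_read(struct ep_sketch *ep)
{
	return (enum ep_state)atomic_load(&ep->state);
}
```

A caller would initialise the endpoint with ep_init() and then gate each step on ep_state_comp_exch(), treating a 0 return as "another path already moved this endpoint".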
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 2121 +++++++++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/iwch_cm.h | 223 +++ 2 files changed, 2344 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c new file mode 100644 index 0000000..5f1954d --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -0,0 +1,2121 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#include +#include + +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "cxgb3_offload.h" +#include "iwch.h" +#include "iwch_provider.h" +#include "iwch_cm.h" + +char *states[] = { + "idle", + "listen", + "connecting", + "mpa_wait_req", + "mpa_req_sent", + "mpa_req_rcvd", + "mpa_rep_sent", + "fpdu_mode", + "aborting", + "closing", + "moribund", + "dead", + NULL, +}; + +static int ep_timeout_secs = 10; +module_param(ep_timeout_secs, int, 0444); +MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout " + "in seconds (default=10)"); + +static int mpa_rev = 1; +module_param(mpa_rev, int, 0444); +MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, " + "1 is spec compliant. (default=1)"); + +static int markers_enabled = 0; +module_param(markers_enabled, int, 0444); +MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)"); + +static int crc_enabled = 1; +module_param(crc_enabled, int, 0444); +MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)"); + +static int rcv_win = 512 * 1024; +module_param(rcv_win, int, 0444); +MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)"); + +static int snd_win = 512 * 1024; +module_param(snd_win, int, 0444); +MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)"); + +static unsigned int nocong = 1; +module_param(nocong, uint, 0444); +MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)"); + +static void process_work(void *ctx); +static struct workqueue_struct *workq; +DECLARE_WORK(skb_work, process_work, NULL); + +static struct sk_buff_head rxq; +static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS]; + +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp); +static void ep_timeout(unsigned long arg); +static void connect_reply_upcall(struct iwch_ep *ep, int status); + +static void start_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); 
+ if (timer_pending(&ep->timer)) { + PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + } else + ep_atomic_inc(&ep->com.refcnt); + ep->timer.expires = jiffies + ep_timeout_secs * HZ; + ep->timer.data = (unsigned long)ep; + ep->timer.function = ep_timeout; + add_timer(&ep->timer); +} + +static void stop_ep_timer(struct iwch_ep *ep) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + del_timer_sync(&ep->timer); + free_ep(&ep->com); +} + +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) +{ + struct cpl_tid_release *req; + + skb = get_skb(skb, sizeof *req, GFP_KERNEL); + if (!skb) { + return; + } + req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid)); + skb->priority = CPL_PRIORITY_SETUP; + tdev->send(tdev, skb); + return; +} + +static int migrate_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + +#if 1 /* XXX - Waiting for HW/FW Resolution on this... 
*/ + return 0; +#endif + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_T_MIGRATION); + req->mask = cpu_to_be64(1ULL << S_TCB_T_MIGRATION); + req->val = cpu_to_be64(1 << S_TCB_T_MIGRATION); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_quiesce_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = cpu_to_be64(1 << S_TCB_RX_QUIESCE); + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +int iwch_resume_tid(struct iwch_ep *ep) +{ + struct cpl_set_tcb_field *req; + struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + + if (!skb) { + return -ENOMEM; + } + req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid)); + req->reply = 0; + req->cpu_idx = 0; + req->word = htons(W_TCB_RX_QUIESCE); + req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE); + req->val = 0; + + skb->priority = CPL_PRIORITY_DATA; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static void set_emss(struct iwch_ep *ep, u16 opt) +{ + PDBG("%s 
ep %p opt %u\n", __FUNCTION__, ep, opt); + ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40; + if (G_TCPOPT_TSTAMP(opt)) { + ep->emss -= 12; + } + if (ep->emss < 128) + ep->emss = 128; + PDBG("emss=%d\n", ep->emss); +} + +static int state_comp_exch(struct iwch_ep_common *epc, + enum iwch_ep_state comp, + enum iwch_ep_state exch) +{ + unsigned long flags; + int ret; + + spin_lock_irqsave(&epc->lock, flags); + ret = (epc->state == comp); + if (ret) + epc->state = exch; + spin_unlock_irqrestore(&epc->lock, flags); + return ret; +} + +static enum iwch_ep_state state_read(struct iwch_ep_common *epc) +{ + unsigned long flags; + enum iwch_ep_state state; + + spin_lock_irqsave(&epc->lock, flags); + state = epc->state; + spin_unlock_irqrestore(&epc->lock, flags); + return state; +} + +static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new) +{ + unsigned long flags; + + spin_lock_irqsave(&epc->lock, flags); + PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], + states[new]); + epc->state = new; + spin_unlock_irqrestore(&epc->lock, flags); + return; +} + +static void *alloc_ep(int size, gfp_t gfp) +{ + struct iwch_ep_common *epc; + + epc = kmalloc(size, gfp); + if (epc) { + memset(epc, 0, size); + atomic_set(&epc->refcnt, 1); + spin_lock_init(&epc->lock); + init_waitqueue_head(&epc->waitq); + } + PDBG("%s alloc ep %p\n", __FUNCTION__, epc); + return (void *) epc; +} + +void __free_ep(struct iwch_ep_common *epc) +{ + PDBG("%s ep %p, &refcnt %p state %s, refcnt %d\n", + __FUNCTION__, epc, &epc->refcnt, + states[state_read(epc)], atomic_read(&epc->refcnt)); + + if (atomic_read(&epc->refcnt) == 1) { + goto out; + } + if (!atomic_dec_and_test(&epc->refcnt)) { + return; + } +out: + PDBG("free ep %p\n", epc); + kfree(epc); +} + +static void release_ep_resources(struct iwch_ep *ep) +{ + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + state_set(&ep->com, DEAD); + cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid); + 
dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + if (ep->com.tdev->type == T3B) + release_tid(ep->com.tdev, ep->hwtid, NULL); + free_ep(&ep->com); +} + +static void process_work(void *ctx) +{ + struct sk_buff *skb = NULL; + void *ep; + struct t3cdev *tdev; + int ret; + + while ((skb = skb_dequeue(&rxq))) { + ep = *((void **) (skb->cb)); + tdev = *((struct t3cdev **) (skb->cb + sizeof(void *))); + ret = work_handlers[G_OPCODE(ntohl(skb->csum))] + (tdev, skb, ep); + if (ret & CPL_RET_BUF_DONE) + kfree_skb(skb); + + /* + * ep was referenced in sched(), and is freed here. + */ + free_ep(ep); + } +} + +static int status2errno(int status) +{ + switch (status) { + case CPL_ERR_NONE: + return 0; + case CPL_ERR_CONN_RESET: + return -ECONNRESET; + case CPL_ERR_ARP_MISS: + return -EHOSTUNREACH; + case CPL_ERR_CONN_TIMEDOUT: + return -ETIMEDOUT; + case CPL_ERR_TCAM_FULL: + return -ENOMEM; + case CPL_ERR_CONN_EXIST: + return -EADDRINUSE; + default: + return -EIO; + } +} + +/* + * Try and reuse skbs already allocated... 
+ */ +static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp) +{ + if (skb) { + BUG_ON(skb_cloned(skb)); + skb_trim(skb, 0); + skb_get(skb); + } else { + skb = alloc_skb(len, gfp); + } + return skb; +} + +static struct rtable *find_route(struct t3cdev *dev, + u32 local_ip, u32 peer_ip, u16 local_port, + u16 peer_port, u8 tos) +{ + struct rtable *rt; + struct flowi fl = { + .oif = 0, + .nl_u = { + .ip4_u = { + .daddr = peer_ip, + .saddr = local_ip, + .tos = tos} + }, + .proto = IPPROTO_TCP, + .uli_u = { + .ports = { + .sport = local_port, + .dport = peer_port} + } + }; + + if (ip_route_output_flow(&rt, &fl, NULL, 0)) { + return NULL; + } + return rt; +} + +static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu) +{ + int i = 0; + + while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu) + ++i; + return i; +} + +static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb) +{ + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for an active open. + */ +static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + printk(KERN_ERR MOD "ARP failure during connect\n"); + kfree_skb(skb); +} + +/* + * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant + * and send it along. 
+ */ +static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb) +{ + struct cpl_abort_req *req = cplhdr(skb); + + PDBG("%s t3cdev %p\n", __FUNCTION__, dev); + req->cmd = CPL_ABORT_NO_RST; + cxgb3_ofld_send(dev, skb); +} + +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp) +{ + struct cpl_close_con_req *req; + struct sk_buff *skb; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid)); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp) +{ + struct cpl_abort_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(skb, sizeof(*req), gfp); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, abort_arp_failure); + req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ)); + req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid)); + req->cmd = CPL_ABORT_SEND_RST; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_connect(struct iwch_ep *ep) +{ + struct cpl_act_open_req *req; + struct sk_buff *skb; + u32 opt0h, opt0l, opt2; + unsigned int mtu_idx; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb.\n", + __FUNCTION__); + 
return -ENOMEM; + } + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + skb->priority = CPL_PRIORITY_SETUP; + set_arp_failure_handler(skb, act_open_req_arp_failure); + + req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid)); + req->local_port = ep->com.local_addr.sin_port; + req->peer_port = ep->com.remote_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_ip = ep->com.remote_addr.sin_addr.s_addr; + req->opt0h = htonl(opt0h); + req->opt0l = htonl(opt0l); + req->params = 0; + req->opt2 = htonl(opt2); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + + PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen); + + BUG_ON(skb_cloned(skb)); + + mpalen = sizeof(*mpa) + ep->plen; + if (skb->data + mpalen + sizeof(*req) > skb->end) { + kfree_skb(skb); + skb=alloc_skb(mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + connect_reply_upcall(ep, -ENOMEM); + return; + } + } + skb_trim(skb, 0); + skb_reserve(skb, sizeof(*req)); + skb_put(skb, mpalen); + skb->priority = CPL_PRIORITY_DATA; + mpa = (struct mpa_message *) skb->data; + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key)); + mpa->flags = (crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? 
MPA_MARKERS : 0); + mpa->private_data_size = htons(ep->plen); + mpa->revision = mpa_rev; + + if (ep->plen) { + memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen); + } + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. + */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + start_ep_timer(ep); + state_set(&ep->com, MPA_REQ_SENT); + return; +} + +static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = MPA_REJECT; + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) { + memcpy(mpa->private_data, pdata, plen); + } + + /* + * Reference the mpa skb again. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + skb->priority = CPL_PRIORITY_DATA; + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(mpalen); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + BUG_ON(ep->mpa_skb); + ep->mpa_skb = skb; + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen) +{ + int mpalen; + struct tx_data_wr *req; + struct mpa_message *mpa; + int len; + struct sk_buff *skb; + + PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen); + + mpalen = sizeof(*mpa) + plen; + + skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__); + return -ENOMEM; + } + skb->priority = CPL_PRIORITY_DATA; + skb_reserve(skb, sizeof(*req)); + mpa = (struct mpa_message *) skb_put(skb, mpalen); + memset(mpa, 0, sizeof(*mpa)); + memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key)); + mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | + (markers_enabled ? MPA_MARKERS : 0); + mpa->revision = mpa_rev; + mpa->private_data_size = htons(plen); + if (plen) { + memcpy(mpa->private_data, pdata, plen); + } + + /* + * Reference the mpa skb. This ensures the data area + * will remain in memory until the hw acks the tx. + * Function tx_ack() will deref it. 
+ */ + skb_get(skb); + set_arp_failure_handler(skb, arp_failure_discard); + skb->h.raw = skb->data; + len = skb->len; + req = (struct tx_data_wr *) skb_push(skb, sizeof(*req)); + req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); + req->wr_lo = htonl(V_WR_TID(ep->hwtid)); + req->len = htonl(len); + req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) | + V_TX_SNDBUF(snd_win>>15)); + req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT); + req->sndseq = htonl(ep->snd_seq); + ep->mpa_skb = skb; + state_set(&ep->com, MPA_REP_SENT); + l2t_send(ep->com.tdev, skb, ep->l2t); + return 0; +} + +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_establish *req = cplhdr(skb); + unsigned int tid = GET_TID(req); + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid); + + dst_confirm(ep->dst); + + /* setup the hwtid for this connection */ + ep->hwtid = tid; + cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid); + + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + /* dealloc the atid */ + cxgb3_free_atid(ep->com.tdev, ep->atid); + + /* start MPA negotiation */ + send_mpa_req(ep, skb); + + return 0; +} + +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb) +{ + PDBG("%s ep %p\n", __FUNCTION__, ep); + state_set(&ep->com, ABORTING); + send_abort(ep, skb, GFP_KERNEL); +} + +static void close_complete_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + if (ep->com.cm_id) { + PDBG("close complete delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void peer_close_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, 
sizeof(event)); + event.event = IW_CM_EVENT_DISCONNECT; + if (ep->com.cm_id) { + PDBG("peer close delivered ep %p cm_id %p tid %d\n", + ep, ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static void peer_abort_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CLOSE; + event.status = -ECONNRESET; + if (ep->com.cm_id) { + PDBG("abort delivered ep %p cm_id %p tid %d\n", ep, + ep->com.cm_id, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_reply_upcall(struct iwch_ep *ep, int status) +{ + struct iw_cm_event event; + + PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REPLY; + event.status = status; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + + if ((status == 0) || (status == -ECONNREFUSED)) { + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + } + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, + ep->hwtid, status); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } + if (status < 0) { + ep->com.cm_id->rem_ref(ep->com.cm_id); + ep->com.cm_id = NULL; + ep->com.qp = NULL; + } +} + +static void connect_request_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_CONNECT_REQUEST; + event.local_addr = ep->com.local_addr; + event.remote_addr = ep->com.remote_addr; + event.private_data_len = ep->plen; + event.private_data = ep->mpa_pkt + sizeof(struct mpa_message); + event.provider_data = ep; + if (state_read(&ep->parent_ep->com) != DEAD) + 
ep->parent_ep->com.cm_id->event_handler( + ep->parent_ep->com.cm_id, + &event); + free_ep(&ep->parent_ep->com); + ep->parent_ep = NULL; +} + +static void established_upcall(struct iwch_ep *ep) +{ + struct iw_cm_event event; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + memset(&event, 0, sizeof(event)); + event.event = IW_CM_EVENT_ESTABLISHED; + if (ep->com.cm_id) { + PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid); + ep->com.cm_id->event_handler(ep->com.cm_id, &event); + } +} + +static int update_rx_credits(struct iwch_ep *ep, u32 credits) +{ + struct cpl_rx_data_ack *req; + struct sk_buff *skb; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n"); + return 0; + } + + req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid)); + req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1)); + skb->priority = CPL_PRIORITY_ACK; + ep->com.tdev->send(ep->com.tdev, skb); + return credits; +} + +static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + struct iwch_qp_attributes attrs; + enum iwch_qp_attr_mask mask; + int err; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) { + return; + } + state_set(&ep->com, FPDU_MODE); + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + err = -EINVAL; + goto err; + } + + /* + * copy the new data into our accumulation buffer. 
+ */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * if we don't even have the mpa message, then bail. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) { + return; + } + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* Validate MPA header. */ + if (mpa->revision != mpa_rev) { + err = -EPROTO; + goto err; + } + if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) { + err = -EPROTO; + goto err; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + err = -EPROTO; + goto err; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + err = -EPROTO; + goto err; + } + + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) { + return; + } + + if (mpa->flags & MPA_REJECT) { + err = -ECONNREFUSED; + goto err; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. And + * the MPA header is valid. + */ + + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + /* + * Quiesce the TID here. The uP unquiesces the TID as + * part of the rdma_init operation. 
+ */ + err = migrate_tid(ep); + if (err) { + goto err; + } + + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ird; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD; + + /* bind QP and TID with INIT_WR */ + err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + if (!err) { + goto out; + } +err: + abort_connection(ep, skb); +out: + udelay(9000); + connect_reply_upcall(ep, err); + return; +} + +static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb) +{ + struct mpa_message *mpa; + u16 plen; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + /* + * Stop mpa timer. If it expired, then the state is + * CLOSING and we bail since ep_timeout already aborted + * the connection. + */ + stop_ep_timer(ep); + if (state_read(&ep->com) == CLOSING) { + return; + } + + /* + * If we get more than the supported amount of private data + * then we must fail this connection. + */ + if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) { + abort_connection(ep, skb); + return; + } + + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + + /* + * Copy the new data into our accumulation buffer. + */ + memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len); + ep->mpa_pkt_len += skb->len; + + /* + * If we don't even have the mpa message, then bail. + * We'll continue process when more data arrives. + */ + if (ep->mpa_pkt_len < sizeof(*mpa)) { + return; + } + PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__); + mpa = (struct mpa_message *) ep->mpa_pkt; + + /* + * Validate MPA Header. 
+ */ + if (mpa->revision != mpa_rev) { + abort_connection(ep, skb); + return; + } + + if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) { + abort_connection(ep, skb); + return; + } + + plen = ntohs(mpa->private_data_size); + + /* + * Fail if there's too much private data. + */ + if (plen > MPA_MAX_PRIVATE_DATA) { + abort_connection(ep, skb); + return; + } + + /* + * If plen does not account for pkt size + */ + if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) { + abort_connection(ep, skb); + return; + } + ep->plen = (u8) plen; + + /* + * If we don't have all the pdata yet, then bail. + */ + if (ep->mpa_pkt_len < (sizeof(*mpa) + plen)) { + return; + } + + /* + * If we get here we have accumulated the entire mpa + * start reply message including private data. + */ + ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0; + ep->mpa_attr.recv_marker_enabled = markers_enabled; + ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0; + ep->mpa_attr.version = mpa_rev; + PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, " + "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__, + ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled, + ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version); + + state_set(&ep->com, MPA_REQ_RCVD); + + /* drive upcall */ + connect_request_upcall(ep); + return; +} + +static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_rx_data *hdr = cplhdr(skb); + unsigned int dlen = ntohs(hdr->len); + + PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen); + + skb_pull(skb, sizeof(*hdr)); + skb_trim(skb, dlen); + + switch (state_read(&ep->com)) { + case MPA_REQ_SENT: + process_mpa_reply(ep, skb); + break; + case MPA_REQ_WAIT: + process_mpa_request(ep, skb); + break; + case MPA_REP_SENT: + break; + default: + printk(KERN_ERR MOD "%s Unexpected streaming data." 
+ " ep %p state %d tid %d\n", + __FUNCTION__, ep, state_read(&ep->com), ep->hwtid); + + /* + * The ep will time out and inform the ULP of the failure. + * See ep_timeout(). + */ + break; + } + + /* update RX credits */ + update_rx_credits(ep, dlen); + + return CPL_RET_BUF_DONE; +} + +/* + * Upcall from the adapter indicating data has been transmitted. + * For us it's just the single MPA request or reply. We can now free + * the skb holding the mpa message. + */ +static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_wr_ack *hdr = cplhdr(skb); + unsigned int credits = ntohs(hdr->credits); + enum iwch_qp_attr_mask mask; + + PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits); + + if (credits == 0) { + return CPL_RET_BUF_DONE; + } + BUG_ON(credits != 1); + BUG_ON(ep->mpa_skb == NULL); + kfree_skb(ep->mpa_skb); + ep->mpa_skb = NULL; + dst_confirm(ep->dst); + if (state_read(&ep->com) == MPA_REP_SENT) { + struct iwch_qp_attributes attrs; + + /* bind QP to EP and move to RTS */ + attrs.mpa_attr = ep->mpa_attr; + attrs.max_ird = ep->ord; + attrs.max_ord = ep->ord; + attrs.llp_stream_handle = ep; + attrs.next_state = IWCH_QP_STATE_RTS; + + /* bind QP and TID with INIT_WR */ + mask = IWCH_QP_ATTR_NEXT_STATE | + IWCH_QP_ATTR_LLP_STREAM_HANDLE | + IWCH_QP_ATTR_MPA_ATTR | + IWCH_QP_ATTR_MAX_IRD | + IWCH_QP_ATTR_MAX_ORD; + + ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, mask, &attrs, 1); + + if (!ep->com.rpl_err) { + state_set(&ep->com, FPDU_MODE); + established_upcall(ep); + } + + ep->com.rpl_done = 1; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + } + return CPL_RET_BUF_DONE; +} + +static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + close_complete_upcall(ep); + release_ep_resources(ep); + return CPL_RET_BUF_DONE; +} + +static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, 
void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_act_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + cxgb3_free_atid(ep->com.tdev, ep->atid); + connect_reply_upcall(ep, status2errno(rpl->status)); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; +} + +static int listen_start(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_pass_open_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "t3c_listen_start failed to alloc skb!\n"); + return -ENOMEM; + } + + req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid)); + req->local_port = ep->com.local_addr.sin_port; + req->local_ip = ep->com.local_addr.sin_addr.s_addr; + req->peer_port = 0; + req->peer_ip = 0; + req->peer_netmask = 0; + req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS); + req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10)); + req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK)); + + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_pass_open_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, + rpl->status, status2errno(rpl->status)); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + + return CPL_RET_BUF_DONE; +} + +static int listen_stop(struct iwch_listen_ep *ep) +{ + struct sk_buff *skb; + struct cpl_close_listserv_req *req; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb = get_skb(NULL, sizeof(*req), GFP_KERNEL); + if (!skb) { + printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__); + return -ENOMEM; + } + req = (struct cpl_close_listserv_req 
*) skb_put(skb, sizeof(*req)); + req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid)); + skb->priority = 1; + ep->com.tdev->send(ep->com.tdev, skb); + return 0; +} + +static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb, + void *ctx) +{ + struct iwch_listen_ep *ep = ctx; + struct cpl_close_listserv_rpl *rpl = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.rpl_err = status2errno(rpl->status); + ep->com.rpl_done = 1; + wake_up(&ep->com.waitq); + return CPL_RET_BUF_DONE; +} + +static void accept_cr(struct iwch_ep *ep, u32 peer_ip, struct sk_buff *skb) +{ + struct cpl_pass_accept_rpl *rpl; + unsigned int mtu_idx; + u32 opt0h, opt0l, opt2; + int wscale; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(*rpl)); + skb_get(skb); + mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst)); + wscale = compute_wscale(rcv_win); + opt0h = V_NAGLE(0) | + V_NO_CONG(nocong) | + V_KEEP_ALIVE(1) | + F_TCAM_BYPASS | + V_WND_SCALE(wscale) | + V_MSS_IDX(mtu_idx) | + V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx); + opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10); + opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0); + + rpl = cplhdr(skb); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(opt0h); + rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT); + rpl->opt2 = htonl(opt2); + rpl->rsvd = rpl->opt2; /* workaround for HW bug */ + skb->priority = CPL_PRIORITY_SETUP; + l2t_send(ep->com.tdev, skb, ep->l2t); + + return; +} + +static void reject_cr(struct t3cdev *tdev, u32 hwtid, u32 peer_ip, + struct sk_buff *skb) +{ + PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, + peer_ip); + BUG_ON(skb_cloned(skb)); + skb_trim(skb, sizeof(struct cpl_tid_release)); + 
skb_get(skb); + + if (tdev->type == T3B) + release_tid(tdev, hwtid, skb); + else { + struct cpl_pass_accept_rpl *rpl; + + rpl = cplhdr(skb); + skb->priority = CPL_PRIORITY_SETUP; + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, + hwtid)); + rpl->peer_ip = peer_ip; + rpl->opt0h = htonl(F_TCAM_BYPASS); + rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); + rpl->opt2 = 0; + rpl->rsvd = rpl->opt2; + tdev->send(tdev, skb); + } +} + +static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *child_ep, *parent_ep = ctx; + struct cpl_pass_accept_req *req = cplhdr(skb); + unsigned int hwtid = GET_TID(req); + struct dst_entry *dst; + struct l2t_entry *l2t; + struct rtable *rt; + struct iff_mac tim; + + PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid); + + if (state_read(&parent_ep->com) != LISTEN) { + printk(KERN_ERR "%s - listening ep not in LISTEN\n", + __FUNCTION__); + goto reject; + } + + /* + * Find the netdev for this connection request. 
+ */ + tim.mac_addr = req->dst_mac; + tim.vlan_tag = ntohs(req->vlan_tag); + if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { + printk(KERN_ERR + "%s bad dst mac %02x %02x %02x %02x %02x %02x\n", + __FUNCTION__, + req->dst_mac[0], + req->dst_mac[1], + req->dst_mac[2], + req->dst_mac[3], + req->dst_mac[4], + req->dst_mac[5]); + goto reject; + } + + /* Find output route */ + rt = find_route(tdev, + req->local_ip, + req->peer_ip, + req->local_port, + req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid))); + if (!rt) { + printk(KERN_ERR MOD "%s - failed to find dst entry!\n", + __FUNCTION__); + goto reject; + } + dst = &rt->u.dst; + l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port); + if (!l2t) { + printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n", + __FUNCTION__); + dst_release(dst); + goto reject; + } + child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL); + if (!child_ep) { + printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n", + __FUNCTION__); + l2t_release(L2DATA(tdev), l2t); + dst_release(dst); + goto reject; + } + state_set(&child_ep->com, CONNECTING); + child_ep->com.tdev = tdev; + child_ep->com.cm_id = NULL; + child_ep->com.local_addr.sin_family = PF_INET; + child_ep->com.local_addr.sin_port = req->local_port; + child_ep->com.local_addr.sin_addr.s_addr = req->local_ip; + child_ep->com.remote_addr.sin_family = PF_INET; + child_ep->com.remote_addr.sin_port = req->peer_port; + child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip; + ep_atomic_inc(&parent_ep->com.refcnt); + child_ep->parent_ep = parent_ep; + child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid)); + child_ep->l2t = l2t; + child_ep->dst = dst; + child_ep->hwtid = hwtid; + init_timer(&child_ep->timer); + cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid); + accept_cr(child_ep, req->peer_ip, skb); + goto out; +reject: + reject_cr(tdev, hwtid, req->peer_ip, skb); +out: + return CPL_RET_BUF_DONE; +} + +static int pass_establish(struct t3cdev *tdev, struct 
sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct cpl_pass_establish *req = cplhdr(skb); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->snd_seq = ntohl(req->snd_isn); + + set_emss(ep, ntohs(req->tcp_opt)); + + dst_confirm(ep->dst); + state_set(&ep->com, MPA_REQ_WAIT); + start_ep_timer(ep); + + return CPL_RET_BUF_DONE; +} + +static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + int ret; + int abort = 0; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + dst_confirm(ep->dst); + switch (state_read(&ep->com)) { + case MPA_REQ_WAIT: + state_set(&ep->com, CLOSING); + break; + case MPA_REQ_SENT: + state_set(&ep->com, CLOSING); + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. + */ + state_set(&ep->com, CLOSING); + ep_atomic_inc(&ep->com.refcnt); + break; + case MPA_REP_SENT: + state_set(&ep->com, CLOSING); + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case FPDU_MODE: + state_set(&ep->com, CLOSING); + peer_close_upcall(ep); + attrs.next_state = IWCH_QP_STATE_CLOSING; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD "%s - qp <- closing err!\n", + __FUNCTION__); + abort = 1; + } + break; + case ABORTING: + goto out; + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + goto out; + case MORIBUND: + stop_ep_timer(ep); + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + goto out; + case DEAD: + goto out; + default: + BUG_ON(1); + } + iwch_ep_disconnect(ep, abort, GFP_KERNEL); +out: + return 
CPL_RET_BUF_DONE; +} + +/* + * Returns whether an ABORT_REQ_RSS message is a negative advice. + */ +static inline int is_neg_adv_abort(unsigned int status) +{ + return status == CPL_ERR_RTX_NEG_ADVICE || + status == CPL_ERR_PERSIST_NEG_ADVICE; +} + +static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_abort_req_rss *req = cplhdr(skb); + struct iwch_ep *ep = ctx; + struct cpl_abort_rpl *rpl; + struct sk_buff *rpl_skb; + struct iwch_qp_attributes attrs; + int ret; + int state; + + if (is_neg_adv_abort(req->status)) { + PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, + ep->hwtid); + t3_l2t_send_event(ep->com.tdev, ep->l2t); + return CPL_RET_BUF_DONE; + } + + state = state_read(&ep->com); + PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state); + switch (state) { + case CONNECTING: + break; + case MPA_REQ_WAIT: + break; + case MPA_REQ_SENT: + connect_reply_upcall(ep, -ECONNRESET); + break; + case MPA_REP_SENT: + ep->com.rpl_done = 1; + ep->com.rpl_err = -ECONNRESET; + PDBG("waking up ep %p\n", ep); + wake_up(&ep->com.waitq); + break; + case MPA_REQ_RCVD: + + /* + * We're gonna mark this puppy DEAD, but keep + * the reference on it until the ULP accepts or + * rejects the CR. 
+ */ + ep_atomic_inc(&ep->com.refcnt); + break; + case MORIBUND: + stop_ep_timer(ep); + case FPDU_MODE: + case CLOSING: + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + ret = iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + if (ret) { + printk(KERN_ERR MOD + "%s - qp <- error failed!\n", + __FUNCTION__); + } + } + peer_abort_upcall(ep); + break; + case ABORTING: + break; + case DEAD: + PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__); + return CPL_RET_BUF_DONE; + default: + BUG_ON(1); + break; + } + dst_confirm(ep->dst); + + rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL); + if (!rpl_skb) { + printk(KERN_ERR MOD "%s - cannot allocate skb!\n", + __FUNCTION__); + dst_release(ep->dst); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + free_ep(&ep->com); + return CPL_RET_BUF_DONE; + } + rpl_skb->priority = CPL_PRIORITY_DATA; + rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl)); + rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); + rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid)); + OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid)); + rpl->cmd = CPL_ABORT_NO_RST; + ep->com.tdev->send(ep->com.tdev, rpl_skb); + if (state != ABORTING) { + release_ep_resources(ep); + } + return CPL_RET_BUF_DONE; +} + +static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + BUG_ON(!ep); + + /* The cm_id may be null if we failed to connect */ + switch (state_read(&ep->com)) { + case CLOSING: + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + break; + case MORIBUND: + stop_ep_timer(ep); + if ((ep->com.cm_id) && (ep->com.qp)) { + attrs.next_state = IWCH_QP_STATE_IDLE; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, + IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + close_complete_upcall(ep); + release_ep_resources(ep); + break; + case DEAD: + 
default: + BUG_ON(1); + break; + } + + return CPL_RET_BUF_DONE; +} + +/* + * T3A does 3 things when a TERM is received: + * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet + * 2) generate an async event on the QP with the TERMINATE opcode + * 3) post a TERMINATE opcode cqe into the associated CQ. + * + * For (1), we save the message in the qp for later consumer consumption. + * For (2), we move the QP into TERMINATE, post a QP event and disconnect. + * For (3), we toss the CQE in cxio_poll_cq(). + * + * terminate() handles case (1)... + */ +static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p\n", __FUNCTION__, ep); + skb_pull(skb, sizeof(struct cpl_rdma_terminate)); + PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len); + memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len); + ep->com.qp->attr.terminate_msg_len = skb->len; + ep->com.qp->attr.is_terminate_local = 0; + return CPL_RET_BUF_DONE; +} + +static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct cpl_rdma_ec_status *rep = cplhdr(skb); + struct iwch_ep *ep = ctx; + + PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, + rep->status); + if (rep->status) { + struct iwch_qp_attributes attrs; + + printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n", + __FUNCTION__, ep->hwtid); + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + abort_connection(ep, NULL); + } + return CPL_RET_BUF_DONE; +} + +static void ep_timeout(unsigned long arg) +{ + struct iwch_ep *ep = (struct iwch_ep *)arg; + struct iwch_qp_attributes attrs; + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) { + struct sk_buff *skb; + + connect_reply_upcall(ep, -ETIMEDOUT); + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + 
abort_connection(ep, skb); + } + } + if (state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) { + struct sk_buff *skb; + + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, skb); + } + } + if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) { + struct sk_buff *skb; + + if (ep->com.cm_id && ep->com.qp) { + attrs.next_state = IWCH_QP_STATE_ERROR; + iwch_modify_qp(ep->com.qp->rhp, + ep->com.qp, IWCH_QP_ATTR_NEXT_STATE, + &attrs, 1); + } + skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC); + if (skb) { + abort_connection(ep, skb); + } + } + free_ep(&ep->com); +} + +int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + + if (state_read(&ep->com) == DEAD) { + free_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + state_set(&ep->com, CLOSING); + if (mpa_rev == 0) { + abort_connection(ep, NULL); + } else { + err = send_mpa_reject(ep, pdata, pdata_len); + err = send_halfclose(ep, GFP_KERNEL); + } + return 0; +} + +int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err; + struct iwch_ep *ep = to_ep(cm_id); + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_qp *qp = get_qhp(h, conn_param->qpn); + + PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid); + if (state_read(&ep->com) == DEAD) { + free_ep(&ep->com); + return -ECONNRESET; + } + BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD); + + /* + * Quiesce the TID here. The uP unquiesces the TID as + * part of the rdma_init operation. 
+ */ + err = migrate_tid(ep); + if (err) { + abort_connection(ep, NULL); + return err; + } + + BUG_ON(!qp); + if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) || + (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) { + abort_connection(ep, NULL); + return -EINVAL; + } + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = qp; + + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord); + ep_atomic_inc(&ep->com.refcnt); + err = send_mpa_reply(ep, conn_param->private_data, + conn_param->private_data_len); + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + free_ep(&ep->com); + return err; + } + + /* wait until the MPA is transmitted. */ + PDBG("sleeping on ep %p\n", ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + PDBG("awakened on ep %p\n", ep); + + err = ep->com.rpl_err; + if (err) { + ep->com.cm_id = NULL; + ep->com.qp = NULL; + cm_id->rem_ref(cm_id); + abort_connection(ep, NULL); + } + free_ep(&ep->com); + return err; +} + +int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_ep *ep; + struct rtable *rt; + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto out; + } + init_timer(&ep->timer); + ep->plen = conn_param->private_data_len; + if (ep->plen) { + memcpy(ep->mpa_pkt + sizeof(struct mpa_message), + conn_param->private_data, ep->plen); + } + ep->ird = conn_param->ird; + ep->ord = conn_param->ord; + ep->com.tdev = h->rdev.t3cdev_p; + + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->com.qp = get_qhp(h, conn_param->qpn); + BUG_ON(!ep->com.qp); + PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, + ep->com.qp, cm_id); + + /* + * 
Allocate an active TID to initiate a TCP connection. + */ + ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->atid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + /* find a route */ + rt = find_route(h->rdev.t3cdev_p, + cm_id->local_addr.sin_addr.s_addr, + cm_id->remote_addr.sin_addr.s_addr, + cm_id->local_addr.sin_port, + cm_id->remote_addr.sin_port, IPTOS_LOWDELAY); + if (!rt) { + printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__); + err = -EHOSTUNREACH; + goto fail3; + } + ep->dst = &rt->u.dst; + + /* get a l2t entry */ + ep->l2t = t3_l2t_get(ep->com.tdev, + ep->dst->neighbour, + ep->dst->neighbour->dev->if_port); + if (!ep->l2t) { + printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__); + err = -ENOMEM; + goto fail4; + } + + state_set(&ep->com, CONNECTING); + ep->tos = IPTOS_LOWDELAY; + ep->com.local_addr = cm_id->local_addr; + ep->com.remote_addr = cm_id->remote_addr; + + /* send connect request to rnic */ + err = send_connect(ep); + if (!err) { + goto out; + } + + l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t); +fail4: + dst_release(ep->dst); +fail3: + cxgb3_free_atid(ep->com.tdev, ep->atid); +fail2: + free_ep(&ep->com); +out: + return err; +} + +int iwch_create_listen(struct iw_cm_id *cm_id, int backlog) +{ + int err = 0; + struct iwch_dev *h = to_iwch_dev(cm_id->device); + struct iwch_listen_ep *ep; + + + might_sleep(); + + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); + if (!ep) { + printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__); + err = -ENOMEM; + goto fail1; + } + PDBG("%s ep %p\n", __FUNCTION__, ep); + ep->com.tdev = h->rdev.t3cdev_p; + cm_id->add_ref(cm_id); + ep->com.cm_id = cm_id; + ep->backlog = backlog; + ep->com.local_addr = cm_id->local_addr; + + /* + * Allocate a server TID. 
+ */ + ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep); + if (ep->stid == -1) { + printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__); + err = -ENOMEM; + goto fail2; + } + + state_set(&ep->com, LISTEN); + err = listen_start(ep); + if (err) { + goto fail3; + } + + /* wait for pass_open_rpl */ + wait_event(ep->com.waitq, ep->com.rpl_done); + err = ep->com.rpl_err; + if (!err) { + cm_id->provider_data = ep; + goto out; + } +fail3: + cxgb3_free_stid(ep->com.tdev, ep->stid); +fail2: + free_ep(&ep->com); +fail1: +out: + return err; +} + +int iwch_destroy_listen(struct iw_cm_id *cm_id) +{ + int err; + struct iwch_listen_ep *ep = to_listen_ep(cm_id); + + PDBG("%s ep %p\n", __FUNCTION__, ep); + + might_sleep(); + state_set(&ep->com, DEAD); + ep->com.rpl_done = 0; + ep->com.rpl_err = 0; + err = listen_stop(ep); + wait_event(ep->com.waitq, ep->com.rpl_done); + cxgb3_free_stid(ep->com.tdev, ep->stid); + err = ep->com.rpl_err; + cm_id->rem_ref(cm_id); + free_ep(&ep->com); + return err; +} + +int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp) +{ + int ret=0; + int state; + + + state = state_read(&ep->com); + PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, + states[state], abrupt); + if (state == DEAD) { + PDBG("%s already dead ep %p\n", __FUNCTION__, ep); + return 0; + } + if (abrupt) { + if (state != ABORTING) { + state_set(&ep->com, ABORTING); + ret = send_abort(ep, NULL, gfp); + } + } else { + + if (state != CLOSING) { + state_set(&ep->com, CLOSING); + } else { + start_ep_timer(ep); + state_set(&ep->com, MORIBUND); + } + + ret = send_halfclose(ep, gfp); + } + return ret; +} + +int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, + struct l2t_entry *l2t) +{ + struct iwch_ep *ep = ctx; + + if (ep->dst != old) + return 0; + + PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, + l2t); + dst_hold(new); + l2t_release(L2DATA(ep->com.tdev), ep->l2t); + ep->l2t = l2t; + dst_release(old); + 
ep->dst = new; + return 1; +} + +/* + * All the CM events are handled on a work queue to have a safe context. + */ +static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx) +{ + struct iwch_ep_common *epc = ctx; + + ep_atomic_inc(&epc->refcnt); + + /* + * Save ctx and tdev in the skb->cb area. + */ + *((void **) skb->cb) = ctx; + *((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev; + + /* + * Queue the skb and schedule the worker thread. + */ + skb_queue_tail(&rxq, skb); + queue_work(workq, &skb_work); + return 0; +} + +int __init iwch_cm_init(void) +{ + skb_queue_head_init(&rxq); + + workq = create_singlethread_workqueue("iw_cxgb3"); + if (!workq) + return -ENOMEM; + + /* + * All upcalls from the T3 Core go to sched() to + * schedule the processing on a work queue. + */ + t3c_handlers[CPL_ACT_ESTABLISH] = sched; + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; + t3c_handlers[CPL_RX_DATA] = sched; + t3c_handlers[CPL_TX_DMA_ACK] = sched; + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; + t3c_handlers[CPL_ABORT_RPL] = sched; + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; + t3c_handlers[CPL_PASS_ESTABLISH] = sched; + t3c_handlers[CPL_PEER_CLOSE] = sched; + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; + t3c_handlers[CPL_RDMA_TERMINATE] = sched; + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; + + /* + * These are the real handlers that are called from a + * work queue. 
+ */ + work_handlers[CPL_ACT_ESTABLISH] = act_establish; + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; + work_handlers[CPL_RX_DATA] = rx_data; + work_handlers[CPL_TX_DMA_ACK] = tx_ack; + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; + work_handlers[CPL_ABORT_RPL] = abort_rpl; + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; + work_handlers[CPL_PEER_CLOSE] = peer_close; + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; + work_handlers[CPL_RDMA_TERMINATE] = terminate; + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; + return 0; +} + +void __exit iwch_cm_term(void) +{ + flush_workqueue(workq); + destroy_workqueue(workq); +} diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h new file mode 100644 index 0000000..cd3e36e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h @@ -0,0 +1,223 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef _IWCH_CM_H_ +#define _IWCH_CM_H_ + +#include +#include +#include + +#include +#include + +#include "cxgb3_offload.h" +#include "iwch_provider.h" + +#define MPA_KEY_REQ "MPA ID Req Frame" +#define MPA_KEY_REP "MPA ID Rep Frame" + +#define MPA_MAX_PRIVATE_DATA 256 +#define MPA_REV 0 /* XXX - amso1100 uses rev 0 ! 
*/ +#define MPA_REJECT 0x20 +#define MPA_CRC 0x40 +#define MPA_MARKERS 0x80 +#define MPA_FLAGS_MASK 0xE0 + +#define free_ep(A) { \ + PDBG("%s %d: Calling __free_ep\n",__FUNCTION__, __LINE__); \ + __free_ep(A); \ +} + +#define ep_atomic_inc(A) { \ + PDBG("%s %u: ep_atomic_inc A %p, refcnt %d\n", \ + __FUNCTION__, \ + __LINE__, A, \ + atomic_read(A)); \ + atomic_inc(A); \ +} + +struct mpa_message { + u8 key[16]; + u8 flags; + u8 revision; + u16 private_data_size; + u8 private_data[0]; +}; + +struct terminate_message { + u8 layer_etype; + u8 ecode; + u16 hdrct_rsvd; + u8 len_hdrs[0]; +}; + +#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28) + +enum iwch_layers_types { + LAYER_RDMAP = 0x00, + LAYER_DDP = 0x10, + LAYER_MPA = 0x20, + RDMAP_LOCAL_CATA = 0x00, + RDMAP_REMOTE_PROT = 0x01, + RDMAP_REMOTE_OP = 0x02, + DDP_LOCAL_CATA = 0x00, + DDP_TAGGED_ERR = 0x01, + DDP_UNTAGGED_ERR = 0x02, + DDP_LLP = 0x03 +}; + +enum iwch_rdma_ecodes { + RDMAP_INV_STAG = 0x00, + RDMAP_BASE_BOUNDS = 0x01, + RDMAP_ACC_VIOL = 0x02, + RDMAP_STAG_NOT_ASSOC = 0x03, + RDMAP_TO_WRAP = 0x04, + RDMAP_INV_VERS = 0x05, + RDMAP_INV_OPCODE = 0x06, + RDMAP_STREAM_CATA = 0x07, + RDMAP_GLOBAL_CATA = 0x08, + RDMAP_CANT_INV_STAG = 0x09, + RDMAP_UNSPECIFIED = 0xff +}; + +enum iwch_ddp_ecodes { + DDPT_INV_STAG = 0x00, + DDPT_BASE_BOUNDS = 0x01, + DDPT_STAG_NOT_ASSOC = 0x02, + DDPT_TO_WRAP = 0x03, + DDPT_INV_VERS = 0x04, + DDPU_INV_QN = 0x01, + DDPU_INV_MSN_NOBUF = 0x02, + DDPU_INV_MSN_RANGE = 0x03, + DDPU_INV_MO = 0x04, + DDPU_MSG_TOOBIG = 0x05, + DDPU_INV_VERS = 0x06 +}; + +enum iwch_mpa_ecodes { + MPA_CRC_ERR = 0x02, + MPA_MARKER_ERR = 0x03 +}; + +enum iwch_ep_state { + IDLE = 0, + LISTEN, + CONNECTING, + MPA_REQ_WAIT, + MPA_REQ_SENT, + MPA_REQ_RCVD, + MPA_REP_SENT, + FPDU_MODE, + ABORTING, + CLOSING, + MORIBUND, + DEAD, +}; + +struct iwch_ep_common { + struct iw_cm_id *cm_id; + struct iwch_qp *qp; + struct t3cdev *tdev; + enum iwch_ep_state state; + atomic_t refcnt; + spinlock_t lock; 
+ struct sockaddr_in local_addr; + struct sockaddr_in remote_addr; + wait_queue_head_t waitq; + int rpl_done; + int rpl_err; +}; + +struct iwch_listen_ep { + struct iwch_ep_common com; + unsigned int stid; + int backlog; +}; + +struct iwch_ep { + struct iwch_ep_common com; + struct iwch_ep *parent_ep; + struct timer_list timer; + unsigned int atid; + u32 hwtid; + u32 snd_seq; + struct l2t_entry *l2t; + struct dst_entry *dst; + struct sk_buff *mpa_skb; + struct iwch_mpa_attributes mpa_attr; + unsigned int mpa_pkt_len; + u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA]; + u8 tos; + u16 emss; + u16 plen; + u32 ird; + u32 ord; +}; + +static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_ep *)cm_id->provider_data; +} + +static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id) +{ + return (struct iwch_listen_ep *)cm_id->provider_data; +} + +static inline int compute_wscale(int win) +{ + int wscale = 0; + + while (wscale < 14 && (65535< References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035857.22635.96296.stgit@dell3.ogc.int> Functions to manipulate CQs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cq.c | 231 +++++++++++++++++++++++++++++++++ 1 files changed, 231 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c new file mode 100644 index 0000000..aa5c0f6 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c @@ -0,0 +1,231 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" + +/* + * Get one cq entry from cxio and map it to openib. 
+ * + * Returns: + * 0 EMPTY; + * 1 cqe returned + * -EAGAIN caller must try again + * any other -errno fatal error + */ +int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp, + struct ib_wc *wc) +{ + struct iwch_qp *qhp = NULL; + struct t3_cqe cqe, *rd_cqe; + struct t3_wq *wq; + u32 credit = 0; + u8 cqe_flushed; + u64 cookie; + int ret = 1; + + rd_cqe = cxio_next_cqe(&chp->cq); + + if (!rd_cqe) + return 0; + + qhp = get_qhp(rhp, CQE_QPID(*rd_cqe)); + if (!qhp) + wq = NULL; + else { + spin_lock(&qhp->lock); + wq = &(qhp->wq); + } + ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie, + &credit); + if (t3a_device(chp->rhp) && credit) { + PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, + credit, chp->cq.cqid); + cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit); + } + + if (ret) { + ret = -EAGAIN; + goto out; + } + ret = 1; + + wc->wr_id = cookie; + wc->qp_num = qhp->wq.qpid; + wc->vendor_err = CQE_STATUS(cqe); + + PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x " + "lo 0x%x cookie 0x%llx\n", __FUNCTION__, + CQE_QPID(cqe), CQE_TYPE(cqe), + CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe), + CQE_WRID_LOW(cqe), cookie); + + if (CQE_TYPE(cqe) == 0) { + if (!CQE_STATUS(cqe)) + wc->byte_len = CQE_LEN(cqe); + else + wc->byte_len = 0; + wc->opcode = IB_WC_RECV; + } else { + switch (CQE_OPCODE(cqe)) { + case T3_RDMA_WRITE: + wc->opcode = IB_WC_RDMA_WRITE; + break; + case T3_READ_REQ: + wc->opcode = IB_WC_RDMA_READ; + wc->byte_len = CQE_LEN(cqe); + break; + case T3_SEND: + case T3_SEND_WITH_SE: + wc->opcode = IB_WC_SEND; + break; + case T3_BIND_MW: + wc->opcode = IB_WC_BIND_MW; + break; + + /* these aren't supported yet */ + case T3_SEND_WITH_INV: + case T3_SEND_WITH_SE_INV: + case T3_LOCAL_INV: + case T3_FAST_REGISTER: + default: + printk(KERN_ERR MOD "Unexpected opcode %d " + "in the CQE received for QPID=0x%0x\n", + CQE_OPCODE(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + goto out; + } + } + + if (cqe_flushed) { + 
wc->status = IB_WC_WR_FLUSH_ERR; + } else { + + switch (CQE_STATUS(cqe)) { + case TPT_ERR_SUCCESS: + wc->status = IB_WC_SUCCESS; + break; + case TPT_ERR_STAG: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_PDID: + wc->status = IB_WC_LOC_PROT_ERR; + break; + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + wc->status = IB_WC_LOC_ACCESS_ERR; + break; + case TPT_ERR_WRAP: + wc->status = IB_WC_GENERAL_ERR; + break; + case TPT_ERR_BOUND: + wc->status = IB_WC_LOC_LEN_ERR; + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + wc->status = IB_WC_MW_BIND_ERR; + break; + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + case TPT_ERR_OPCODE: + wc->status = IB_WC_FATAL_ERR; + break; + case TPT_ERR_SWFLUSH: + wc->status = IB_WC_WR_FLUSH_ERR; + break; + default: + printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for " + "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe)); + ret = -EINVAL; + } + } +out: + if (wq) + spin_unlock(&qhp->lock); + return ret; +} + +int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc) +{ + struct iwch_dev *rhp; + struct iwch_cq *chp; + unsigned long flags; + int npolled; + int err = 0; + + chp = to_iwch_cq(ibcq); + rhp = chp->rhp; + + spin_lock_irqsave(&chp->lock, flags); + for (npolled = 0; npolled < num_entries; ++npolled) { +#ifdef DEBUG + int i=0; +#endif + + /* + * Because T3 can post CQEs that are _not_ associated + * with a WR, we might have to poll again after removing + * one of these. 
+ */ + do { + err = iwch_poll_cq_one(rhp, chp, wc + npolled); +#ifdef DEBUG + BUG_ON(++i > 1000); +#endif + } while (err == -EAGAIN); + if (err <= 0) + break; + } + spin_unlock_irqrestore(&chp->lock, flags); + + if (err < 0) + return err; + else { + return npolled; + } +} + +int iwch_modify_cq(struct ib_cq *cq, int cqe) +{ + PDBG("iwch_modify_cq: TBD\n"); + return 0; +} From swise at opengridcomputing.com Wed Nov 15 19:58:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:58:52 -0600 Subject: [openib-general] [PATCH 05/13] Queue Pairs In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035852.22635.9497.stgit@dell3.ogc.int> Code to manipulate the QP. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 996 +++++++++++++++++++++++++++++++++ 1 files changed, 996 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c new file mode 100644 index 0000000..dc1c55e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -0,0 +1,996 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" + +#define NO_SUPPORT -1 + +static inline int iwch_build_rdma_send(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 * flit_cnt) +{ + int i; + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + if (wr->send_flags & IB_SEND_SOLICITED) + wqe->send.rdmaop = T3_SEND_WITH_SE; + else + wqe->send.rdmaop = T3_SEND; + wqe->send.rem_stag = 0; + break; +#if 0 /* Not currently supported */ + case TYPE_SEND_INVALIDATE: + case TYPE_SEND_INVALIDATE_IMMEDIATE: + wqe->send.rdmaop = T3_SEND_WITH_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; + case TYPE_SEND_SE_INVALIDATE: + wqe->send.rdmaop = T3_SEND_WITH_SE_INV; + wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + break; +#endif + default: + break; + } + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->send.reserved = 0; + if (wr->opcode == IB_WR_SEND_WITH_IMM) { + wqe->send.plen = 4; + wqe->send.sgl[0].stag = wr->imm_data; + wqe->send.sgl[0].len = 0; + wqe->send.num_sgle = 0; + *flit_cnt = 5; + } else { + wqe->send.plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((wqe->send.plen + wr->sg_list[i].length) < + wqe->send.plen) { + return -EMSGSIZE; + } + wqe->send.plen += 
wr->sg_list[i].length; + wqe->send.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->send.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr); + } + wqe->send.plen = cpu_to_be32(wqe->send.plen); + wqe->send.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 4 + ((wr->num_sge) << 1); + } + return 0; +} + +static inline int iwch_build_rdma_write(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 *flit_cnt) +{ + int i; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + wqe->write.rdmaop = T3_RDMA_WRITE; + wqe->write.reserved = 0; + wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey); + wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr); + + wqe->write.num_sgle = wr->num_sge; + + if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) { + wqe->write.plen = cpu_to_be32(4); + wqe->write.sgl[0].stag = cpu_to_be32(wr->imm_data); + wqe->write.sgl[0].len = 0; + wqe->write.num_sgle = 0; + *flit_cnt = 6; + } else { + wqe->write.plen = 0; + for (i = 0; i < wr->num_sge; i++) { + if ((wqe->write.plen + wr->sg_list[i].length) < + wqe->write.plen) { + return -EMSGSIZE; + } + wqe->write.plen += wr->sg_list[i].length; + wqe->write.sgl[i].stag = + cpu_to_be32(wr->sg_list[i].lkey); + wqe->write.sgl[i].len = + cpu_to_be32(wr->sg_list[i].length); + wqe->write.sgl[i].to = + cpu_to_be64(wr->sg_list[i].addr); + } + wqe->write.plen = cpu_to_be32(wqe->write.plen); + wqe->write.num_sgle = cpu_to_be32(wr->num_sge); + *flit_cnt = 5 + ((wr->num_sge) << 1); + } + return 0; +} + +static inline int iwch_build_rdma_read(union t3_wr *wqe, + struct ib_send_wr *wr, + u8 *flit_cnt) +{ + if (wr->num_sge > 1) + return -EINVAL; + wqe->read.rdmaop = T3_READ_REQ; + wqe->read.reserved = 0; + wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey); + wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr); + wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey); + wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length); + wqe->read.local_to = 
cpu_to_be64(wr->sg_list[0].addr); + *flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3; + return 0; +} + +/* + * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now. + */ +static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp, + struct ib_sge *sg_list, u32 num_sgle, + u32 * pbl_addr, u8 * page_size) +{ + int i; + struct iwch_mr *mhp; + u32 offset; + for (i = 0; i < num_sgle; i++) { + + mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8); + if (!mhp) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (!mhp->attr.state) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + if (mhp->attr.zbva) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EIO; + } + + if (sg_list[i].addr < mhp->attr.va_fbo) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) < + sg_list[i].addr) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + if (sg_list[i].addr + ((u64) sg_list[i].length) > + mhp->attr.va_fbo + ((u64) mhp->attr.len)) { + PDBG("%s %d\n", __FUNCTION__, __LINE__); + return -EINVAL; + } + offset = sg_list[i].addr - mhp->attr.va_fbo; + offset += ((u32) mhp->attr.va_fbo) % + (1UL << (12 + mhp->attr.page_size)); + pbl_addr[i] = ((mhp->attr.pbl_addr - + rhp->rdev.rnic_info.pbl_base) >> 3) + + (offset >> (12 + mhp->attr.page_size)); + page_size[i] = mhp->attr.page_size; + } + return 0; +} + +static inline int iwch_build_rdma_recv(struct iwch_dev *rhp, + union t3_wr *wqe, + struct ib_recv_wr *wr) +{ + int i, err = 0; + u32 pbl_addr[4]; + u8 page_size[4]; + if (wr->num_sge > T3_MAX_SGE) + return -EINVAL; + err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, + page_size); + if (err) + return err; + wqe->recv.pagesz[0] = page_size[0]; + wqe->recv.pagesz[1] = page_size[1]; + wqe->recv.pagesz[2] = page_size[2]; + wqe->recv.pagesz[3] = page_size[3]; + wqe->recv.num_sgle = cpu_to_be32(wr->num_sge); + for (i = 0; i < wr->num_sge; i++) { + 
wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey); + wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length); + + /* to in the WQE == the offset into the page */ + wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) % + (1UL << (12 + page_size[i]))); + + /* pbl_addr is the adapters address in the PBL */ + wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]); + } + for (; i < T3_MAX_SGE; i++) { + wqe->recv.sgl[i].stag = 0; + wqe->recv.sgl[i].len = 0; + wqe->recv.sgl[i].to = 0; + wqe->recv.pbl_addr[i] = 0; + } + return 0; +} + +int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, + struct ib_send_wr **bad_wr) +{ + int err = 0; + u8 t3_wr_flit_cnt; + enum t3_wr_opcode t3_wr_opcode = 0; + enum t3_wr_flags t3_wr_flags; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + int flag; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if (num_wrs <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + while (wr) { + if (num_wrs == 0) { + err = -ENOMEM; + *bad_wr = wr; + break; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + t3_wr_flags = 0; + if (wr->send_flags & IB_SEND_SOLICITED) + t3_wr_flags |= T3_SOLICITED_EVENT_FLAG; + if (wr->send_flags & IB_SEND_FENCE) + t3_wr_flags |= T3_READ_FENCE_FLAG; + if (wr->send_flags & IB_SEND_SIGNALED) + t3_wr_flags |= T3_COMPLETION_FLAG; + sqp = qhp->wq.sq + + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + switch (wr->opcode) { + case IB_WR_SEND: + case IB_WR_SEND_WITH_IMM: + t3_wr_opcode = T3_WR_SEND; + err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_WRITE_WITH_IMM: + t3_wr_opcode = T3_WR_WRITE; + err = iwch_build_rdma_write(wqe, 
wr, &t3_wr_flit_cnt); + break; + case IB_WR_RDMA_READ: + t3_wr_opcode = T3_WR_READ; + t3_wr_flags = 0; /* T3 reads are always signaled */ + err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt); + if (err) + break; + sqp->read_len = wqe->read.local_len; + if (!qhp->wq.oldest_read) + qhp->wq.oldest_read = sqp; + break; + default: + PDBG("%s post of type=%d TBD!\n", __FUNCTION__, + wr->opcode); + err = -EINVAL; + } + if (err) { + *bad_wr = wr; + break; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp->wr_id = wr->wr_id; + sqp->opcode = wr2opcode(t3_wr_opcode); + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED); + + build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, t3_wr_flit_cnt); + PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", + __FUNCTION__, wr->wr_id, idx, + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2), + sqp->opcode); + wr = wr->next; + num_wrs--; + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + } + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr) +{ + int err = 0; + struct iwch_qp *qhp; + u32 idx; + union t3_wr *wqe; + u32 num_wrs; + int flag; + + qhp = to_iwch_qp(ibqp); + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, + qhp->wq.rq_size_log2) - 1; + if (!wr) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + while (wr) { + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + if (num_wrs) + err = iwch_build_rdma_recv(qhp->rhp, wqe, wr); + else + err = -ENOMEM; + if (err) { + *bad_wr = wr; + break; + } + qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, 
qhp->wq.rq_size_log2)] = + wr->wr_id; + build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), + 0, sizeof(struct t3_receive_wr) >> 3); + PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x " + "wqe %p \n", __FUNCTION__, wr->wr_id, idx, + qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe); + ++(qhp->wq.rq_wptr); + ++(qhp->wq.wptr); + wr = wr->next; + num_wrs--; + } + spin_unlock_irqrestore(&qhp->lock, flag); + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + return err; +} + +int iwch_bind_mw(struct ib_qp *qp, + struct ib_mw *mw, + struct ib_mw_bind *mw_bind) +{ + struct iwch_dev *rhp; + struct iwch_mw *mhp; + struct iwch_qp *qhp; + union t3_wr *wqe; + u32 pbl_addr; + u8 page_size; + u32 num_wrs; + int flag; + struct ib_sge sgl; + int err=0; + enum t3_wr_flags t3_wr_flags; + u32 idx; + struct t3_swsq *sqp; + + qhp = to_iwch_qp(qp); + mhp = to_iwch_mw(mw); + rhp = qhp->rhp; + + spin_lock_irqsave(&qhp->lock, flag); + if (qhp->attr.state > IWCH_QP_STATE_RTS) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -EINVAL; + } + num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, + qhp->wq.sq_size_log2); + if ((num_wrs) <= 0) { + spin_unlock_irqrestore(&qhp->lock, flag); + return -ENOMEM; + } + idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2); + PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, + mw, mw_bind); + wqe = (union t3_wr *) (qhp->wq.queue + idx); + + t3_wr_flags = 0; + if (mw_bind->send_flags & IB_SEND_SIGNALED) + t3_wr_flags = T3_COMPLETION_FLAG; + + sgl.addr = mw_bind->addr; + sgl.lkey = mw_bind->mr->lkey; + sgl.length = mw_bind->length; + wqe->bind.reserved = 0; + wqe->bind.type = T3_VA_BASED_TO; + + /* TBD: check perms */ + wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags); + wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey); + wqe->bind.mw_stag = cpu_to_be32(mw->rkey); + wqe->bind.mw_len = cpu_to_be32(mw_bind->length); + wqe->bind.mw_va = cpu_to_be64(mw_bind->addr); + err = 
iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size); + if (err) { + spin_unlock_irqrestore(&qhp->lock, flag); + return err; + } + wqe->send.wrid.id0.hi = qhp->wq.sq_wptr; + sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2); + sqp->wr_id = mw_bind->wr_id; + sqp->opcode = T3_BIND_MW; + sqp->sq_wptr = qhp->wq.sq_wptr; + sqp->complete = 0; + sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED); + wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr); + wqe->bind.mr_pagesz = page_size; + wqe->bind.reserved2 = 0; + wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id; + build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags, + Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, + sizeof(struct t3_bind_mw_wr) >> 3); + ++(qhp->wq.wptr); + ++(qhp->wq.sq_wptr); + spin_unlock_irqrestore(&qhp->lock, flag); + + RING_DOORBELL(qhp->wq.doorbell, qhp->wq.qpid); + + return err; +} + +static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode, + int tagged) +{ + switch (t3err) { + case TPT_ERR_STAG: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_STAG; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_INV_STAG; + } + break; + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_STAG_NOT_ASSOC; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_STAG_NOT_ASSOC; + } + break; + case TPT_ERR_WRAP: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_TO_WRAP; + break; + case TPT_ERR_BOUND: + if (tagged == 1) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + } else if (tagged == 2) { + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT; + *ecode = RDMAP_BASE_BOUNDS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + } + break; + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + 
*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_CANT_INV_STAG; + break; + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + *layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_OUT_OF_RQE: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_NOBUF; + break; + case TPT_ERR_PBL_ADDR_BOUND: + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_BASE_BOUNDS; + break; + case TPT_ERR_CRC: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_CRC_ERR; + break; + case TPT_ERR_MARKER: + *layer_type = LAYER_MPA|DDP_LLP; + *ecode = MPA_MARKER_ERR; + break; + case TPT_ERR_PDU_LEN_ERR: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_MSG_TOOBIG; + break; + case TPT_ERR_DDP_VERSION: + if (tagged) { + *layer_type = LAYER_DDP|DDP_TAGGED_ERR; + *ecode = DDPT_INV_VERS; + } else { + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_VERS; + } + break; + case TPT_ERR_RDMA_VERSION: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_VERS; + break; + case TPT_ERR_OPCODE: + *layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP; + *ecode = RDMAP_INV_OPCODE; + break; + case TPT_ERR_DDP_QUEUE_NUM: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_QN; + break; + case TPT_ERR_MSN: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_IRD_OVERFLOW: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MSN_RANGE; + break; + case TPT_ERR_TBIT: + *layer_type = LAYER_DDP|DDP_LOCAL_CATA; + *ecode = 0; + break; + case TPT_ERR_MO: + *layer_type = LAYER_DDP|DDP_UNTAGGED_ERR; + *ecode = DDPU_INV_MO; + break; + default: + *layer_type = LAYER_RDMAP|DDP_LOCAL_CATA; + *ecode = 0; + break; + } +} + +/* + * This posts a TERMINATE with layer=RDMA, type=catastrophic. 
+ */ +int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg) +{ + union t3_wr *wqe; + struct terminate_message *term; + int status; + int tagged = 0; + struct sk_buff *skb; + + PDBG("%s %d\n", __FUNCTION__, __LINE__); + skb = alloc_skb(40, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__); + return -ENOMEM; + } + wqe = (union t3_wr *)skb_put(skb, 40); + memset(wqe, 0, 40); + wqe->send.rdmaop = T3_TERMINATE; + + /* immediate data length */ + wqe->send.plen = htonl(4); + + /* immediate data starts here. */ + term = (struct terminate_message *)wqe->send.sgl; + if (rsp_msg) { + status = CQE_STATUS(rsp_msg->cqe); + if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE) + tagged = 1; + if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) || + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) + tagged = 2; + } else { + status = TPT_ERR_INTERNAL_ERR; + } + build_term_codes(status, &term->layer_etype, &term->ecode, tagged); + build_fw_riwrh((void *)wqe, T3_WR_SEND, + T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, + qhp->ep->hwtid, 5); + skb->priority = CPL_PRIORITY_DATA; + return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb)); +} + +/* + * Assumes qhp lock is held. + */ +static void __flush_qp(struct iwch_qp *qhp, int *flag) +{ + struct iwch_cq *rchp, *schp; + int count; + + rchp = qhp->rhp->cqid2ptr[qhp->attr.rcq]; + schp = qhp->rhp->cqid2ptr[qhp->attr.scq]; + + PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp); + /* take a ref on the qhp since we must release the lock */ + atomic_inc(&qhp->refcnt); + spin_unlock_irqrestore(&qhp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. */ + spin_lock_irqsave(&rchp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&rchp->cq); + cxio_count_rcqes(&rchp->cq, &qhp->wq, &count); + cxio_flush_rq(&qhp->wq, &rchp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&rchp->lock, *flag); + + /* locking hierarchy: cq lock first, then qp lock. 
*/ + spin_lock_irqsave(&schp->lock, *flag); + spin_lock(&qhp->lock); + cxio_flush_hw_cq(&schp->cq); + cxio_count_scqes(&schp->cq, &qhp->wq, &count); + cxio_flush_sq(&qhp->wq, &schp->cq, count); + spin_unlock(&qhp->lock); + spin_unlock_irqrestore(&schp->lock, *flag); + + /* deref */ + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); + + spin_lock_irqsave(&qhp->lock, *flag); +} + +static inline void flush_qp(struct iwch_qp *qhp, int *flag) +{ + if (t3b_device(qhp->rhp)) + cxio_set_wq_in_error(&qhp->wq); + else + __flush_qp(qhp, flag); +} + +static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs) +{ + struct t3_rdma_init_attr init_attr; + int ret; + + init_attr.tid = qhp->ep->hwtid; + init_attr.qpid = qhp->wq.qpid; + init_attr.pdid = qhp->attr.pd; + init_attr.scqid = qhp->attr.scq; + init_attr.rcqid = qhp->attr.rcq; + init_attr.rq_addr = qhp->wq.rq_addr; + init_attr.rq_size = 1 << qhp->wq.rq_size_log2; + PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d\n", + __FUNCTION__, init_attr.rq_addr, init_attr.rq_size); + init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | + qhp->attr.mpa_attr.recv_marker_enabled | + (qhp->attr.mpa_attr.xmit_marker_enabled << 1) | + (qhp->attr.mpa_attr.crc_enabled << 2); + + /* + * XXX - The IWCM doesn't quite handle getting these + * attrs set before going into RTS. For now, just turn + * them on always... + */ +#if 0 + init_attr.qpcaps = qhp->attr.enableRdmaRead | + (qhp->attr.enableRdmaWrite << 1) | + (qhp->attr.enableBind << 2) | + (qhp->attr.enable_stag0_fastreg << 3) | + (qhp->attr.enable_stag0_fastreg << 4); +#else + init_attr.qpcaps = 0x1f; +#endif + init_attr.tcp_emss = qhp->ep->emss; + init_attr.ord = qhp->attr.max_ord; + init_attr.ird = qhp->attr.max_ird; + init_attr.qp_dma_addr = qhp->wq.dma_addr; + init_attr.qp_dma_size = (1UL << qhp->wq.size_log2); + init_attr.rqes_posted = Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr) ? 
+ 0 : 1; + ret = cxio_rdma_init(&rhp->rdev, &init_attr); + PDBG("%s ret %d\n", __FUNCTION__, ret); + return ret; +} + +int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, + enum iwch_qp_attr_mask mask, + struct iwch_qp_attributes *attrs, + int internal) +{ + int ret = 0; + struct iwch_qp_attributes newattr = qhp->attr; + int flag; + int disconnect = 0; + int terminate = 0; + int abort = 0; + int free = 0; + struct iwch_ep *ep = NULL; + + PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, + qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, + (mask & IWCH_QP_ATTR_NEXT_STATE) ? attrs->next_state : -1); + + spin_lock_irqsave(&qhp->lock, flag); + + /* Process attr changes if in IDLE */ + if (mask & IWCH_QP_ATTR_VALID_MODIFY) { + if (qhp->attr.state != IWCH_QP_STATE_IDLE) { + ret = -EIO; + goto out; + } + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ) + newattr.enable_rdma_read = attrs->enable_rdma_read; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE) + newattr.enable_rdma_write = attrs->enable_rdma_write; + if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND) + newattr.enable_bind = attrs->enable_bind; + if (mask & IWCH_QP_ATTR_MAX_ORD) { + if (attrs->max_ord > + rhp->attr.max_rdma_read_qp_depth) { + ret = -EINVAL; + goto out; + } + newattr.max_ord = attrs->max_ord; + } + if (mask & IWCH_QP_ATTR_MAX_IRD) { + if (attrs->max_ird > + rhp->attr.max_rdma_reads_per_qp) { + ret = -EINVAL; + goto out; + } + newattr.max_ird = attrs->max_ird; + } + qhp->attr = newattr; + } + + if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) + goto out; + if (qhp->attr.state == attrs->next_state) + goto out; + + switch (qhp->attr.state) { + case IWCH_QP_STATE_IDLE: + switch (attrs->next_state) { + case IWCH_QP_STATE_RTS: + if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) { + ret = -EINVAL; + goto out; + } + if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) { + ret = -EINVAL; + goto out; + } + qhp->attr.mpa_attr = attrs->mpa_attr; + qhp->attr.llp_stream_handle = attrs->llp_stream_handle; + qhp->ep = 
qhp->attr.llp_stream_handle; + qhp->attr.state = IWCH_QP_STATE_RTS; + + /* + * Ref the endpoint here and deref when we + * disassociate the endpoint from the QP. This + * happens in CLOSING->IDLE transition or *->ERROR + * transition. + */ + atomic_inc(&qhp->ep->com.refcnt); + spin_unlock_irqrestore(&qhp->lock, flag); + ret = rdma_init(rhp, qhp, mask, attrs); + spin_lock_irqsave(&qhp->lock, flag); + if (ret) + goto err; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + flush_qp(qhp, &flag); + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_RTS: + switch (attrs->next_state) { + case IWCH_QP_STATE_CLOSING: + BUG_ON(atomic_read(&qhp->ep->com.refcnt) < 2); + qhp->attr.state = IWCH_QP_STATE_CLOSING; + if (!internal) { + abort=0; + disconnect = 1; + ep = qhp->ep; + } + break; + case IWCH_QP_STATE_TERMINATE: + qhp->attr.state = IWCH_QP_STATE_TERMINATE; + if (!internal) + terminate = 1; + break; + case IWCH_QP_STATE_ERROR: + qhp->attr.state = IWCH_QP_STATE_ERROR; + if (!internal) { + abort=1; + disconnect = 1; + ep = qhp->ep; + } + goto err; + break; + default: + ret = -EINVAL; + goto out; + } + break; + case IWCH_QP_STATE_CLOSING: + if (!internal) { + ret = -EINVAL; + goto out; + } + switch (attrs->next_state) { + case IWCH_QP_STATE_IDLE: + qhp->attr.state = IWCH_QP_STATE_IDLE; + qhp->attr.llp_stream_handle = NULL; + free_ep(&qhp->ep->com); + qhp->ep = NULL; + wake_up(&qhp->wait); + break; + case IWCH_QP_STATE_ERROR: + goto err; + default: + ret = -EINVAL; + goto err; + } + break; + case IWCH_QP_STATE_ERROR: + if (attrs->next_state != IWCH_QP_STATE_IDLE) { + ret = -EINVAL; + goto out; + } + + if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || + !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) { + ret = -EINVAL; + goto out; + } + qhp->attr.state = IWCH_QP_STATE_IDLE; + memset(&qhp->attr, 0, sizeof(qhp->attr)); + break; + case IWCH_QP_STATE_TERMINATE: + if (!internal) { + ret = -EINVAL; + goto out; + } + goto err; + 
break; + default: + printk(KERN_ERR "%s in a bad state %d\n", + __FUNCTION__, qhp->attr.state); + ret = -EINVAL; + goto err; + break; + } + goto out; +err: + PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, + qhp->wq.qpid); + + /* disassociate the LLP connection */ + qhp->attr.llp_stream_handle = NULL; + ep = qhp->ep; + qhp->ep = NULL; + qhp->attr.state = IWCH_QP_STATE_ERROR; + free=1; + wake_up(&qhp->wait); + BUG_ON(!ep); + flush_qp(qhp, &flag); +out: + spin_unlock_irqrestore(&qhp->lock, flag); + + if (terminate) + iwch_post_terminate(qhp, NULL); + + /* + * If disconnect is 1, then we need to initiate a disconnect + * on the EP. This can be a normal close (RTS->CLOSING) or + * an abnormal close (RTS/CLOSING->ERROR). + */ + if (disconnect) + iwch_ep_disconnect(ep, abort, GFP_KERNEL); + + /* + * If free is 1, then we've disassociated the EP from the QP + * and we need to dereference the EP. + */ + if (free) + free_ep(&ep->com); + + PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state); + return ret; +} + +static int quiesce_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_quiesce_tid(qhp->ep); + qhp->flags |= QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +static int resume_qp(struct iwch_qp *qhp) +{ + spin_lock_irq(&qhp->lock); + iwch_resume_tid(qhp->ep); + qhp->flags &= ~QP_QUIESCED; + spin_unlock_irq(&qhp->lock); + return 0; +} + +int iwch_quiesce_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = chp->rhp->qpid2ptr[i]; + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) { + quiesce_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) + quiesce_qp(qhp); + } + return 0; +} + +int iwch_resume_qps(struct iwch_cq *chp) +{ + int i; + struct iwch_qp *qhp; + + for (i=0; i < T3_MAX_NUM_QP; i++) { + qhp = chp->rhp->qpid2ptr[i]; + if (!qhp) + continue; + if ((qhp->attr.rcq == chp->cq.cqid) && 
qp_quiesced(qhp)) { + resume_qp(qhp); + continue; + } + if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp)) + resume_qp(qhp); + } + return 0; +} From swise at opengridcomputing.com Wed Nov 15 19:59:02 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:59:02 -0600 Subject: [openib-general] [PATCH 07/13] Async Event Handler In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035902.22635.63207.stgit@dell3.ogc.int> Code to handle async events coming from the T3 RDMA Core. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_ev.c | 228 +++++++++++++++++++++++++++++++++ 1 files changed, 228 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c new file mode 100644 index 0000000..0726fa6 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c @@ -0,0 +1,228 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include +#include +#include +#include "iwch_provider.h" +#include "iwch.h" +#include "iwch_cm.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp, + struct respQ_msg_t *rsp_msg, + enum ib_event_type ib_event, + int send_term) +{ + struct ib_event event; + struct iwch_qp_attributes attrs; + struct iwch_qp *qhp; + + printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + + spin_lock(&rnicp->lock); + qhp = rnicp->qpid2ptr[CQE_QPID(rsp_msg->cqe)]; + + if (!qhp) { + printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", + __FUNCTION__, CQE_STATUS(rsp_msg->cqe), + CQE_QPID(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + if ((qhp->attr.state == IWCH_QP_STATE_ERROR) || + (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) { + PDBG("%s AE received after RTS - " + "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, + qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + return; + } + + atomic_inc(&qhp->refcnt); + spin_unlock(&rnicp->lock); + + event.event = ib_event; + event.device = chp->ibcq.device; + if (ib_event == IB_EVENT_CQ_ERR) + event.element.cq = &chp->ibcq; + else + event.element.qp = &qhp->ibqp; + + if (qhp->ibqp.event_handler) + 
(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context); + + attrs.next_state = IWCH_QP_STATE_TERMINATE; + if (send_term && (qhp->attr.state == IWCH_QP_STATE_RTS) && + !iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 1)) + iwch_post_terminate(qhp, rsp_msg); + + if (atomic_dec_and_test(&qhp->refcnt)) + wake_up(&qhp->wait); +} + +void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb) +{ + struct iwch_dev *rnicp; + struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data; + struct iwch_cq *chp; + struct iwch_qp *qhp; + u32 cqid = RSPQ_CQID(rsp_msg); + + rnicp = (struct iwch_dev *) rdev_p->ulp; + spin_lock(&rnicp->lock); + chp = rnicp->cqid2ptr[cqid]; + qhp = rnicp->qpid2ptr[CQE_QPID(rsp_msg->cqe)]; + if (!chp || !qhp) { + printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d " + "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", + cqid, CQE_QPID(rsp_msg->cqe), + CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), + CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), + CQE_WRID_LOW(rsp_msg->cqe)); + spin_unlock(&rnicp->lock); + goto out; + } + iwch_qp_add_ref(&qhp->ibqp); + atomic_inc(&chp->refcnt); + spin_unlock(&rnicp->lock); + + /* + * 1) completion of our sending a TERMINATE. + * 2) incoming TERMINATE message. 
+ */ + if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && + (CQE_STATUS(rsp_msg->cqe) == 0)) { + if (SQ_TYPE(rsp_msg->cqe)) { + PDBG("%s QPID 0x%x ep %p disconnecting\n", + __FUNCTION__, qhp->wq.qpid, qhp->ep); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } else { + PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, + qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, + IB_EVENT_QP_REQ_ERR, 0); + iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC); + } + goto done; + } + + /* Bad incoming Read request */ + if (SQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + /* Bad incoming write */ + if (RQ_TYPE(rsp_msg->cqe) && + (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) { + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1); + goto done; + } + + switch (CQE_STATUS(rsp_msg->cqe)) { + + /* Completion Events */ + case TPT_ERR_SUCCESS: + + /* + * Confirm the destination entry if this is a RECV completion. 
+ */ + if (qhp->ep && SQ_TYPE(rsp_msg->cqe)) + dst_confirm(qhp->ep->dst); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + break; + + case TPT_ERR_STAG: + case TPT_ERR_PDID: + case TPT_ERR_QPID: + case TPT_ERR_ACCESS: + case TPT_ERR_WRAP: + case TPT_ERR_BOUND: + case TPT_ERR_INVALIDATE_SHARED_MR: + case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND: + printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x " + "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, + CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), + CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe), + CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe)); + (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1); + break; + + /* Device Fatal Errors */ + case TPT_ERR_ECC: + case TPT_ERR_ECC_PSTAG: + case TPT_ERR_INTERNAL_ERR: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1); + break; + + /* QP Fatal Errors */ + case TPT_ERR_OUT_OF_RQE: + case TPT_ERR_PBL_ADDR_BOUND: + case TPT_ERR_CRC: + case TPT_ERR_MARKER: + case TPT_ERR_PDU_LEN_ERR: + case TPT_ERR_DDP_VERSION: + case TPT_ERR_RDMA_VERSION: + case TPT_ERR_OPCODE: + case TPT_ERR_DDP_QUEUE_NUM: + case TPT_ERR_MSN: + case TPT_ERR_TBIT: + case TPT_ERR_MO: + case TPT_ERR_MSN_GAP: + case TPT_ERR_MSN_RANGE: + case TPT_ERR_RQE_ADDR_BOUND: + case TPT_ERR_IRD_OVERFLOW: + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + + default: + printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", + CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid); + post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1); + break; + } +done: + if (atomic_dec_and_test(&chp->refcnt)) + wake_up(&chp->wait); + iwch_qp_rem_ref(&qhp->ibqp); +out: + dev_kfree_skb_irq(skb); +} From swise at opengridcomputing.com Wed Nov 15 19:59:07 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:59:07 -0600 Subject: [openib-general] [PATCH 08/13] Memory Registration 
In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035907.22635.90998.stgit@dell3.ogc.int> Functions to register memory regions. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_mem.c | 175 ++++++++++++++++++++++++++++++++ 1 files changed, 175 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c new file mode 100644 index 0000000..9aad3a0 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c @@ -0,0 +1,175 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#include + +#include +#include + +#include "cxio_hal.h" +#include "iwch.h" +#include "iwch_provider.h" + +int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list) +{ + u32 stag; + u32 mmid; + + + if (cxio_register_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) { + return -ENOMEM; + } + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + rhp->mmid2ptr[mmid] = mhp; + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php, + struct iwch_mr *mhp, + int shift, + u64 *page_list, + int npages) +{ + u32 stag; + u32 mmid; + + + /* We could support this... 
*/ + if (npages > mhp->attr.pbl_size) + return -ENOMEM; + + stag = mhp->attr.stag; + if (cxio_reregister_phys_mem(&rhp->rdev, + &stag, mhp->attr.pdid, + mhp->attr.perms, + mhp->attr.zbva, + mhp->attr.va_fbo, + mhp->attr.len, + shift-12, + page_list, + &mhp->attr.pbl_size, &mhp->attr.pbl_addr)) { + return -ENOMEM; + } + mhp->attr.state = 1; + mhp->attr.stag = stag; + mmid = stag >> 8; + mhp->ibmr.rkey = mhp->ibmr.lkey = stag; + rhp->mmid2ptr[mmid] = mhp; + PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp); + return 0; +} + +int build_phys_page_list(struct ib_phys_buf *buffer_list, + int num_phys_buf, + u64 *iova_start, + u64 *total_size, + int *npages, + int *shift, + u64 **page_list) +{ + u64 mask; + int i, j, n; + + mask = 0; + *total_size = 0; + for (i = 0; i < num_phys_buf; ++i) { + if (i != 0 && buffer_list[i].addr & ~PAGE_MASK) + return -EINVAL; + if (i != 0 && i != num_phys_buf - 1 && + (buffer_list[i].size & ~PAGE_MASK)) + return -EINVAL; + *total_size += buffer_list[i].size; + if (i > 0) + mask |= buffer_list[i].addr; + } + + if (*total_size > 0xFFFFFFFFULL) + return -ENOMEM; + + /* Find largest page shift we can use to cover buffers */ + for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift)) + if (num_phys_buf > 1) { + if ((1ULL << *shift) & mask) + break; + } else { + if (1ULL << *shift >= + buffer_list[0].size + + (buffer_list[0].addr & ((1ULL << *shift) - 1))) + break; + } + + buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1); + buffer_list[0].addr &= ~0ull << *shift; + + *npages = 0; + for (i = 0; i < num_phys_buf; ++i) + *npages += (buffer_list[i].size + + (1ULL << *shift) - 1) >> *shift; + + if (!*npages) { + return -EINVAL; + } + + *page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL); + if (!*page_list) { + return -ENOMEM; + } + + n = 0; + for (i = 0; i < num_phys_buf; ++i) + for (j = 0; + j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift; + ++j) + (*page_list)[n++] = cpu_to_be64(buffer_list[i].addr + + ((u64) j << 
*shift)); + + PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n", + __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages); + + return 0; + +} From swise at opengridcomputing.com Wed Nov 15 19:59:12 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:59:12 -0600 Subject: [openib-general] [PATCH 09/13] Core WQE/CQE Types In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035912.22635.21736.stgit@dell3.ogc.int> T3 WQE and CQE structures, defines, etc... Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_wr.h | 658 ++++++++++++++++++++++++++++ 1 files changed, 658 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h new file mode 100644 index 0000000..ad84708 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h @@ -0,0 +1,658 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ +#ifndef __CXIO_WR_H__ +#define __CXIO_WR_H__ + +#include +#include +#include +#include "firmware_exports.h" + +#define T3_MAX_SGE 4 + +#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr)) +#define Q_FULL(rptr,wptr,size_log2) ( (((wptr)-(rptr))>>(size_log2)) && \ + ((rptr)!=(wptr)) ) +#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1)) +#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<> S_FW_RIWR_OP)) & M_FW_RIWR_OP) + +#define S_FW_RIWR_SOPEOP 22 +#define M_FW_RIWR_SOPEOP 0x3 +#define V_FW_RIWR_SOPEOP(x) ((x) << S_FW_RIWR_SOPEOP) + +#define S_FW_RIWR_FLAGS 8 +#define M_FW_RIWR_FLAGS 0x3fffff +#define V_FW_RIWR_FLAGS(x) ((x) << S_FW_RIWR_FLAGS) +#define G_FW_RIWR_FLAGS(x) ((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS) + +#define S_FW_RIWR_TID 8 +#define V_FW_RIWR_TID(x) ((x) << S_FW_RIWR_TID) + +#define S_FW_RIWR_LEN 0 +#define V_FW_RIWR_LEN(x) ((x) << S_FW_RIWR_LEN) + +#define S_FW_RIWR_GEN 31 +#define V_FW_RIWR_GEN(x) ((x) << S_FW_RIWR_GEN) + +struct t3_sge { + u32 stag; + u32 len; + u64 to; +}; + +/* If num_sgle is zero, flit 5+ contains immediate data.*/ +struct t3_send_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + + enum t3_rdma_opcode rdmaop:8; + u32 reserved:24; /* 2 */ + u32 rem_stag; /* 2 */ + u32 plen; /* 3 */ + u32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ +}; + +struct t3_local_inv_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 stag; /* 2 */ + u32 reserved3; +}; + +struct t3_rdma_write_wr { + struct fw_riwrh 
wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + enum t3_rdma_opcode rdmaop:8; /* 2 */ + u32 reserved:24; /* 2 */ + u32 stag_sink; + u64 to_sink; /* 3 */ + u32 plen; /* 4 */ + u32 num_sgle; + struct t3_sge sgl[T3_MAX_SGE]; /* 5+ */ +}; + +struct t3_rdma_read_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + enum t3_rdma_opcode rdmaop:8; /* 2 */ + u32 reserved:24; + u32 rem_stag; + u64 rem_to; /* 3 */ + u32 local_stag; /* 4 */ + u32 local_len; + u64 local_to; /* 5 */ +}; + +enum t3_addr_type { + T3_VA_BASED_TO = 0x0, + T3_ZERO_BASED_TO = 0x1 +} __attribute__ ((packed)); + +enum t3_mem_perms { + T3_MEM_ACCESS_LOCAL_READ = 0x1, + T3_MEM_ACCESS_LOCAL_WRITE = 0x2, + T3_MEM_ACCESS_REM_READ = 0x4, + T3_MEM_ACCESS_REM_WRITE = 0x8 +} __attribute__ ((packed)); + +struct t3_bind_mw_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 reserved:16; + enum t3_addr_type type:8; + enum t3_mem_perms perms:8; /* 2 */ + u32 mr_stag; + u32 mw_stag; /* 3 */ + u32 mw_len; + u64 mw_va; /* 4 */ + u32 mr_pbl_addr; /* 5 */ + u32 reserved2:24; + u32 mr_pagesz:8; +}; + +struct t3_receive_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u8 pagesz[T3_MAX_SGE]; + u32 num_sgle; /* 2 */ + struct t3_sge sgl[T3_MAX_SGE]; /* 3+ */ + u32 pbl_addr[T3_MAX_SGE]; +}; + +struct t3_bypass_wr { + struct fw_riwrh wrh; + union t3_wrid wrid; /* 1 */ +}; + +struct t3_modify_qp_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 flags; /* 2 */ + u32 quiesce; /* 2 */ + u32 max_ird; /* 3 */ + u32 max_ord; /* 3 */ + u64 sge_cmd; /* 4 */ + u64 ctx1; /* 5 */ + u64 ctx0; /* 6 */ +}; + +enum t3_modify_qp_flags { + MODQP_QUIESCE = 0x01, + MODQP_MAX_IRD = 0x02, + MODQP_MAX_ORD = 0x04, + MODQP_WRITE_EC = 0x08, + MODQP_READ_EC = 0x10, +}; + + +enum t3_mpa_attrs { + uP_RI_MPA_RX_MARKER_ENABLE = 0x1, + uP_RI_MPA_TX_MARKER_ENABLE = 0x2, + uP_RI_MPA_CRC_ENABLE = 0x4, + uP_RI_MPA_IETF_ENABLE = 0x8 +} __attribute__ ((packed)); + +enum t3_qp_caps { + 
uP_RI_QP_RDMA_READ_ENABLE = 0x01, + uP_RI_QP_RDMA_WRITE_ENABLE = 0x02, + uP_RI_QP_BIND_ENABLE = 0x04, + uP_RI_QP_FAST_REGISTER_ENABLE = 0x08, + uP_RI_QP_STAG0_ENABLE = 0x10 +} __attribute__ ((packed)); + +struct t3_rdma_init_attr { + u32 tid; + u32 qpid; + u32 pdid; + u32 scqid; + u32 rcqid; + u32 rq_addr; + u32 rq_size; + enum t3_mpa_attrs mpaattrs; + enum t3_qp_caps qpcaps; + u16 tcp_emss; + u32 ord; + u32 ird; + u64 qp_dma_addr; + u32 qp_dma_size; + u8 rqes_posted; +}; + +struct t3_rdma_init_wr { + struct fw_riwrh wrh; /* 0 */ + union t3_wrid wrid; /* 1 */ + u32 qpid; /* 2 */ + u32 pdid; + u32 scqid; /* 3 */ + u32 rcqid; + u32 rq_addr; /* 4 */ + u32 rq_size; + enum t3_mpa_attrs mpaattrs:8; /* 5 */ + enum t3_qp_caps qpcaps:8; + u32 ulpdu_size:16; + u32 rqes_posted; /* bits 31-1 - reserved */ + /* bit 0 - set if RECV posted */ + u32 ord; /* 6 */ + u32 ird; + u64 qp_dma_addr; /* 7 */ + u32 qp_dma_size; /* 8 */ + u32 rsvd; +}; + +union t3_wr { + struct t3_send_wr send; + struct t3_rdma_write_wr write; + struct t3_rdma_read_wr read; + struct t3_receive_wr recv; + struct t3_local_inv_wr local_inv; + struct t3_bind_mw_wr bind; + struct t3_bypass_wr bypass; + struct t3_rdma_init_wr init; + struct t3_modify_qp_wr qp_mod; + u64 flit[16]; +}; + +#define T3_SQ_CQE_FLIT 13 +#define T3_SQ_COOKIE_FLIT 14 + +#define T3_RQ_COOKIE_FLIT 13 +#define T3_RQ_CQE_FLIT 14 + +static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op, + enum t3_wr_flags flags, u8 genbit, u32 tid, + u8 len) +{ + wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) | + V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) | + V_FW_RIWR_FLAGS(flags)); + wmb(); + wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) | + V_FW_RIWR_TID(tid) | + V_FW_RIWR_LEN(len)); + /* 2nd gen bit... 
*/ + ((union t3_wr *)wqe)->flit[15] = cpu_to_be64(genbit); +} + +/* + * T3 ULP2_TX commands + */ +enum t3_utx_mem_op { + T3_UTX_MEM_READ = 2, + T3_UTX_MEM_WRITE = 3 +}; + +/* T3 MC7 RDMA TPT entry format */ + +enum tpt_mem_type { + TPT_NON_SHARED_MR = 0x0, + TPT_SHARED_MR = 0x1, + TPT_MW = 0x2, + TPT_MW_RELAXED_PROTECTION = 0x3 +}; + +enum tpt_addr_type { + TPT_ZBTO = 0, + TPT_VATO = 1 +}; + +enum tpt_mem_perm { + TPT_LOCAL_READ = 0x8, + TPT_LOCAL_WRITE = 0x4, + TPT_REMOTE_READ = 0x2, + TPT_REMOTE_WRITE = 0x1 +}; + +struct tpt_entry { + u32 valid_stag_pdid; + u32 flags_pagesize_qpid; + + u32 rsvd_pbl_addr; + u32 len; + u32 va_hi; + u32 va_low_or_fbo; + + u32 rsvd_bind_cnt_or_pstag; + u32 rsvd_pbl_size; +}; +#define S_TPT_VALID 31 +#define V_TPT_VALID(x) ((x) << S_TPT_VALID) +#define F_TPT_VALID V_TPT_VALID(1U) + +#define S_TPT_STAG_KEY 23 +#define M_TPT_STAG_KEY 0xFF +#define V_TPT_STAG_KEY(x) ((x) << S_TPT_STAG_KEY) +#define G_TPT_STAG_KEY(x) (((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY) + +#define S_TPT_STAG_STATE 22 +#define V_TPT_STAG_STATE(x) ((x) << S_TPT_STAG_STATE) +#define F_TPT_STAG_STATE V_TPT_STAG_STATE(1U) + +#define S_TPT_STAG_TYPE 20 +#define M_TPT_STAG_TYPE 0x3 +#define V_TPT_STAG_TYPE(x) ((x) << S_TPT_STAG_TYPE) +#define G_TPT_STAG_TYPE(x) (((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE) + +#define S_TPT_PDID 0 +#define M_TPT_PDID 0xFFFFF +#define V_TPT_PDID(x) ((x) << S_TPT_PDID) +#define G_TPT_PDID(x) (((x) >> S_TPT_PDID) & M_TPT_PDID) + +#define S_TPT_PERM 28 +#define M_TPT_PERM 0xF +#define V_TPT_PERM(x) ((x) << S_TPT_PERM) +#define G_TPT_PERM(x) (((x) >> S_TPT_PERM) & M_TPT_PERM) + +#define S_TPT_REM_INV_DIS 27 +#define V_TPT_REM_INV_DIS(x) ((x) << S_TPT_REM_INV_DIS) +#define F_TPT_REM_INV_DIS V_TPT_REM_INV_DIS(1U) + +#define S_TPT_ADDR_TYPE 26 +#define V_TPT_ADDR_TYPE(x) ((x) << S_TPT_ADDR_TYPE) +#define F_TPT_ADDR_TYPE V_TPT_ADDR_TYPE(1U) + +#define S_TPT_MW_BIND_ENABLE 25 +#define V_TPT_MW_BIND_ENABLE(x) ((x) << S_TPT_MW_BIND_ENABLE) +#define 
F_TPT_MW_BIND_ENABLE V_TPT_MW_BIND_ENABLE(1U) + +#define S_TPT_PAGE_SIZE 20 +#define M_TPT_PAGE_SIZE 0x1F +#define V_TPT_PAGE_SIZE(x) ((x) << S_TPT_PAGE_SIZE) +#define G_TPT_PAGE_SIZE(x) (((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE) + +#define S_TPT_PBL_ADDR 0 +#define M_TPT_PBL_ADDR 0x1FFFFFFF +#define V_TPT_PBL_ADDR(x) ((x) << S_TPT_PBL_ADDR) +#define G_TPT_PBL_ADDR(x) (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR) + +#define S_TPT_QPID 0 +#define M_TPT_QPID 0xFFFFF +#define V_TPT_QPID(x) ((x) << S_TPT_QPID) +#define G_TPT_QPID(x) (((x) >> S_TPT_QPID) & M_TPT_QPID) + +#define S_TPT_PSTAG 0 +#define M_TPT_PSTAG 0xFFFFFF +#define V_TPT_PSTAG(x) ((x) << S_TPT_PSTAG) +#define G_TPT_PSTAG(x) (((x) >> S_TPT_PSTAG) & M_TPT_PSTAG) + +#define S_TPT_PBL_SIZE 0 +#define M_TPT_PBL_SIZE 0xFFFFF +#define V_TPT_PBL_SIZE(x) ((x) << S_TPT_PBL_SIZE) +#define G_TPT_PBL_SIZE(x) (((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE) + +/* + * CQE defs + */ +struct t3_cqe { + u32 header:32; + u32 len:32; + u32 wrid_hi_stag:32; + u32 wrid_low_msn:32; +}; + +#define S_CQE_OOO 31 +#define M_CQE_OOO 0x1 +#define G_CQE_OOO(x) ((((x) >> S_CQE_OOO)) & M_CQE_OOO) +#define V_CEQ_OOO(x) ((x)<> S_CQE_QPID)) & M_CQE_QPID) +#define V_CQE_QPID(x) ((x)<> S_CQE_SWCQE)) & M_CQE_SWCQE) +#define V_CQE_SWCQE(x) ((x)<> S_CQE_GENBIT) & M_CQE_GENBIT) +#define V_CQE_GENBIT(x) ((x)<> S_CQE_STATUS)) & M_CQE_STATUS) +#define V_CQE_STATUS(x) ((x)<> S_CQE_TYPE)) & M_CQE_TYPE) +#define V_CQE_TYPE(x) ((x)<> S_CQE_OPCODE)) & M_CQE_OPCODE) +#define V_CQE_OPCODE(x) ((x)<queue->flit[13] = 1; +} + +static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; 
+ } + return NULL; +} + +static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq) +{ + struct t3_cqe *cqe; + + if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) { + cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2)); + return cqe; + } + cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2)); + if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe)) + return cqe; + return NULL; +} + +#endif From swise at opengridcomputing.com Wed Nov 15 19:59:23 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:59:23 -0600 Subject: [openib-general] [PATCH 11/13] Core Resource Allocation In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035923.22635.5397.stgit@dell3.ogc.int> Core functions to carve up adapter memory, stag, qp, and cq IDs. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_resource.c | 357 ++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/core/cxio_resource.h | 70 ++++ 2 files changed, 427 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c new file mode 100644 index 0000000..ada44b9 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c @@ -0,0 +1,357 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +/* Crude resource management */ +#include +#include +#include +#include +#include +#include +#include "cxio_resource.h" +#include "cxio_hal.h" + +static struct kfifo *rhdl_fifo; +static spinlock_t rhdl_fifo_lock; + +#define RANDOM_SIZE 16 + + +/* Loosely based on the Mersenne twister algorithm */ +static u32 next_random(u32 rand) +{ + u32 y, ylast; + + y = rand; + ylast = y; + y = (y * 69069) & 0xffffffff; + y = (y & 0x80000000) + (ylast & 0x7fffffff); + if ((y & 1)) + y = ylast ^ (y > 1) ^ (2567483615UL); + else + y = ylast ^ (y > 1); + y = y ^ (y >> 11); + y = y ^ ((y >> 7) & 2636928640UL); + y = y ^ ((y >> 15) & 4022730752UL); + y = y ^ (y << 18); + return y; +} +static int __cxio_init_resource_fifo(struct kfifo **fifo, + spinlock_t *fifo_lock, + u32 nr, u32 skip_low, + u32 skip_high, + int random) +{ + u32 i, j, entry = 0, idx; + u32 random_bytes; + u32 rarray[16]; + spin_lock_init(fifo_lock); + + *fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock); + if (IS_ERR(*fifo)) + return -ENOMEM; + + for (i = 0; i < skip_low + skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32)); + if (random) { + j = 0; + get_random_bytes(&random_bytes,sizeof(random_bytes)); + for (i = 0; i < RANDOM_SIZE; i++) + rarray[i] = i + skip_low; + for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) { + if (j >= RANDOM_SIZE) { + j = 0; + random_bytes = next_random(random_bytes); + } + idx = (random_bytes >> (j * 2)) & 0xF; + __kfifo_put(*fifo, + (unsigned char *) &rarray[idx], + sizeof(u32)); + rarray[idx] = i; + j++; + } + for (i = 0; i < RANDOM_SIZE; i++) + __kfifo_put(*fifo, + (unsigned char *) &rarray[i], + sizeof(u32)); + } else + for (i = skip_low; i < nr - skip_high; i++) + __kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32)); + + for (i = 0; i < skip_low + skip_high; i++) + kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32)); + return 0; +} + +static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock, + u32 nr, u32 
skip_low, u32 skip_high) +{ + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 0)); +} + +static int cxio_init_resource_fifo_random(struct kfifo **fifo, + spinlock_t * fifo_lock, + u32 nr, u32 skip_low, u32 skip_high) +{ + + return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, + skip_high, 1)); +} + +static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p) +{ + u32 i; + + spin_lock_init(&rdev_p->rscp->qpid_fifo_lock); + + rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), + GFP_KERNEL, + &rdev_p->rscp->qpid_fifo_lock); + if (IS_ERR(rdev_p->rscp->qpid_fifo)) + return -ENOMEM; + + for (i = 16; i < T3_MAX_NUM_QP; i++) { + if (!(i & rdev_p->qpmask)) + __kfifo_put(rdev_p->rscp->qpid_fifo, + (unsigned char *) &i, sizeof(u32)); + } + return 0; +} + +int cxio_hal_init_rhdl_resource(u32 nr_rhdl) +{ + return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1, + 0); +} + +void cxio_hal_destroy_rhdl_resource(void) +{ + kfifo_free(rhdl_fifo); +} + +/* nr_* must be power of 2 */ +int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid) +{ + int err = 0; + struct cxio_hal_resource *rscp; + + rscp = kmalloc(sizeof(*rscp), GFP_KERNEL); + if (!rscp) { + return -ENOMEM; + } + rdev_p->rscp = rscp; + err = cxio_init_resource_fifo_random(&rscp->tpt_fifo, + &rscp->tpt_fifo_lock, + nr_tpt, 1, 0); + if (err) + goto tpt_err; + err = cxio_init_qpid_fifo(rdev_p); + if (err) + goto qpid_err; + err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, + nr_cqid, 1, 0); + if (err) + goto cqid_err; + err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, + nr_pdid, 1, 0); + if (err) + goto pdid_err; + return 0; +pdid_err: + kfifo_free(rscp->cqid_fifo); +cqid_err: + kfifo_free(rscp->qpid_fifo); +qpid_err: + kfifo_free(rscp->tpt_fifo); +tpt_err: + return -ENOMEM; +} + +/* + * returns 0 if no resource available + */ +static 
inline u32 cxio_hal_get_resource(struct kfifo *fifo) +{ + u32 entry; + if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32))) + return entry; + else + return 0; /* fifo empty */ +} + +static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry) +{ + BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0); +} + +u32 cxio_hal_get_rhdl(void) +{ + return cxio_hal_get_resource(rhdl_fifo); +} + +void cxio_hal_put_rhdl(u32 rhdl) +{ + cxio_hal_put_resource(rhdl_fifo, rhdl); +} + +u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->tpt_fifo); +} + +void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag) +{ + cxio_hal_put_resource(rscp->tpt_fifo, stag); +} + +u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp) +{ + u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo); + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + return qpid; +} + +void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid) +{ + PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid); + cxio_hal_put_resource(rscp->qpid_fifo, qpid); +} + +u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->cqid_fifo); +} + +void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid) +{ + cxio_hal_put_resource(rscp->cqid_fifo, cqid); +} + +u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp) +{ + return cxio_hal_get_resource(rscp->pdid_fifo); +} + +void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid) +{ + cxio_hal_put_resource(rscp->pdid_fifo, pdid); +} + +void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp) +{ + kfifo_free(rscp->tpt_fifo); + kfifo_free(rscp->cqid_fifo); + kfifo_free(rscp->qpid_fifo); + kfifo_free(rscp->pdid_fifo); + kfree(rscp); +} + +/* + * PBL Memory Manager. Uses Linux generic allocator. 
+ */ + +#define MIN_PBL_SHIFT 8 /* 256B == min PBL size (32 entries) */ +#define PBL_CHUNK 2*1024*1024 + +u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size); + return (u32)addr; +} + +void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size); + gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size); +} + +int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1); + if (rdev_p->pbl_pool) { + for (i = rdev_p->rnic_info.pbl_base; + i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; + i += PBL_CHUNK) { + gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1); + } + } + return rdev_p->pbl_pool ? 0 : -ENOMEM; +} + +void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->pbl_pool); +} + +/* + * RQT Memory Manager. Uses Linux generic allocator. + */ + +#define MIN_RQT_SHIFT 10 /* 1KB == min RQT size (16 entries) */ +#define RQT_CHUNK 2*1024*1024 + +u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size) +{ + unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6); + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6); + return (u32)addr; +} + +void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size) +{ + PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6); + gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6); +} + +int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p) +{ + unsigned long i; + rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1); + if (rdev_p->rqt_pool) { + for (i = rdev_p->rnic_info.rqt_base; + i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; + i += RQT_CHUNK) { + gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1); + } + } + return rdev_p->rqt_pool ? 
0 : -ENOMEM; +} + +void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p) +{ + gen_pool_destroy(rdev_p->rqt_pool); +} diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h new file mode 100644 index 0000000..a6bbe83 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h @@ -0,0 +1,70 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifndef __CXIO_RESOURCE_H__ +#define __CXIO_RESOURCE_H__ + +#include +#include +#include +#include +#include +#include +#include +#include "cxio_hal.h" + +extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl); +extern void cxio_hal_destroy_rhdl_resource(void); +extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p, + u32 nr_tpt, u32 nr_pbl, + u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, + u32 nr_pdid); +extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag); +extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid); +extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp); +extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid); +extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp); + +#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base ) +extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); + +#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base ) +extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p); +extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p); +extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size); +extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size); +#endif From swise at opengridcomputing.com Wed Nov 15 19:59:28 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:59:28 -0600 Subject: [openib-general] [PATCH 12/13] Core Debug functions In-Reply-To: <20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035928.22635.28647.stgit@dell3.ogc.int> 
Debug code to dump various data structs, some of which are in adapter memory. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/core/cxio_dbg.c | 205 +++++++++++++++++++++++++++ 1 files changed, 205 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c new file mode 100644 index 0000000..ffe5261 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c @@ -0,0 +1,205 @@ +/* + * Copyright (c) 2006 Chelsio, Inc. All rights reserved. + * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ +#ifdef DEBUG +#include +#include "common.h" +#include "cxgb3_ioctl.h" +#include "cxio_hal.h" +#include "cxio_wr.h" + +void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size = 32; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base; + m->len = size; + PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift) +{ + struct ch_mem_range *m; + u64 *data; + int rc; + int size, npages; + + shift += 12; + npages = (len + (1ULL << shift) - 1) >> shift; + size = npages * sizeof(u64); + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = pbl_addr; + m->len = size; + PDBG("%s PBL addr 0x%x len %d depth %d\n", + __FUNCTION__, m->addr, m->len, npages); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_wqe(union t3_wr *wqe) +{ + u64 *data = (u64 *)wqe; + uint size = (uint)(be64_to_cpu(*data) & 0xff); + + if (size == 0) + size = 8; + while (size > 0) { + PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data)); + size--; + data++; + } +} + +void cxio_dump_wce(struct t3_cqe *wce) +{ + 
u64 *data = (u64 *)wce; + int size = sizeof(*wce); + + while (size > 0) { + PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data)); + size -= 8; + data++; + } +} + +void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents) +{ + struct ch_mem_range *m; + int size = nents * 64; + u64 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_PMRX; + m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base; + m->len = size; + PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u64 *)m->buf; + while (size > 0) { + PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data); + size -= 8; + data++; + m->addr += 8; + } + kfree(m); +} + +void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid) +{ + struct ch_mem_range *m; + int size = TCB_SIZE; + u32 *data; + int rc; + + m = kmalloc(sizeof(*m) + size, GFP_ATOMIC); + if (!m) { + PDBG("%s couldn't allocate memory.\n", __FUNCTION__); + return; + } + m->mem_id = MEM_CM; + m->addr = hwtid * size; + m->len = size; + PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len); + rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m); + if (rc) { + PDBG("%s toectl returned error %d\n", __FUNCTION__, rc); + kfree(m); + return; + } + + data = (u32 *)m->buf; + while (size > 0) { + printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", + m->addr, + *(data+2), *(data+3), *(data),*(data+1), + *(data+6), *(data+7), *(data+4), *(data+5)); + size -= 32; + data += 8; + m->addr += 32; + } + kfree(m); +} +#endif From swise at opengridcomputing.com Wed Nov 15 19:59:33 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 15 Nov 2006 21:59:33 -0600 Subject: [openib-general] [PATCH 13/13] Kconfig/Makefile In-Reply-To: 
<20061116035826.22635.61230.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> Message-ID: <20061116035933.22635.7293.stgit@dell3.ogc.int> Signed-off-by: Steve Wise --- drivers/infiniband/Kconfig | 1 + drivers/infiniband/Makefile | 1 + drivers/infiniband/hw/cxgb3/Kconfig | 27 +++++++++++++++++++++++++++ drivers/infiniband/hw/cxgb3/Makefile | 12 ++++++++++++ drivers/infiniband/hw/cxgb3/locking.txt | 25 +++++++++++++++++++++++++ 5 files changed, 66 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 59b3932..06453ab 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon source "drivers/infiniband/hw/ipath/Kconfig" source "drivers/infiniband/hw/ehca/Kconfig" source "drivers/infiniband/hw/amso1100/Kconfig" +source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/ulp/ipoib/Kconfig" diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index 570b30a..69bdd55 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += hw/mt obj-$(CONFIG_INFINIBAND_IPATH) += hw/ipath/ obj-$(CONFIG_INFINIBAND_EHCA) += hw/ehca/ obj-$(CONFIG_INFINIBAND_AMSO1100) += hw/amso1100/ +obj-$(CONFIG_INFINIBAND_CXGB3) += hw/cxgb3/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig new file mode 100644 index 0000000..84f0f6e --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Kconfig @@ -0,0 +1,27 @@ +config INFINIBAND_CXGB3 + tristate "Chelsio RDMA Driver" + depends on CHELSIO_T3 && INFINIBAND + select GENERIC_ALLOCATOR + ---help--- + This is an iWARP/RDMA driver for the Chelsio T3 1GbE and + 10GbE adapters. 
+ + For general information about Chelsio and our products, visit + our website at . + + For customer support, please visit our customer support page at + . + + Please send feedback to . + + To compile this driver as a module, choose M here: the module + will be called iw_cxgb3. + +config INFINIBAND_CXGB3_DEBUG + bool "Verbose debugging output" + depends on INFINIBAND_CXGB3 + default n + ---help--- + This option causes the Chelsio RDMA driver to produce copious + amounts of debug messages. Select this if you are developing + the driver or trying to diagnose a problem. diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile new file mode 100644 index 0000000..0df2b3d --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/Makefile @@ -0,0 +1,12 @@ +EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \ + -I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core + +obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o + +iw_cxgb3-y := iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \ + iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o + +ifdef CONFIG_INFINIBAND_CXGB3_DEBUG +EXTRA_CFLAGS += -DDEBUG -O1 -g +iw_cxgb3-y += core/cxio_dbg.o +endif diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt new file mode 100644 index 0000000..e5e9991 --- /dev/null +++ b/drivers/infiniband/hw/cxgb3/locking.txt @@ -0,0 +1,25 @@ +cq lock: + - spin lock + - used to synchronize the t3_cq + +qp lock: + - spin lock + - used to synchronize updates to the qp state, attrs, and the t3_wq. + - touched on interrupt and process context + +rnicp lock: + - spin lock + - touched on interrupt and process context + - used around lookup tables mapping CQID and QPID to a structure. + - used also to bump the refcnt atomically with the lookup. + +poll: + lock+disable on cq lock + lock qp lock for each cqe that is polled around the call + to cxio_poll_cq(). + +post: + lock+disable qp lock + +global mutex iwch_mutex: + used to maintain global device list. 
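The lock ordering described in locking.txt above can be sketched in user space as follows. This is only an illustration under stated assumptions: pthread mutexes stand in for the driver's spin locks (which, going by the "lock+disable" notes, are presumably taken with interrupts disabled), and every name prefixed sketch_ or SKETCH_ is hypothetical and does not appear in the patch. The point it tries to show is the ordering: the poll path takes the cq lock first and then the qp lock around each polled entry, while the post path takes only the qp lock.

```c
#include <pthread.h>
#include <stddef.h>

#define SKETCH_CQ_DEPTH 16

/* Hypothetical stand-in for a queue pair; not the driver's struct. */
struct sketch_qp {
	pthread_mutex_t lock;	/* plays the role of the "qp lock" */
	int posted;		/* work requests posted so far */
	int completed;		/* completions reaped by the poller */
};

/* Hypothetical stand-in for a completion queue. */
struct sketch_cq {
	pthread_mutex_t lock;	/* plays the role of the "cq lock" */
	struct sketch_qp *cqe[SKETCH_CQ_DEPTH];	/* owner of each pending entry */
	size_t head, tail;	/* consumer and producer counters */
};

void sketch_qp_init(struct sketch_qp *qp)
{
	pthread_mutex_init(&qp->lock, NULL);
	qp->posted = 0;
	qp->completed = 0;
}

void sketch_cq_init(struct sketch_cq *cq)
{
	pthread_mutex_init(&cq->lock, NULL);
	cq->head = 0;
	cq->tail = 0;
}

/* "post": only the qp lock is taken. Returns the running post count. */
int sketch_post(struct sketch_qp *qp)
{
	int n;

	pthread_mutex_lock(&qp->lock);
	n = ++qp->posted;
	pthread_mutex_unlock(&qp->lock);
	return n;
}

/* Simulates the adapter writing a completion entry; returns -1 if full. */
int sketch_hw_complete(struct sketch_cq *cq, struct sketch_qp *qp)
{
	int rc = -1;

	pthread_mutex_lock(&cq->lock);
	if (cq->tail - cq->head < SKETCH_CQ_DEPTH) {
		cq->cqe[cq->tail++ % SKETCH_CQ_DEPTH] = qp;
		rc = 0;
	}
	pthread_mutex_unlock(&cq->lock);
	return rc;
}

/* "poll": cq lock first, then the qp lock around each polled entry. */
int sketch_poll(struct sketch_cq *cq)
{
	int reaped = 0;

	pthread_mutex_lock(&cq->lock);
	while (cq->head != cq->tail) {
		struct sketch_qp *qp = cq->cqe[cq->head++ % SKETCH_CQ_DEPTH];

		pthread_mutex_lock(&qp->lock);
		qp->completed++;
		pthread_mutex_unlock(&qp->lock);
		reaped++;
	}
	pthread_mutex_unlock(&cq->lock);
	return reaped;
}
```

In this sketch the post path never takes the cq lock and the poll path always takes cq before qp, so the two paths cannot deadlock against each other. Whether the real driver relies on exactly this argument is an assumption on my part.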
From swise at opengridcomputing.com Wed Nov 15 20:04:24 2006 From: swise at opengridcomputing.com (Steve WIse) Date: Wed, 15 Nov 2006 22:04:24 -0600 Subject: [openib-general] [Fwd: [PATCH 00/13] Chelsio T3 RDMA Driver] Message-ID: <1163649864.4963.2.camel@linux-q667.site> Now the fun begins... -------- Forwarded Message -------- From: Steve Wise To: rdreier at cisco.com Cc: netdev at vger.kernel.org, linux-kernel at vger.kernel.org, openib-general at openib.org Subject: [openib-general] [PATCH 00/13] Chelsio T3 RDMA Driver Date: Wed, 15 Nov 2006 21:58:26 -0600 Roland / All: The following series implements the Chelsio T3 iWARP/RDMA Driver to be considered for inclusion in 2.6.20. It depends on the Chelsio T3 Ethernet Driver which is also under review now for 2.6.20 (http://marc.theaimsgroup.com/?l=linux-netdev&m=116363600816597&w=2). The patches are against 2.6.19-rc5. This patch series can also be pulled from: http://www.opengridcomputing.com/downloads/iw_cxgb3_patches.tar.bz2 The Chelsio T3 Ethernet Driver patch can be pulled from: http://service.chelsio.com/kernel.org/cxgb3.patch.bz2 Thanks, Steve. 
_______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at dev.mellanox.co.il Wed Nov 15 23:09:28 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 16 Nov 2006 09:09:28 +0200 Subject: [openib-general] ANNOUNCE: libibverbs and libmthca moving to git In-Reply-To: References: Message-ID: <200611160909.28853.jackm@dev.mellanox.co.il> On Thursday 16 November 2006 01:32, Roland Dreier wrote: > git clone git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git I tried invoking the above command, and got the following error: linux-hrf1:/mswg/work/jackm/g2.1>git clone git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git fatal: unable to connect a socket (Connection timed out) fetch-pack from 'git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git' failed. Any ideas as to why I can't connect a socket? (I'm not running as root, if that is an issue). - Jack From jackm at dev.mellanox.co.il Wed Nov 15 23:13:17 2006 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 16 Nov 2006 09:13:17 +0200 Subject: [openib-general] ANNOUNCE: libibverbs and libmthca moving to git In-Reply-To: <200611160909.28853.jackm@dev.mellanox.co.il> References: <200611160909.28853.jackm@dev.mellanox.co.il> Message-ID: <200611160913.17950.jackm@dev.mellanox.co.il> On Thursday 16 November 2006 09:09, Jack Morgenstein wrote: > On Thursday 16 November 2006 01:32, Roland Dreier wrote: > > git clone git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git > > I tried invoking the above command, and got the following error: > > > linux-hrf1:/mswg/work/jackm/g2.1>git clone git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git > fatal: unable to connect a socket (Connection timed out) > fetch-pack from 'git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git' failed. 
> > Any ideas as to why I can't connect a socket? (I'm not running as root, if that is an issue). > > - Jack > All is OK. (Oops). I was behind a firewall. - Jack From mst at mellanox.co.il Thu Nov 16 00:59:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Nov 2006 10:59:12 +0200 Subject: [openib-general] [PATCH] IB/ipoib: compliance/interoperability fix Message-ID: <20061116085911.GA15138@mellanox.co.il> ipoib assumes that high (reserved) octet in hardware address is 0, and copies it into the QPN. This violates RFC 4391 (which requires that the high 8 bits are ignored on receive), and will result in invalid QPN passed to hardware when inter-operating with IPoIB connected mode. Signed-off-by: Michael S. Tsirkin diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 1eaf00e..cdc98b1 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -49,6 +49,8 @@ #include #include +#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff) + MODULE_AUTHOR("Roland Dreier"); MODULE_DESCRIPTION("IP-over-InfiniBand net driver"); MODULE_LICENSE("Dual BSD/GPL"); @@ -517,6 +516,5 @@ static void neigh_add_path(struct sk_buf - ipoib_send(dev, skb, path->ah, - be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha)); } else { neigh->ah = NULL; __skb_queue_tail(&neigh->queue, skb); @@ -599,8 +594,7 @@ static void unicast_arp_send(struct sk_b ipoib_dbg(priv, "Send unicast ARP to %04x\n", be16_to_cpu(path->pathrec.dlid)); - ipoib_send(dev, skb, path->ah, - be32_to_cpup((__be32 *) phdr->hwaddr)); + ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr)); } else if ((path->query || !path_rec_start(dev, path)) && skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) { /* put pseudoheader back on for next time */ @@ -661,8 +655,7 @@ static int ipoib_start_xmit(struct sk_bu goto out; } - ipoib_send(dev, skb, 
neigh->ah, - be32_to_cpup((__be32 *) skb->dst->neighbour->ha)); + ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); goto out; } @@ -694,7 +687,7 @@ static int ipoib_start_xmit(struct sk_bu IPOIB_GID_FMT "\n", skb->dst ? "neigh" : "dst", be16_to_cpup((__be16 *) skb->data), - be32_to_cpup((__be32 *) phdr->hwaddr), + IPOIB_QPN(phdr->hwaddr), IPOIB_GID_RAW_ARG(phdr->hwaddr + 4)); dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; @@ -777,7 +770,7 @@ static void ipoib_neigh_destructor(struc ipoib_dbg(priv, "neigh_destructor for %06x " IPOIB_GID_FMT "\n", - be32_to_cpup((__be32 *) n->ha), + IPOIB_QPN(n->ha), IPOIB_GID_RAW_ARG(n->ha + 4)); spin_lock_irqsave(&priv->lock, flags); -- MST From vera_wx_cn at yahoo.com.cn Thu Nov 16 01:06:47 2006 From: vera_wx_cn at yahoo.com.cn (=?gb2312?q?=C7=BF=20=C2=ED?=) Date: Thu, 16 Nov 2006 17:06:47 +0800 (CST) Subject: [openib-general] I need your help. Message-ID: <20061116090647.81886.qmail@web92002.mail.cnb.yahoo.com> Hello! sir. I 've been developing my mpich projects on infiniband cluster for two months. $ ibstat CA type: MT25204 Number of ports: 1 Firmware version: 1.1.0 Hardware version: a0 Node GUID: 0xe865620060529997 System image GUID: 0xe86562006052999a Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 82 LMC: 0 SM lid: 82 Capability mask: 0x02510a6a Port GUID: 0xe865620060529998 I've downloaded << Mellanox IB-Verbs API (VAPI) >>, but I works on openib version. Would you mind telling me where I can download the API manual about OpenIB? thank you in advance. Wang. Nov.15 --------------------------------- Mp3疯狂搜-新歌热歌高速下 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From madhu.lakshmanan at qlogic.com Thu Nov 16 02:55:09 2006 From: madhu.lakshmanan at qlogic.com (Madhu Lakshmanan) Date: Thu, 16 Nov 2006 04:55:09 -0600 Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE061190BDD64@EPEXCH2.qlogic.org> > From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of > Michael S. Tsirkin > Subject: Re: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O > Controller (VEx) > > Quoting Ramachandra K : > > Subject: [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) > > > > This patch set adds support for the QLogic Virtual Ethernet I/O > > controller (VEx), which presents a true Ethernet NIC to the host. > > > > This driver provides a standard Ethernet NIC interface to the system and > > treats IB as an I/O bus to allow a host CPU to use the VEx card as its NIC. > > Is the VEx wire protocol documented somewhere? For example, what is a viport? > What is a netpath? It's somewhat hard to understand the code without the > protocol spec it is trying to implement. > > -- > MST The VNIC software is a device driver for a remote device on the IB fabric, the VEx. We have followed the convention and standard set by previous submitters of device driver code to either OpenFabrics or the Linux kernel, in that the code is the documentation. The device drivers for mthca, ehca, ipath, or for that matter, Ethernet NICs like the Intel Pro 1000, do not document the "protocol" they implement when managing their device over PCI or PCI-X. The VNIC manages the VEx over the IB bus and as such it is a device driver in the same class as those mentioned above. Madhu From mst at mellanox.co.il Thu Nov 16 04:16:47 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 16 Nov 2006 14:16:47 +0200 Subject: [openib-general] [PATCH] IB/ipoib: fix skb leak Message-ID: <20061116121647.GD30305@mellanox.co.il> ipoib_neigh_free() is sometimes called while the neighbour is still alive, so it might still have queued skbs. Fix the skb leak in this case. Signed-off-by: Michael S. Tsirkin --- Hi, Roland! I saw this potential issue when I went over the code. What do you think? diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index e5b793d..c0fb316 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -260,7 +279,7 @@ static inline struct ipoib_neigh **to_ip } struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh); -void ipoib_neigh_free(struct ipoib_neigh *neigh); +void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh); extern struct workqueue_struct *ipoib_workqueue; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 1eaf00e..ac7e421 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -262,7 +264,7 @@ static void path_free(struct net_device if (neigh->ah) ipoib_put_ah(neigh->ah); - ipoib_neigh_free(neigh); + ipoib_neigh_free(dev, neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -517,9 +516,10 @@ static void neigh_add_path(struct sk_buf } else { neigh->ah = NULL; - __skb_queue_tail(&neigh->queue, skb); if (!path->query && path_rec_start(dev, path)) goto err_list; + + __skb_queue_tail(&neigh->queue, skb); } spin_unlock(&priv->lock); @@ -537,7 +533,7 @@ err_list: list_del(&neigh->list); err_path: - ipoib_neigh_free(neigh); + ipoib_neigh_free(dev, neigh); ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -655,9 +650,9 @@ static int ipoib_start_xmit(struct sk_bu */ ipoib_put_ah(neigh->ah); list_del(&neigh->list); - ipoib_neigh_free(neigh); + ipoib_neigh_free(dev, neigh); spin_unlock(&priv->lock); ipoib_path_lookup(skb, dev); 
goto out; } @@ -787,7 +781,7 @@ static void ipoib_neigh_destructor(struc if (neigh->ah) ah = neigh->ah; list_del(&neigh->list); - ipoib_neigh_free(neigh); + ipoib_neigh_free(dev, neigh); } spin_unlock_irqrestore(&priv->lock, flags); @@ -810,9 +804,15 @@ struct ipoib_neigh *ipoib_neigh_alloc(st return neigh; } -void ipoib_neigh_free(struct ipoib_neigh *neigh) +void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) { + struct ipoib_dev_priv *priv = netdev_priv(dev); + struct sk_buff *skb; *to_ipoib_neigh(neigh->neighbour) = NULL; + while ((skb = __skb_dequeue(&neigh->queue))) { + ++priv->stats.tx_dropped; + dev_kfree_skb_any(skb); + } kfree(neigh); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 3faa182..d282d65 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -114,7 +114,7 @@ static void ipoib_mcast_free(struct ipoi */ if (neigh->ah) ipoib_put_ah(neigh->ah); - ipoib_neigh_free(neigh); + ipoib_neigh_free(dev, neigh); } spin_unlock_irqrestore(&priv->lock, flags); -- MST From chevchenkovic at gmail.com Thu Nov 16 04:35:35 2006 From: chevchenkovic at gmail.com (Chevchenkovic Chevchenkovic) Date: Thu, 16 Nov 2006 18:05:35 +0530 Subject: [openib-general] Send_bw in UD Message-ID: <1c16cdf90611160435x4c57f4ebxd17cc9ae63ca7eb6@mail.gmail.com> Hi, Infiniband specification says that the completion notification in case of RC occurs when the data has actually reached the destination buffer. Whereas for UD it is given when the data is placed on the infiniband line. I was going through the code send_bw.c( https://openfabrics.org/svn/gen2/tags/openib-1.0-rc1/src/userspace/perftest/send_bw.c ). This tells the time taken for the data to reach the destination. In the case of UD the same code is used. Should it not have the code which waits for the acknowledgement from the destination? 
Alternately, is the bandwidth computation wrong in this case? Comments will be welcome, -Chev From yosefe at voltaire.com Thu Nov 16 05:10:44 2006 From: yosefe at voltaire.com (Yosef Eitgin) Date: Thu, 16 Nov 2006 15:10:44 +0200 Subject: [openib-general] Mellanox ibtp requires vl.h which is not found Message-ID: Hello, Many of the Mellanox tests require a header file named "vl.h" For example: https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/qp_test/main.c Where can I find it? It's not anywhere in /usr/local/ofed nor /usr/include ... Thanks _______________________________________________________________ Yosef Etigin, ib-host-stack | +972-9-971-7630 (o) | +972-54-218 8036(m) Voltaire - The Grid Backbone www.voltaire.com From vlad at dev.mellanox.co.il Thu Nov 16 05:16:17 2006 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 16 Nov 2006 15:16:17 +0200 Subject: [openib-general] Mellanox ibtp requires vl.h which is not found In-Reply-To: References: Message-ID: <455C64A1.5010401@dev.mellanox.co.il> Hi Yosef, You can find the vl library under: https://openib.org/svn/trunk/contrib/mellanox/ibtp/common/tools/vl Regards, Vladimir Yosef Eitgin wrote: > > Hello, > > Many of the Mellanox tests require a header file named “vl.h” > > For example: > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/useraccess/qp_test/main.c > > Where can I find it? 
It’s not anywhere in /usr/local/ofed nor > /usr/include … > > Thanks > > _______________________________________________________________ > > Yosef Etigin, ib-host-stack | +972-9-971-7630 (o) | +972-54-218 8036(m) > > Voltaire – _The Grid Backbone_ > > www.voltaire.com > > ------------------------------------------------------------------------ > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dotanb at dev.mellanox.co.il Thu Nov 16 05:28:43 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Thu, 16 Nov 2006 15:28:43 +0200 (IST) Subject: [openib-general] Mellanox ibtp requires vl.h which is not found In-Reply-To: References: Message-ID: <11683.194.90.237.34.1163683723.squirrel@dev.mellanox.co.il> Hi Yosef. > Hello, > > Many of the Mellanox tests require a header file named "vl.h" > > For example: > https://openib.org/svn/trunk/contrib/mellanox/ibtp/gen2/userspace/userac > cess/qp_test/main.c > > > > Where can I find it? It's not anywhere in /usr/local/ofed nor > /usr/include ... > > > > Thanks The VL library can be found in the following URL: https://openib.org/svn/trunk/contrib/mellanox/ibtp/common/tools/vl/ thanks Dotan From dotanb at dev.mellanox.co.il Thu Nov 16 06:06:41 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Thu, 16 Nov 2006 16:06:41 +0200 (IST) Subject: [openib-general] Send_bw in UD In-Reply-To: <1c16cdf90611160435x4c57f4ebxd17cc9ae63ca7eb6@mail.gmail.com> References: <1c16cdf90611160435x4c57f4ebxd17cc9ae63ca7eb6@mail.gmail.com> Message-ID: <19748.194.90.237.34.1163686001.squirrel@dev.mellanox.co.il> Hi. > Hi, > Infiniband specification says that the completion notification in case > of > RC occurs when the data has actually reached the destination buffer. 
> Whereas > for UD it is given when the data is placed on the infiniband line. You are absolutely right. > I was going through the code send_bw.c( > https://openfabrics.org/svn/gen2/tags/openib-1.0-rc1/src/userspace/perftest/send_bw.c > ). > This tells the time taken for the data to reach the destination. In the > case > of UD the same code is used. Should it not have the code which waits for > the > acknowledgement from the destination? > Alternately, is the bandwidth computation wrong in this case? > Comments will be welcome, > -Chev This test is a pingpong test, so if data is being received from the remote side (even for UD QPs) that means that he got the data... This test assumes that no packet was dropped. Dotan From mst at mellanox.co.il Thu Nov 16 06:16:53 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Nov 2006 16:16:53 +0200 Subject: [openib-general] Send_bw in UD In-Reply-To: <19748.194.90.237.34.1163686001.squirrel@dev.mellanox.co.il> References: <1c16cdf90611160435x4c57f4ebxd17cc9ae63ca7eb6@mail.gmail.com> <19748.194.90.237.34.1163686001.squirrel@dev.mellanox.co.il> Message-ID: <20061116141653.GB8811@mellanox.co.il> > Quoting r. dotanb at dev.mellanox.co.il : > > I was going through the code send_bw.c( > > https://openfabrics.org/svn/gen2/tags/openib-1.0-rc1/src/userspace/perftest/send_bw.c > > ). > > This test is a pingpong test I think that there's no ping pong in send_bw - it measures one way streaming bw. We have the following comment at line 815: /* client is posting and not receiving. 
*/ -- MST From dotanb at dev.mellanox.co.il Thu Nov 16 06:16:55 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Thu, 16 Nov 2006 16:16:55 +0200 (IST) Subject: [openib-general] what should happen in a completion event channel is being destroyed when there are several CQs associated to it? In-Reply-To: References: <4553480F.80000@dev.mellanox.co.il> Message-ID: <21986.194.90.237.34.1163686615.squirrel@dev.mellanox.co.il> Hi Roland. > > What should happen in a completion event channel is being destroyed > > when there are several CQs associated to it? > > Should this operation fail (return EBUSY)? > > I think that would be the most consistent thing, since we return EBUSY > for example if a CQ is destroyed with QPs still attached. > > > When i tried to do it and later on try to wait for a completion on > > this event channel i got seg fault... > > Does the destroy succeed? > > Anyway I'll look at this code to see if it seems OK. > > - R. > I'm writing the man page for this verb, so which behaviour should I document: the current behaviour or the future behaviour? For now, I'm documenting the current behaviour. Thanks, Dotan From mst at mellanox.co.il Thu Nov 16 06:24:56 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Nov 2006 16:24:56 +0200 Subject: [openib-general] Send_bw in UD In-Reply-To: <1c16cdf90611160435x4c57f4ebxd17cc9ae63ca7eb6@mail.gmail.com> References: <1c16cdf90611160435x4c57f4ebxd17cc9ae63ca7eb6@mail.gmail.com> Message-ID: <20061116142456.GC8811@mellanox.co.il> > Quoting r. Chevchenkovic Chevchenkovic : > Subject: Send_bw in UD > > Hi, > Infiniband specification says that the completion notification in case of RC > occurs when the data has actually reached the destination buffer. 
Whereas for > UD it is given when the data is placed on the infiniband line. > I was going through the code send_bw.c( https://openfabrics.org/svn/gen2/ > tags/openib-1.0-rc1/src/userspace/perftest/send_bw.c ). This tells the time > taken for the data to reach the destination. No, this test measures streaming bandwidth. Compare this to a UDP bandwidth test. > In the case of UD the same code is used. Should it not have the code which > waits for the acknowledgement from the destination? Once, at the end of the test? I believe the difference will be negligible, and the test will get more confusing. > Alternately, is the bandwidth computation wrong in this case? > Comments will be welcome, The computation is performed correctly. The test currently will simply block forever on the server side if there is any packet loss. If the test runs to completion, no packets were lost, which means that the streaming bandwidth was measured correctly. -- MST From dotanb at dev.mellanox.co.il Thu Nov 16 06:49:50 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Thu, 16 Nov 2006 16:49:50 +0200 (IST) Subject: [openib-general] First draft of the man pages Message-ID: <29097.194.90.237.34.1163688590.squirrel@dev.mellanox.co.il> Hi. Attached is the first draft of the man pages for libibverbs. I hope that in the next few weeks the man pages will be committed to the openib svn (I guess with several changes). Feedback is always welcome. Dotan -------------- next part -------------- A non-text attachment was scrubbed... Name: man_pages.tar.gz Type: application/x-gzip Size: 20537 bytes Desc: not available URL: From sashak at voltaire.com Thu Nov 16 06:58:16 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Nov 2006 16:58:16 +0200 Subject: [openib-general] [PATCH 1/2] libibumad/libibmad/diags: fix printf style uses Message-ID: <11636890971762-git-send-email-sashak@voltaire.com> This fixes various incorrect uses of printf() style functions (format specifiers that do not match their arguments). 
Signed-off-by: Sasha Khapyorsky --- diags/src/ibnetdiscover.c | 7 ++++--- diags/src/ibtracert.c | 15 ++++++++------- libibmad/src/rpc.c | 2 +- libibumad/src/umad.c | 16 ++++++++-------- 4 files changed, 21 insertions(+), 19 deletions(-) diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c index c6e35e4..612aee0 100644 --- a/diags/src/ibnetdiscover.c +++ b/diags/src/ibnetdiscover.c @@ -44,6 +44,7 @@ #include #include #include #include +#include #define __BUILD_VERSION_TAG__ 1.2 #include @@ -175,8 +176,8 @@ get_node(Node *node, Port *port, ib_port mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); } - DEBUG("portid %s: got switch node %Lx '%s'", - portid2str(portid), node->nodeguid, nd); + DEBUG("portid %s: got switch node %" PRIx64 " '%s'", + portid2str(portid), node->nodeguid, node->nodedesc); return 1; } @@ -242,7 +243,7 @@ insert_node(Node *new) for (node = nodestbl[hash]; node; node = node->htnext) if (node->nodeguid == new->nodeguid) { - DEBUG("node %Lx already exists", new->nodeguid); + DEBUG("node %" PRIx64 " already exists", new->nodeguid); return node; } diff --git a/diags/src/ibtracert.c b/diags/src/ibtracert.c index 64dbe00..56c312d 100644 --- a/diags/src/ibtracert.c +++ b/diags/src/ibtracert.c @@ -43,6 +43,7 @@ #include #include #include #include +#include #define __BUILD_VERSION_TAG__ 1.2 #include @@ -166,7 +167,7 @@ get_node(Node *node, Port *port, ib_port mad_decode_field(pi, IB_PORT_LMC_F, &port->lmc); mad_decode_field(pi, IB_PORT_STATE_F, &port->state); - DEBUG("portid %s: got node %Lx '%s'", portid2str(portid), node->nodeguid, nd); + DEBUG("portid %s: got node %" PRIx64 " '%s'", portid2str(portid), node->nodeguid, node->nodedesc); return 0; } @@ -332,7 +333,7 @@ find_route(ib_portid_t *from, ib_portid_ DEBUG("ca or router node"); if (!sameport(port, &fromport)) { - IBWARN("can't continue: reached CA or router port %Lx, lid %d", + IBWARN("can't continue: reached CA or router port %" PRIx64 ", lid %d", port->portguid, 
port->lid); return -1; } @@ -378,7 +379,7 @@ badoutport: return -1; badtbl: IBWARN("Bad forwarding table entry found at: node \"%s\" lid entry %d is %d (top %d)", - node->nodedesc, to, outport, sw.linearFDBtop); + node->nodedesc, to->lid, outport, sw.linearFDBtop); return -1; badpath: IBWARN("Direct path too long!"); @@ -402,7 +403,7 @@ insert_node(Node *new) for (node = nodestbl[hash]; node; node = node->htnext) if (node->nodeguid == new->nodeguid) { - DEBUG("node %Lx already exists", new->nodeguid); + DEBUG("node %" PRIx64 " already exists", new->nodeguid); return -1; } @@ -501,7 +502,7 @@ switch_mclookup(Node *node, ib_portid_t *map = 1; else continue; - VERBOSE("Switch guid 0x%Lx: mlid 0x%x is forwarded to port %d", + VERBOSE("Switch guid 0x%" PRIx64 ": mlid 0x%x is forwarded to port %d", node->nodeguid, mlid + 0xc000, i + set * 16); } } @@ -565,7 +566,7 @@ find_mcpath(ib_portid_t *from, int mlid) leafport = path->drpath.p[path->drpath.cnt]; map[port->portnum] = 1; node->upport = 0; /* starting here */ - DEBUG("Starting from CA 0x%Lx lid %d port %d (leafport %d)", + DEBUG("Starting from CA 0x%" PRIx64 " lid %d port %d (leafport %d)", node->nodeguid, port->lid, port->portnum, leafport); } else { /* switch */ @@ -574,7 +575,7 @@ find_mcpath(ib_portid_t *from, int mlid) node->upport = leafport; if (switch_mclookup(node, path, mlid, map) < 0) { - IBWARN("skipping bad Switch 0x%Lx lid %d", + IBWARN("skipping bad Switch 0x%" PRIx64 " lid %" PRIx64 "", node->nodeguid, port->portguid); continue; } diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index 142f8d8..3164b12 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -123,7 +123,7 @@ _do_madrpc(int port_id, void *sndbuf, vo timeout = def_madrpc_timeout; if (ibdebug > 1) { - IBWARN(">>> sending: len %d pktsz %d", len, umad_size() + len); + IBWARN(">>> sending: len %d pktsz %zu", len, umad_size() + len); xdump(stderr, "send buf\n", sndbuf, umad_size() + len); } diff --git a/libibumad/src/umad.c 
b/libibumad/src/umad.c index 71b6833..ee9f65f 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -473,7 +473,7 @@ umad_init(void) { uint abi_version; - TRACE(""); + TRACE("umad_init"); if (sys_read_uint(IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE, &abi_version) < 0) { IBWARN("can't read ABI version from %s/%s (%m): is ib_umad module loaded?", IB_UMAD_ABI_DIR, IB_UMAD_ABI_FILE); @@ -490,7 +490,7 @@ umad_init(void) int umad_done(void) { - TRACE(""); + TRACE("umad_done"); /* FIXME - verify that all ports are closed */ return 0; } @@ -756,7 +756,7 @@ umad_send(int portid, int agentid, void if (n == length + sizeof *mad) return 0; - DEBUG("write returned %d != sizeof umad %d + length %d (%m)", + DEBUG("write returned %d != sizeof umad %zu + length %d (%m)", n, sizeof *mad, length); if (!errno) errno = EIO; @@ -824,7 +824,7 @@ umad_recv(int portid, void *umad, int *l return n; } - DEBUG("read returned %d > sizeof umad %d + length %d (%m)", + DEBUG("read returned %zu > sizeof umad %zu + length %d (%m)", mad->length - sizeof *mad, sizeof *mad, *length); *length = mad->length - sizeof *mad; @@ -888,12 +888,12 @@ umad_register_oui(int portid, int mgmt_c memset(req.method_mask, 0, sizeof req.method_mask); if (!ioctl(port->dev_fd, IB_USER_MAD_REGISTER_AGENT, (void *)&req)) { - DEBUG("portid %d registered to use agent %d qp %d class 0x%x oui 0x%x", - portid, req.id, req.qpn, oui); + DEBUG("portid %d registered to use agent %d qp %d class 0x%x oui %p", + portid, req.id, req.qpn, req.mgmt_class, oui); return req.id; /* return agentid */ } - DEBUG("portid %d registering qp %d class 0x%x version %d oui 0x%x failed: %m", + DEBUG("portid %d registering qp %d class 0x%x version %d oui %p failed: %m", portid, req.qpn, req.mgmt_class, req.mgmt_class_version, oui); return -EPERM; } @@ -941,7 +941,7 @@ umad_unregister(int portid, int agentid) { Port *port; - TRACE("portid %d unregistering agent %d", agentid); + TRACE("portid %d unregistering agent %d", portid, agentid); if (!(port = 
port_get(portid))) return -EINVAL; -- 1.4.3.2.g4bf7 From sashak at voltaire.com Thu Nov 16 06:58:17 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Nov 2006 16:58:17 +0200 Subject: [openib-general] [PATCH 2/2] libibcommon: enable printf() style format strict checking In-Reply-To: <11636890971762-git-send-email-sashak@voltaire.com> References: <11636890971762-git-send-email-sashak@voltaire.com> Message-ID: <11636891023166-git-send-email-sashak@voltaire.com> This enables strict format/args checking for printf() style functions. Signed-off-by: Sasha Khapyorsky --- libibcommon/include/infiniband/common.h | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index 83c0679..66afab0 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -114,11 +114,16 @@ #endif #define ENUM_STR_DEF(enumname, last, val) (((unsigned)(val) < last) ? enumname ## _str[val] : "???") #define ENUM_STR_ARRAY(name) char * name ## _str[] +#ifdef __GNUC__ +#define STRICT_FORMAT __attribute__((format(printf, 2, 3))) +#else +#define STRICT_FORMAT +#endif /* util.c: debugging and tracing */ -void ibwarn(const char * const fn, char *msg, ...); -void ibpanic(const char * const fn, char *msg, ...); -void logmsg(const char *const fn, char *msg, ...); +void ibwarn(const char * const fn, char *msg, ...) STRICT_FORMAT; +void ibpanic(const char * const fn, char *msg, ...) STRICT_FORMAT; +void logmsg(const char *const fn, char *msg, ...) STRICT_FORMAT; void xdump(FILE *file, char *msg, void *p, int size); -- 1.4.3.2.g4bf7 From mst at mellanox.co.il Thu Nov 16 07:03:33 2006 From: mst at mellanox.co.il (Michael S. 
Tsirkin) Date: Thu, 16 Nov 2006 17:03:33 +0200 Subject: [openib-general] [PATCH 2/2] libibcommon: enable printf() style format strict checking In-Reply-To: <11636891023166-git-send-email-sashak@voltaire.com> References: <11636890971762-git-send-email-sashak@voltaire.com> <11636891023166-git-send-email-sashak@voltaire.com> Message-ID: <20061116150333.GE8811@mellanox.co.il> > diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h > index 83c0679..66afab0 100644 > --- a/libibcommon/include/infiniband/common.h > +++ b/libibcommon/include/infiniband/common.h > @@ -114,11 +114,16 @@ #endif > #define ENUM_STR_DEF(enumname, last, val) (((unsigned)(val) < last) ? enumname ## _str[val] : "???") > #define ENUM_STR_ARRAY(name) char * name ## _str[] > > +#ifdef __GNUC__ > +#define STRICT_FORMAT __attribute__((format(printf, 2, 3))) > +#else > +#define STRICT_FORMAT > +#endif You are polluting the global namespace - macros must be prefixed with library name. But anyway - why is this necessary? Does anyone actually try compiling libibcommon not in gcc? Why? And AFAIK e.g. intel compiler implements this __attribute__. > /* util.c: debugging and tracing */ > -void ibwarn(const char * const fn, char *msg, ...); > -void ibpanic(const char * const fn, char *msg, ...); > -void logmsg(const char *const fn, char *msg, ...); > +void ibwarn(const char * const fn, char *msg, ...) STRICT_FORMAT; > +void ibpanic(const char * const fn, char *msg, ...) STRICT_FORMAT; > +void logmsg(const char *const fn, char *msg, ...) 
STRICT_FORMAT; > > void xdump(FILE *file, char *msg, void *p, int size); -- MST From pradeep at us.ibm.com Thu Nov 16 07:15:23 2006 From: pradeep at us.ibm.com (Pradeep Satyanarayana) Date: Thu, 16 Nov 2006 07:15:23 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: Message-ID: openib-general-bounces at openib.org wrote on 11/14/2006 03:18:23 PM: > Shirley> The rotting packet situation consistently happens for > Shirley> ehca driver. The napi could poll forever with your > Shirley> original patch. That's the reason I defer the rotting > Shirley> packet process in next napi poll. > > Hmm, I don't see it. In my latest patch, the poll routine does: > > repoll: > done = 0; > empty = 0; > > while (max) { > t = min(IPOIB_NUM_WC, max); > n = ib_poll_cq(priv->cq, t, priv->ibwc); > > for (i = 0; i < n; ++i) { > if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) { > ++done; > --max; > ipoib_ib_handle_rx_wc(dev, priv->ibwc + i); > } else > ipoib_ib_handle_tx_wc(dev, priv->ibwc + i); > } > > if (n != t) { > empty = 1; > break; > } > } > > dev->quota -= done; > *budget -= done; > > if (empty) { > netif_rx_complete(dev); > if (unlikely(ib_req_notify_cq(priv->cq, > IB_CQ_NEXT_COMP | > IB_CQ_REPORT_MISSED_EVENTS)) && > netif_rx_reschedule(dev, 0)) > goto repoll; > > return 0; > } > > return 1; > > so every receive completion will count against the limit set by the > variable max. The only way I could see the driver staying in the poll > routine for a long time would be if it was only processing send > completions, but even that doesn't actually seem bad: the driver is > making progress handling completions. > Is it possible that when one gets into the "rotting packet" case, the quota is at or close to 0 (on ehca). If in the cass it is 0 and netif_rx_reschedule() case wins (over netif_rx_schedule()) then it keeps spinning unable to process any packets since the undo parameter for netif_reschedule() is 0. 
If netif_rx_reschedule() keeps winning for a few iterations then the receive queues get full and dropping packets, thus causing a loss in performance. If this is indeed the case, then one option to try out may be is to change the undo parameter of netif_rx_rechedule()to either IB_WC or even dev->weight. > Shirley> It does help the performance from 1XXMb/s to 7XXMb/s, but > Shirley> not as expected 3XXXMb/s. > > Is that 3xxx Mb/sec the performance you see without the NAPI patch? > > Shirley> With the defer rotting packet process patch, I can see > Shirley> packets out of order problem in TCP layer. Is it > Shirley> possible there is a race somewhere causing two napi polls > Shirley> in the same time? mthca seems to use irq auto affinity, > Shirley> but ehca uses round-robin interrupt. > > I don't see how two NAPI polls could run at once, and I would expect > worse effects from them stepping on each other than just out-of-order > packets. However, the fact that ehca does round-robin interrupt > handling might lead to out-of-order packets just because different > CPUs are all feeding packets into the network stack. > > - R. > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From halr at voltaire.com Thu Nov 16 07:27:40 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Nov 2006 17:27:40 +0200 Subject: [openib-general] OpenSM log growing too big References: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> <455BD985.3010900@3leafnetworks.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944A5@taurus.voltaire.com> Not sure what question you are asking exactly. Is it what do those messages mean or the file getting large or both ? What options are you using on OpenSM startup ? 
Also, any chance you can move forward on a more recent and better OpenSM ? -- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Venkatesh Babu Sent: Wed 11/15/2006 10:22 PM To: openib-general at openib.org Cc: Venkatesh Babu Subject: [openib-general] OpenSM log growing too big I have OFED 1.0 stack and running OpenSM on a server connected to a IB subnet with couple of nodes. Usually the log file size is small. But ocassionally it is growing too big and filling up the whole hard disk. [root at vortex3l-88 ~]# ls -l /var/log/opensm* -rw-r--r-- 1 root root 33879121502 Nov 15 14:54 /var/log/opensm.log Most of the opensm.log file is filled with following messages. Out of 240,168,770 lines of log file 239,782,972 lines are from this __osm_trap_rcv_process_request. Nov 14 13:59:35 273746 [42803960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 127 times consecutively Nov 14 13:59:35 273908 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733372Nov 14 13:59:35 273966 [41401960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 128 times consecutively Nov 14 13:59:35 274176 [41E02960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733373Nov 14 13:59:35 274234 [41E02960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 129 times consecutively Nov 14 13:59:35 274380 [43204960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733374Nov 14 13:59:35 274436 [43204960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 130 times consecutively Nov 14 13:59:35 274662 [42803960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733375Nov 14 13:59:35 274720 [42803960] -> __osm_trap_rcv_process_request: ERR 3804: 
Received trap 131 times consecutively Nov 14 13:59:35 274970 [41401960] -> __osm_trap_rcv_process_request: Received Generic Notice type:0x01 num:128 Producer:2 from LID:0x0005 TID:0x0000000009733376Nov 14 13:59:35 275026 [41401960] -> __osm_trap_rcv_process_request: ERR 3804: Received trap 132 times consecutively _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From mshefty at ichips.intel.com Thu Nov 16 07:28:18 2006 From: mshefty at ichips.intel.com (Sean Hefty) Date: Thu, 16 Nov 2006 07:28:18 -0800 Subject: [openib-general] [PATCH] IB/ucm: fix deadlock in cleanup In-Reply-To: <20061115094348.GA19619@mellanox.co.il> References: <20061115094348.GA19619@mellanox.co.il> Message-ID: <455C8392.4030908@ichips.intel.com> Michael S. Tsirkin wrote: > ib_ucm_cleanup_events has file_mutex while calling ib_destroy_cm_id. > It seems this can deadlock since ib_destroy_cm_id flushes event > handlers, and ib_ucm_event_handler needs file_mutex, too. > > Signed-off-by: Michael S. Tsirkin > > --- > > I'll be testing the following the next night - but it seems a right > thing to do regardless of whether it fixes the issues I reported > earlier. Sean, does this make sense to you? If yes, please ack for 2.6.19. Yes - this looks right to me. Acked-by: Sean Hefty > diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c > index ad4f4d5..0128288 100644 > --- a/drivers/infiniband/core/ucm.c > +++ b/drivers/infiniband/core/ucm.c > @@ -161,12 +161,14 @@ static void ib_ucm_cleanup_events(struct > struct ib_ucm_event, ctx_list); > list_del(&uevent->file_list); > list_del(&uevent->ctx_list); > + mutex_unlock(&ctx->file->file_mutex); > > /* clear incoming connections. 
*/ > if (ib_ucm_new_cm_id(uevent->resp.event)) > ib_destroy_cm_id(uevent->cm_id); > > kfree(uevent); > + mutex_lock(&ctx->file->file_mutex); > } > mutex_unlock(&ctx->file->file_mutex); > } > > > From mst at mellanox.co.il Thu Nov 16 07:33:29 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Nov 2006 17:33:29 +0200 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: References: Message-ID: <20061116153329.GF8811@mellanox.co.il> Quoting r. Roland Dreier : > > I would really like to understand why ehca does worse with NAPI. In > my tests both mthca and ipath exhibit various degrees of improvement > depending on the test -- but I've never seen performance get worse. > This is the main thing holding back merging NAPI. Documentation/netowkring/NAPI_HOWTO.txt says: APPENDIX 3: Scheduling issues As seen NAPI moves processing to softirq level. Linux uses the ksoftirqd as the general solution to schedule softirq's to run before next interrupt and by putting them under scheduler control. Also this prevents consecutive softirq's from monopolize the CPU. This also have the effect that the priority of ksoftirq needs to be considered when running very CPU-intensive applications and networking to get the proper balance of softirq/user balance. Increasing ksoftirq priority to 0 (eventually more) is reported cure problems with low network performance at high CPU load. So I wonder 1. Was this tried? Its clear that we have high CPU load. 2. Could this be the reason that e.g. e1000 disables NAPI by default? The issue seem sufficiently tricky that we may yet find ourselves debugging NAPI performance problems in the field. Maybe we still need a module option ... 
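For anyone who wants to try the ksoftirqd suggestion quoted from the HOWTO, here is a minimal user-space sketch of changing a process's nice value. It assumes the caller already knows the PID of the ksoftirqd thread in question (looking it up is not shown), and the helper names set_nice() and get_nice() are made up for illustration; only setpriority() and getpriority() are real calls.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>

/*
 * Set the nice value of one process (pid 0 means the caller).
 * Returns 0 on success or a positive errno value on failure.
 * Lowering the nice value (raising priority) normally needs root
 * or CAP_SYS_NICE.
 */
int set_nice(pid_t pid, int nice_val)
{
	if (setpriority(PRIO_PROCESS, (id_t)pid, nice_val) != 0)
		return errno;
	return 0;
}

/*
 * Read the nice value of one process into *out.
 * getpriority() can legitimately return -1, so errno has to be
 * cleared first and checked afterwards.
 */
int get_nice(pid_t pid, int *out)
{
	int v;

	errno = 0;
	v = getpriority(PRIO_PROCESS, (id_t)pid);
	if (v == -1 && errno != 0)
		return errno;
	*out = v;
	return 0;
}
```

Calling set_nice(pid, 0) as root with the PID of a ksoftirqd thread would apply the "priority 0" setting the HOWTO mentions; whether that helps IPoIB with NAPI is exactly the open question here.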
-- MST

From dotanb at dev.mellanox.co.il Thu Nov 16 08:00:22 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Thu, 16 Nov 2006 18:00:22 +0200 (IST)
Subject: [openib-general] Question about the query QP mask
Message-ID: <44208.194.90.237.34.1163692822.squirrel@dev.mellanox.co.il>

Hi.

In the file ib_verbs, the description of the verb ib_query_qp says:
"The qp_attr_mask may be used to limit the query to gathering only the selected attributes."

I checked the low level drivers of all of the HCAs, and only the eHCA actually behaves like this (and sets ONLY the masked attributes).

What should the expected behavior be? Should this description be changed, or should the low level drivers of mthca and ipath be changed?

thanks
Dotan

From thomas.bub at thomson.net Thu Nov 16 08:02:48 2006
From: thomas.bub at thomson.net (Bub Thomas)
Date: Thu, 16 Nov 2006 17:02:48 +0100
Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client?
Message-ID:

Here is my next and hopefully last problem.

As described earlier I'm connecting a gen2 x86 client to a gen1 PowerPC server.
After having sorted out the trouble with the CM parameters I'm now having trouble with RDMA read from the client on the server.

What works is:
- gen2 x86 client doing a VAPI_SEND to gen1 PowerPC server. (this wasn't working last time)
- RDMA write from gen1 PowerPC server to gen2 x86 client

What is not working is:
- RDMA read from gen2 x86 client on gen1 PowerPC server.

I'm getting a vendor_error 0x81 VAPI_RETRY_EXC_ERR in the send completion queue.
The RDMA start address, length and key have been exchanged and look identical on both sides.

Doing connections and transfers between x86 only (gen1 server, x86 gen2 client) works in all directions. (Send and receive as well as RDMA read and write)
So a gen2 client can do an RDMA read from a gen1 server!

Having a gen1 PowerPC server and a gen1 x86 client works as well.
So a gen1 PowerPC server can be RDMA read from an x86 client!

I'm again a little puzzled: what can the gen2 server do wrong in an RDMA read on a PowerPC server when it can do the same operation on an x86 server?

Any ideas, thoughts, help.... are more than welcome

Thanks
Thomas

Here is the code I'm using to do RDMA. I always have only a single segment to transmit!

rdma(ibv_sge *sgList, int sgListlen, int size, bool write)
{
    struct ibv_send_wr wr;
    struct ibv_send_wr *bad_wr;
    int res;
    int localErrno = 0;
    uint64_t remainingBytes = ntohl(_remoteBufferInfo->totalSize);

    sgList[0].length = remainingBytes;

    memset(&wr, 0, sizeof(wr));
    wr.next = NULL;
    wr.wr_id = 1;
    wr.opcode = write ? IBV_WR_RDMA_WRITE : IBV_WR_RDMA_READ;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.sg_list = sgList;
    wr.num_sge = 1;
    wr.wr.rdma.remote_addr = ntohll(_remoteBufferInfo->sgList[0].addr);
    wr.wr.rdma.rkey = ntohl (_remoteBufferInfo->sgList[0].lkey);

    cleanCq(_sCq);
    res = ibv_post_send(_dataQp, &wr, &bad_wr);
    if (res != 0) {
        DEBUG1("Error in RDMA operation scheduling: %s\n", strerror(res));
        sv2BreakConnection();
        localErrno = ENOTCONN;
        return 0;
    } else {
        if (waitOnCq(_sCq)) {
            localErrno = -1;
        }
    }
}

............................................................
Thomas Bub
Grass Valley Germany GmbH
Brunnenweg 9
64331 Weiterstadt, Germany
Tel: +49 6150 104 147
Fax: +49 6150 104 656
Email: Thomas.Bub at thomson.net
www.GrassValley.com
............................................................

From mst at mellanox.co.il Thu Nov 16 08:07:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 16 Nov 2006 18:07:47 +0200
Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client?
In-Reply-To:
References:
Message-ID: <20061116160747.GH8811@mellanox.co.il>

> Subject: What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client?
> 
> Here is my next and hopefully last problem.
> 
> As described earlier I'm connecting a gen2 x86 client to a gen1 PowerPC server

Endian-ness issues?

-- MST

From ramachandra.kuchimanchi at qlogic.com Thu Nov 16 08:35:46 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi)
Date: Thu, 16 Nov 2006 10:35:46 -0600
Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx)
References: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com>
Message-ID:

>> If you think these patches are good enough, could you please create a
>> branch in your git tree based on for-2.6.20 for this code ?

> Yes, I will create a vex branch for this in my tree. However, moving
> this further upstream will depend on getting a real review of the
> code, and some sort of protocol document will probably be required for
> anyone to wade through this...
>
> - R.

Thanks, Roland. I understand that the code has to go through several rounds of review before it is moved upstream.

Would the branch that you create be in sync with the for-2.6.20 branch? That way I can keep the code in sync with the latest changes. Also, is the branch already created? I tried to update my copy of the tree but could not see a vex branch.

Please note that this driver is a device driver for a remote device, and the communication between the driver and the device is like that of any other device driver; it's just that this driver uses IB as its bus whereas others use PCI etc. The entire communication between the driver and VEx can be understood from a reading of the code.
To make it simpler to understand the code, I am providing a small note about the terminology and code organization: Each virtual NIC has a netpath, which is an abstraction of a connection to the VEx. Each netpath has a viport, the virtual port, which is an abstraction of the control and data IB connections through which control and data messages are exchanged. The control messages which are exchanged can be seen in vnic_control_pkt.h (patch 4). The data messages are nothing but transfer of data itself.(patch 5) The series of functions that are called in viport_statemachine() in vnic_viport.c (patch 3) are a good starting point to understand the control path. In this, establishment of control and data IB connections is done in viport_handle_init_states(). After that, the sequence of request and response messages that are exchanged can be seen in viport_handle_control_states(), viport_handle_data_states(), viport_handle_xchgpool_states(), viport_handle_idle_states() and so on. The code flow of the driver itself begins when the VEx information is written to the sysfs file create_primary. [vnic_create_primary() in vnic_sys.c (patch 8)] Regards, Ram -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at opengridcomputing.com Thu Nov 16 08:38:03 2006 From: tom at opengridcomputing.com (Tom Tucker) Date: Thu, 16 Nov 2006 10:38:03 -0600 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: Message-ID: The race is that you've deleted the work queue element that is enqueued on the iwcm_wq. It's as simple as that. To prove it to yourself, apply your patch. Turn on memory debug support in the kernel and recompile your code. Then run rdma_krping clients in four different threads against your server with an I/O count of 1. You'll hit the race and can look at it yourself. I don't know any better way to explain it... Sorry. 
On 11/13/06 10:44 PM, "Krishna Kumar2" wrote: > Hi Tom, > >> No, to understand why go look at the implementation of queue_work. BTW, > this > > I was describing the implementation of queue_work() in my > previous mail. So sorry to be dense, but I do not understand > why this patch introduces a race. Can you explain the race > that you had found ? What I understood of queue_work() is : > > If cm_work_handler() is already running and processing the > last entry at the same time this new entry was added, it is > guaranteed to find this new entry in it's current run iteration, > and process it. The only issue is with the extra queue_work > by iwcm parallely on a different cpu for the same case. > > So if iwcm had done a redundant "queue_work" on this queue, > which, besides adding the new entry to the workqueue, also > does a wakeup of "worker_thread" (which is still running the > previous iteration of run_workqueue -> cm_work_handler). > I am assuming that the wake up function is > default_wake_function(), since I couldn't locate in wait* code > where this is initialized. > > When cm_work_handler finishes removing this new entry, it > returns to worker_thread, which will do a schedule() and > sleep till it is woken up again (since default_wake_function > found that the thread is already running and had done > nothing). Are you referring to a race where the queue_work > is done between the time cm_work_handler finished running > and before it gets back to schedule ? I feel that should not > matter as the run_workqueue() will find this entry in it's > cwq->worklist and continue processing instead of exiting to > worker_thread() and schedule(). 
> > Still confused about the race :) > > Thanks, > > - KK From monis at voltaire.com Thu Nov 16 09:50:24 2006 From: monis at voltaire.com (Moni Shoua) Date: Thu, 16 Nov 2006 19:50:24 +0200 Subject: [openib-general] [PATCH] IB/mthca: HCA profile module parameters In-Reply-To: References: <455B4230.6070101@voltaire.com> <455B42CB.7030008@voltaire.com> Message-ID: <455CA4E0.4000608@voltaire.com> Roland Dreier wrote: >The patch is line-wrapped and bizarrely corrupted and won't apply, eg: > > > + mthca_warn(mdev, "num_qp rounded to power of 2 (%d).\n", > > + default_profile.num_qp); + } > >This is completely unnecessary: > > > +#define to_up_power_of_2(x) (x = roundup_pow_of_two(x)) > >...just open code this. > >And this seems strange: > > > +#define is_power_of_2(x) (x>0 &&(x & (x - 1))) > >so there's no warning if someone passes in a negative value?? and >it's backwards too, (x & (x - 1)) is 0 precisely for the powers of 2. >Was this patch tested at all? > >Anyway, all this > > > + if (!is_power_of_2(default_profile.num_qp)){ > > + to_up_power_of_2(default_profile.num_qp); > > + mthca_warn(mdev, "num_qp rounded to power of 2 (%d).\n", > > + default_profile.num_qp); + } > >seems very repetive. Can't it be wrapped up in a function so we just >do something like > > mthca_check_profile_value(&default_profile.num_qp); > mthca_check_profile_value(&default_profile.rdb_per_qp); > mthca_check_profile_value(&default_profile.num_cq); > >etc. > > - R. > > > Thanks for the comments Lines became wrapped because I used a "wrong" email client. I'll re-submit with another client but this would be in a new thread because I still have problems reading mail with it and therefore I can't reply to this thread. Sorry for the bother... The patch was tested but unfortunately I sent the wrong one (not the final). The new version is the one I should have sent + changes according to the comments here. 
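As a sketch of the check discussed in the review comments above (a power of two has exactly one bit set, so x & (x - 1) is zero), the following standalone C shows one way the validation helper could look. round_up_pow2() is only a user-space stand-in for the kernel's roundup_pow_of_two(), and check_profile_value() is a hypothetical name used for illustration, not necessarily what the resubmitted patch does.

```c
#include <assert.h>
#include <limits.h>

/*
 * Nonzero iff x is a positive power of two. Such a value has exactly
 * one bit set, so clearing the lowest set bit with x & (x - 1) gives 0.
 * The x > 0 test rejects zero and negative values.
 */
int is_power_of_2(int x)
{
	return x > 0 && (x & (x - 1)) == 0;
}

/*
 * Smallest power of two that is >= x.
 * User-space stand-in for the kernel's roundup_pow_of_two().
 */
int round_up_pow2(int x)
{
	int p = 1;

	/* the result must still fit in a positive int */
	assert(x > 0 && x <= INT_MAX / 2 + 1);
	while (p < x)
		p <<= 1;
	return p;
}

/*
 * Hypothetical helper: force *val to a positive power of two.
 * Non-positive values fall back to def, other values are rounded up.
 * Returns nonzero if *val was changed, so the caller can print a warning.
 */
int check_profile_value(int *val, int def)
{
	int old = *val;

	if (old <= 0)
		*val = def;
	else if (!is_power_of_2(old))
		*val = round_up_pow2(old);
	return old != *val;
}
```

With a helper like this, each profile field needs one call plus a warning when the return value is nonzero, which avoids repeating the test-and-round logic per parameter.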
thanks MoniS From monis at voltaire.com Thu Nov 16 10:45:26 2006 From: monis at voltaire.com (Moni Shoua) Date: Thu, 16 Nov 2006 20:45:26 +0200 (IST) Subject: [openib-general] [PATCH v2] IB_mthca HCA profile module parameters Message-ID: From: Leonid Arsh Mleonida at voltaire.com> Adds module parameters that enable settting some of the HCA profile values Signed-off-by: Leonid Arsh Signed-off-by: Moni Shoua --- mthca_main.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 files changed, 128 insertions(+), 11 deletions(-) --- mthca_main.c.orig 2006-11-14 22:07:58.000000000 -0500 +++ mthca_main.c 2006-11-16 11:27:17.683513163 -0500 @@ -80,21 +80,134 @@ module_param(tune_pci, int, 0444); MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero"); +#define MTHCA_DEFAULT_NUM_QP (1 << 16) +#define MTHCA_DEFAULT_RDB_PER_QP (1 << 2) +#define MTHCA_DEFAULT_NUM_CQ (1 << 16) +#define MTHCA_DEFAULT_NUM_MCG (1 << 13) +#define MTHCA_DEFAULT_NUM_MPT (1 << 17) +#define MTHCA_DEFAULT_NUM_MTT (1 << 20) +#define MTHCA_DEFAULT_NUM_UDAV (1 << 15) +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18) +#define MTHCA_DEFAULT_NUM_UARC_SIZE (1 << 18) + +static struct mthca_profile default_profile = { + .num_qp = MTHCA_DEFAULT_NUM_QP, + .rdb_per_qp = MTHCA_DEFAULT_RDB_PER_QP, + .num_cq = MTHCA_DEFAULT_NUM_CQ, + .num_mcg = MTHCA_DEFAULT_NUM_MCG, + .num_mpt = MTHCA_DEFAULT_NUM_MPT, + .num_mtt = MTHCA_DEFAULT_NUM_MTT, + .num_udav = MTHCA_DEFAULT_NUM_UDAV, /* Tavor only */ + .fmr_reserved_mtts = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */ + .uarc_size = MTHCA_DEFAULT_NUM_UARC_SIZE, /* Arbel only */ +}; + +module_param_named(num_qp, default_profile.num_qp, int, 0444); +MODULE_PARM_DESC(num_qp, "maximum number of available QPs per HCA"); + +module_param_named(rdb_per_qp, default_profile.rdb_per_qp, int, 0444); +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP"); + +module_param_named(num_cq, default_profile.num_cq, int, 0444); 
+MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA"); + +module_param_named(num_mcg, default_profile.num_mcg, int, 0444); +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA"); + +module_param_named(num_mpt, default_profile.num_mpt, int, 0444); +MODULE_PARM_DESC(num_mpt, + "maximum number of memory protection pable entries per HCA"); + +module_param_named(num_mtt, default_profile.num_mtt, int, 0444); +MODULE_PARM_DESC(num_mtt, + "maximum number of memory translation table segments per HCA"); +/* Tavor only */ +module_param_named(num_udav, default_profile.num_udav, int, 0444); +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA"); + +/* Tavor only */ +module_param_named(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, int, 0444); +MODULE_PARM_DESC(fmr_reserved_mtts, + "number of memory translation table segments reserved for FMR"); + static const char mthca_version[] __devinitdata = DRV_NAME ": Mellanox InfiniBand HCA driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; -static struct mthca_profile default_profile = { - .num_qp = 1 << 16, - .rdb_per_qp = 4, - .num_cq = 1 << 16, - .num_mcg = 1 << 13, - .num_mpt = 1 << 17, - .num_mtt = 1 << 20, - .num_udav = 1 << 15, /* Tavor only */ - .fmr_reserved_mtts = 1 << 18, /* Tavor only */ - .uarc_size = 1 << 18, /* Arbel only */ -}; +#define is_power_of_2(x) (!(x & (x - 1))) + +static int __devinit mthca_check_profile_value(int* pval,int pval_default){ + /* value must be positive and power of 2 */ + int old_pval = *pval; + if (old_pval <= 0) { + *pval = pval_default; + } else if (!is_power_of_2(old_pval)) { + *pval = roundup_pow_of_two(old_pval); + } + return old_pval-*pval; +} + +static int __devinit mthca_validate_profile(struct mthca_dev *mdev, + struct mthca_profile *profile) +{ + if (mthca_check_profile_value(&default_profile.num_qp, + MTHCA_DEFAULT_NUM_QP)){ + mthca_warn(mdev,"invalid num_qp passed. 
changed to %d.\n", + default_profile.num_qp); + } + + if (mthca_check_profile_value(&default_profile.rdb_per_qp, + MTHCA_DEFAULT_RDB_PER_QP)){ + mthca_warn(mdev,"invalid rdb_per_qp passed. changed to %d\n", + default_profile.rdb_per_qp); + } + + if (mthca_check_profile_value(&default_profile.num_cq, + MTHCA_DEFAULT_NUM_CQ)){ + mthca_warn(mdev,"invalid num_cq passed. changed to %d\n", + default_profile.num_cq); + } + + if (mthca_check_profile_value(&default_profile.num_mcg, + MTHCA_DEFAULT_NUM_MCG)){ + mthca_warn(mdev,"invalid num_mcg passed. changed to %d\n", + default_profile.num_mcg); + } + if (mthca_check_profile_value(&default_profile.num_mpt, + MTHCA_DEFAULT_NUM_MPT)){ + mthca_warn(mdev,"invalid num_mpt passed. changed to %d\n", + default_profile.num_mpt); + } + + if (mthca_check_profile_value(&default_profile.num_mtt, + MTHCA_DEFAULT_NUM_MTT)){ + mthca_warn(mdev,"invalid num_mtt passed. changed to %d\n", + default_profile.num_mtt); + } + + if (mthca_is_memfree(mdev)) { + if (mthca_check_profile_value(&default_profile.num_udav, + MTHCA_DEFAULT_NUM_UDAV)){ + mthca_warn(mdev,"invalid num_udav passed. changed to %d\n", + default_profile.num_udav); + } + + if (mthca_check_profile_value(&default_profile.fmr_reserved_mtts, + MTHCA_DEFAULT_NUM_RESERVED_MTTS)){ + mthca_warn(mdev,"invalid fmr_reserved_mtts passed. changed to %d\n", + default_profile.fmr_reserved_mtts); + } + if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) { + mthca_err(mdev,"Invalid fmr_reserved_mtts parameter" + "value (%d). 
Must be lower then num_mtt (%d)\n", + default_profile.fmr_reserved_mtts, + default_profile.num_mtt ); + return -EINVAL; + } + } + + return 0; +} static int __devinit mthca_tune_pci(struct mthca_dev *mdev) { @@ -1095,6 +1208,10 @@ if (err) goto err_cmd; + err = mthca_validate_profile(mdev, &default_profile); + if (err) + goto err_cmd; + err = mthca_init_hca(mdev); if (err) goto err_cmd; From rdreier at cisco.com Thu Nov 16 09:59:32 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 09:59:32 -0800 Subject: [openib-general] [PATCH] IB/SRP - increase supported CDB size In-Reply-To: <87ac2t0wcb.fsf@confield.dd.xiranet.com> (Arne Redlich's message of "Wed, 15 Nov 2006 12:43:00 +0100") References: <87ac2t0wcb.fsf@confield.dd.xiranet.com> Message-ID: Definitely makes sense. I queued the following version for 2.6.20, which gets the max CDB size directly from struct srp_cmd. Does this look OK to you? Thanks, Roland diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 4b09147..01776c9 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1716,7 +1716,8 @@ static ssize_t srp_create_target(struct if (!target_host) return -ENOMEM; - target_host->max_lun = SRP_MAX_LUN; + target_host->max_lun = SRP_MAX_LUN; + target_host->max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)->cdb; target = host_to_target(target_host); From tziporet at dev.mellanox.co.il Thu Nov 16 09:59:35 2006 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 16 Nov 2006 12:59:35 -0500 Subject: [openib-general] I need your help. In-Reply-To: <20061116090647.81886.qmail@web92002.mail.cnb.yahoo.com> References: <20061116090647.81886.qmail@web92002.mail.cnb.yahoo.com> Message-ID: <455CA707.6020605@dev.mellanox.co.il> 强 马 wrote: > Hello! sir. > I 've been developing my mpich projects on infiniband cluster for two > months. 
> $ ibstat > CA type: MT25204 > Number of ports: 1 > Firmware version: 1.1.0 > Hardware version: a0 > Node GUID: 0xe865620060529997 > System image GUID: 0xe86562006052999a > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 82 > LMC: 0 > SM lid: 82 > Capability mask: 0x02510a6a > Port GUID: 0xe865620060529998 > I've downloaded << Mellanox IB-Verbs API (VAPI) >>, but I works on > openib version. > Would you mind telling me where I can download the API manual about > OpenIB? > thank you in advance. > There is no API manual for openib, however Dotan has started MAN pages. See his email on the list with the man pages attached Tziporet From rdreier at cisco.com Thu Nov 16 10:02:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 10:02:42 -0800 Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) In-Reply-To: (Ramachandra Kuchimanchi's message of "Thu, 16 Nov 2006 10:35:46 -0600") References: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> Message-ID: > Would the branch that you create be in sync with the for-2.6.20 branch ? > That way I can keep the code in sync with the latest changes. > Also is the branch already created ? I tried to update my copy of the tree > but could not see a vex branch. I have not created the branch yet. I will probably be able to do it today. The way I will do it is to make a branch from the "master" branch so that your patches should just be on top of Linus's tree. > Please note that this driver is a device driver for a remote device and > the communication between the driver and the device is like any other > device driver, its just that this driver uses IB as its bus where > as others use PCI etc. OK... > Each virtual NIC has a netpath, which is an abstraction of a connection to > the VEx. 
Each netpath has a viport, the virtual port, which is an abstraction > of the control and data IB connections through which control and data messages > are exchanged. ...but this seems like over-abstraction that makes the code harder to understand. - R. From rdreier at cisco.com Thu Nov 16 10:04:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 10:04:36 -0800 Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client? In-Reply-To: (Bub Thomas's message of "Thu, 16 Nov 2006 17:02:48 +0100") References: Message-ID: > I'm again a little puzzled what can the gen2 server do wrong in a RDMA > read on a PowerPC server when it can do the same operation a x86 > server? Do you have any endian conversion bugs? > wr.wr.rdma.rkey = ntohl (_remoteBufferInfo->sgList[0].lkey); The R_Key you use should be the remote side's R_Key, not the L_Key (_local_ key) that it has. This doesn't matter for HCAs where they are the same, but it's better to do things correctly... - R. From thomas.bub at thomson.net Thu Nov 16 10:09:34 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Thu, 16 Nov 2006 19:09:34 +0100 Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client? Message-ID: Endian-ness issues? Yes that is the first thought. But where? Since my gen2 x86 rdma code can do an rdma read from a gen1 and gen2 x86 server I think the only values in the ibv_send_wr that can be wrong talking to a PowerPC server can be remote_addr and rkey right I already swapped both but without success. Are there other places in the ibv_send_wr or the underlying code that might be endian-ness fooled? Since I can do a VAPI_SEND (non RDMA) from the gen2 x86 client to the gen1 PowerPC server I think the qp should be OK? Is there something RDMA READ specific in the qp that still might not be right after my CM connection from gen2 to gen1? 
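One way to rule out byte-order mistakes in the exchanged buffer descriptor is to encode and decode it byte by byte, so the result does not depend on whether the host is x86 or PowerPC. The sketch below assumes a made-up wire layout (an 8-byte address followed by a 4-byte key, both big-endian); struct wire_buf_info and the get_/put_ helpers are illustrative names and are not taken from the code in this thread.

```c
#include <stdint.h>

/* Hypothetical wire format: both fields big-endian (network order). */
struct wire_buf_info {
	unsigned char addr[8];	/* remote virtual address */
	unsigned char rkey[4];	/* remote key */
};

/* Decode a big-endian 64-bit value; same result on any host. */
uint64_t get_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Decode a big-endian 32-bit value; same result on any host. */
uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Encode a 64-bit value as big-endian bytes, most significant first. */
void put_be64(unsigned char *p, uint64_t v)
{
	int i;

	for (i = 7; i >= 0; i--) {
		p[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
}

/* Encode a 32-bit value as big-endian bytes, most significant first. */
void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}
```

If both sides fill and read the descriptor only through helpers like these, the address and key cannot end up swapped on one architecture and not the other, which narrows the search to the QP and CM parameters.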
Don't forget the RDMA WRITE from the gen1 PowerPC server to the gen2 x86 client on the same qp works just before the RDMA READ from gen2 x86 client on the gen1 PowerPC server fails. Still confusing. Thomas From dotanb at dev.mellanox.co.il Thu Nov 16 10:12:25 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Thu, 16 Nov 2006 20:12:25 +0200 (IST) Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client? In-Reply-To: <20061116160747.GH8811@mellanox.co.il> References: <20061116160747.GH8811@mellanox.co.il> Message-ID: <1235.85.65.223.132.1163700745.squirrel@dev.mellanox.co.il> >> Subject: What could prevent a gen2 x86 client qp from doing RDMA_READ on >> a gen1 PowerPC client? >> >> Here is my next and hopefully last problem. >> >> As described earlier I’m connecting a gen2 x86 clients to a gen1 >> PowerPC server > > Endian-ness issues? > How do you connect the QPs? if you are using CM i think there is an endianess inconsisstency between the CMs. i suggest you to check the following attributes between the two sides: rq_psn sq_psn dlid dest_qp_num problem in one of those attributes can cause the retry exceeded you got ... Dotan From mst at mellanox.co.il Thu Nov 16 10:14:12 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Thu, 16 Nov 2006 20:14:12 +0200 Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client? In-Reply-To: References: Message-ID: <20061116181412.GP8811@mellanox.co.il> Hmm. Maybe responder resources/initiator depth are set incorrectly at one of the sides? These must match and affect only reads and atomics. Quoting r. Bub Thomas : Subject: Re: What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client? Endian-ness issues? Yes that is the first thought. But where? 
Since my gen2 x86 rdma code can do an rdma read from a gen1 and gen2
x86 server I think the only values in the ibv_send_wr that can be wrong
talking to a PowerPC server can be remote_addr and rkey right
I already swapped both but without success.

Are there other places in the ibv_send_wr or the underlying code that
might be endian-ness fooled?
Since I can do a VAPI_SEND (non RDMA) from the gen2 x86 client to the
gen1 PowerPC server I think the qp should be OK?

Is there something RDMA READ specific in the qp that still might not be
right after my CM connection from gen2 to gen1?
Don't forget the RDMA WRITE from the gen1 PowerPC server to the gen2 x86
client on the same qp works just before the RDMA READ from gen2 x86
client on the gen1 PowerPC server fails.

Still confusing.
Thomas

_______________________________________________
openib-general mailing list
openib-general at openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-- MST

From dotanb at dev.mellanox.co.il Thu Nov 16 10:17:44 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Thu, 16 Nov 2006 20:17:44 +0200 (IST)
Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client?
In-Reply-To:
References:
Message-ID: <1267.85.65.223.132.1163701064.squirrel@dev.mellanox.co.il>

> Endian-ness issues?
> Yes that is the first thought.
>
> But where?
> Since my gen2 x86 rdma code can do an rdma read from a gen1 and gen2
> x86 server I think the only values in the ibv_send_wr that can be wrong
> talking to a PowerPC server can be remote_addr and rkey right
> I already swapped both but without success.
>
> Are there other places in the ibv_send_wr or the underlying code that
> might be endian-ness fooled?
> Since I can do a VAPI_SEND (non RDMA) from the gen2 x86 client to the
> gen1 PowerPC server I think the qp should be OK?
>
> Is there something RDMA READ specific in the qp that still might not be
> right after my CM connection from gen2 to gen1?
> Don't forget the RDMA WRITE from the gen1 PowerPC server to the gen2 x86
> client on the same qp works just before the RDMA READ from gen2 x86
> client on the gen1 PowerPC server fails.
>
> Still confusing.
> Thomas
>

Yes. There are 2 attributes in every QP that govern RDMA Read/Atomic
operations:
a) how many outstanding RDMA Read/Atomic operations the QP may issue as an initiator
b) how many outstanding RDMA Read/Atomic operations the QP may service as a target

The connectivity between QPs X and Y should be:
Xa = Yb
Xb = Ya

And of course RDMA Read needs to be enabled in the QP access permissions
and in the MR permissions ...

Dotan

From venkatesh.babu at 3leafnetworks.com Thu Nov 16 10:39:56 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Thu, 16 Nov 2006 10:39:56 -0800
Subject: [openib-general] OpenSM log growing too big
In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5018944A5@taurus.voltaire.com>
References: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> <455BD985.3010900@3leafnetworks.com> <5CE025EE7D88BA4599A2C8FEFCF226F5018944A5@taurus.voltaire.com>
Message-ID: <455CB07C.7080503@3leafnetworks.com>

Hal Rosenstock wrote:

> Not sure what question you are asking exactly.
>
> Is it what do those messages mean or the file getting large or both ?
>
>
Both. The message looks like LID 5 is generating too many events. The
log file grows by a few MB a second. Whatever the problem with the port
is, it should not generate this many log messages. I guess it is an
OpenSM bug.

>
> What options are you using on OpenSM startup ?
>
>
root 7703 0.0 0.0 92784 1652 ? Sl 05:00 0:01 /usr/bin/opensm -g 0x005045014ac20001 -p 11 -s 10 -u -f /var/log/opensm.log

>
> Also, any chance you can move forward on a more recent and better
> OpenSM ?
>
>
It is difficult to move to the OpenSM from OFED 1.1, because we would
need to do another QA verification cycle with our product.
But I can find the specific patch to the OpenSM I can apply that patch to the existing OpenSM. VBabu > > -- Hal > From adit.262 at gmail.com Thu Nov 16 10:50:01 2006 From: adit.262 at gmail.com (Adit Ranadive) Date: Thu, 16 Nov 2006 13:50:01 -0500 Subject: [openib-general] Setting up VLArbitration tables & SL2VLMapping Message-ID: Hi, I have installed the OFED 1.1 distribution and have a Mellanox 25208 HCA. I want to know if there is any particular program/application that would allow me to set the SL2VL mapping and VL arbitration tables for the HCA? Thanks, Adit ----- Adit Ranadive MS CS Candidate Georgia Institute of Technology, Atlanta, GA From rjwalsh at pathscale.com Thu Nov 16 11:11:43 2006 From: rjwalsh at pathscale.com (Robert Walsh) Date: Thu, 16 Nov 2006 11:11:43 -0800 Subject: [openib-general] Question about multicast GIDs In-Reply-To: <455BB9DD.8080500@pathscale.com> References: <455B9EA0.6070106@pathscale.com> <455BB9DD.8080500@pathscale.com> Message-ID: <455CB7EF.1090300@pathscale.com> Robert Walsh wrote: > Roland Dreier wrote: >> > Is there are registration authority for multicast GIDs? Or at >> least a > safe way of assigning a range of GIDs to a vendor? >> >> I don't think so. Perhaps RFC 3307 would be of some use... > > Ah - looks exactly like what I was looking for. Thanks. Hmm - spoke too soon. This seems to be related to IPv6 multicast GIDs, but not IB. The idea is similar, but the allocation mechanism is entirely arbitrary (but consistent) and I don't think it would map from IPv6 to IB in any meaningful way. I'll talk to the folks here who are on the various IB committees and see if they have any thoughts on this. Regards, Robert. 
From rdreier at cisco.com Thu Nov 16 11:22:02 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:22:02 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: (Pradeep Satyanarayana's message of "Thu, 16 Nov 2006 07:15:23 -0800") References: Message-ID: Pradeep> Is it possible that when one gets into the "rotting Pradeep> packet" case, the quota is at or close to 0 (on ehca). If Pradeep> in the cass it is 0 and netif_rx_reschedule() case wins Pradeep> (over netif_rx_schedule()) then it keeps spinning unable Pradeep> to process any packets since the undo parameter for Pradeep> netif_reschedule() is 0. It is possible that the quota is close to 0, but I don't see how the poll routine could spin with quota (the variable max) equal to 0. If max is 0, then the "while (max)" loop will never be entered, empty will remain 0, and the poll routine will simply fall through and return 1. Do you agree with that summary? We don't want the undo parameter of netif_rx_reschedule() to be non-zero because when we go back to repoll, done is reset to 0. So there's no reason to increase the quota again. I guess you could instrument how many iterations there are with a small value of max, but I would assume it's self-limiting, since the last few completions should appear fairly quickly. - R. From swise at opengridcomputing.com Thu Nov 16 11:26:10 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 16 Nov 2006 13:26:10 -0600 Subject: [openib-general] account on the new ofa server Message-ID: <1163705170.6286.29.camel@stevo-desktop> How do I get an account on the new ofa server? Thanks, Steve. 
From rdreier at cisco.com Thu Nov 16 11:26:31 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:26:31 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: (Shirley Ma's message of "Wed, 15 Nov 2006 10:13:15 -0800") References: Message-ID: > What I have found in ehca driver, n! = t, does't mean it's empty. If poll > again, there are still some packets in cq. IB_CQ_REPORT_mISSED_EVENTS most > of the time reports 1. It relies on netif_rx_reschedule() returns 0 to exit > napi poll. That might be the reason in poll routine for a long time? I will > rerun my test to use n! = 0 to see any difference here. Maybe there's an ehca bug in poll CQ? If n != t then it should mean that the CQ was indeed drained. I would expect a missed event would be rare, because it means a completion occurs between the last poll CQ and the request notify, and that shouldn't be that common... My rough estimate is that even at a higher throughput than what you're seeing, IPoIB should only generate ~ 500K completions/sec, which means the average delay between completions is 2 microseconds. So I wouldn't expect completions to hit the window between poll and request notify that often. - R. From rdreier at cisco.com Thu Nov 16 11:28:57 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:28:57 -0800 Subject: [openib-general] make ipoib_ib_dev_stop void? In-Reply-To: <20061114102402.GB27446@mellanox.co.il> (Michael S. Tsirkin's message of "Tue, 14 Nov 2006 12:24:02 +0200") References: <20061114102402.GB27446@mellanox.co.il> Message-ID: > Shouldn't ipoib_ib_dev_stop be void? Looks like it -- after all we never check the return value. 
From swise at opengridcomputing.com Thu Nov 16 11:31:09 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 16 Nov 2006 13:31:09 -0600 Subject: [openib-general] [Fwd: [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support] Message-ID: <1163705469.6286.34.camel@stevo-desktop> Hey Roland, What's the plan on this series? Do you plan on pulling these into your for-2.6.20 tree? (don't mean to push...just wondering if they're on track) Thanks, Steve. -------- Forwarded Message -------- From: Sean Hefty To: 'Roland Dreier' , openib-general at openib.org Subject: [openib-general] [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support Date: Tue, 24 Oct 2006 15:25:48 -0700 The following set of patches expand the rdma_cm support to include UD and multicast, and expose the rdma_cm to userspace. I would like to target the 2.6.20 kernel, but at least getting them into one or more branches would be helpful for other developers to test against these changes. As mentioned in the RFC, the patches borrow heavily from the code checked into openfabrics svn, but there are some notable differences. The main difference from the patches submitted for the RFC is the integration of the ib_multicast module with the ib_sa module. The two modules are loosely coupled, with minimal changes made to the existing sa_query code. Signed-off-by: Sean Hefty _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dotanb at dev.mellanox.co.il Thu Nov 16 11:33:49 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Thu, 16 Nov 2006 21:33:49 +0200 (IST) Subject: [openib-general] Setting up VLArbitration tables & SL2VLMapping In-Reply-To: References: Message-ID: <1491.85.65.223.132.1163705629.squirrel@dev.mellanox.co.il> Hi Adit. 
> Hi, > > I have installed the OFED 1.1 distribution and have a Mellanox 25208 > HCA. I want to know if there is any particular program/application > that would allow me to set the SL2VL mapping and VL arbitration tables > for the HCA? > > Thanks, > Adit Yes, there is a program that should do this and it called "SM" Subnet Manager. The SM need to configure all of the tabls in all of the nodes in the subnet, including the tables you mentioned. Dotan From rdreier at cisco.com Thu Nov 16 11:35:04 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:35:04 -0800 Subject: [openib-general] [PATCH v2 5/11] Implementation of Data path of the communication protocol In-Reply-To: <455A2677.7102.612A002@ramachandra.kuchimanchi.qlogic.com> (Ramachandra K.'s message of "Tue, 14 Nov 2006 20:26:31 +0530") References: <455A2677.7102.612A002@ramachandra.kuchimanchi.qlogic.com> Message-ID: While importing these patches, I got several "Space in indent is followed by a tab." errors. For example, the line > + __constant_cpu_to_be16(ETH_P_8021Q))) { which also leads to the comment that there's no reason for __constant_cpu_to_be16() here -- just use cpu_to_be16 and let the compiler do the optimization. (the __constant form is only needed in places where the function call is a syntax error) From rdreier at cisco.com Thu Nov 16 11:37:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:37:10 -0800 Subject: [openib-general] [Fwd: [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support] In-Reply-To: <1163705469.6286.34.camel@stevo-desktop> (Steve Wise's message of "Thu, 16 Nov 2006 13:31:09 -0600") References: <1163705469.6286.34.camel@stevo-desktop> Message-ID: > What's the plan on this series? Do you plan on pulling these into your > for-2.6.20 tree? I need to make time to read them over. And I would like to get some resolution for the IPoIB crashes that Mellanox sees before we commit to merging them into 2.6.20. 
From rdreier at cisco.com Thu Nov 16 11:41:20 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:41:20 -0800 Subject: [openib-general] [PATCH 01/13] Linux RDMA Core Changes In-Reply-To: <20061116035831.22635.95377.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 15 Nov 2006 21:58:32 -0600") References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035831.22635.95377.stgit@dell3.ogc.int> Message-ID: This looks completely sane to me, so I have no problem merging this stuff once the rest of the Chelsio-specific stuff is reviewed. - R. From rdreier at cisco.com Thu Nov 16 11:40:42 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:40:42 -0800 Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) In-Reply-To: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> (Ramachandra K.'s message of "Tue, 14 Nov 2006 20:20:33 +0530") References: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> Message-ID: OK, I just pushed out a "vex" branch with these patches in it. I noticed that you put your code under ulp/vnic -- that seems a little too generic to me, given that this is one particular proprietary vnic implementation. Maybe something like ulp/qlvex or something like that? And similarly for the config options -- probably something like CONFIG_INFINIBAND_QLOGIC_VEX would be better to avoid clashes. - R. From rdreier at cisco.com Thu Nov 16 11:42:56 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 11:42:56 -0800 Subject: [openib-general] Question about the query QP mask In-Reply-To: <44208.194.90.237.34.1163692822.squirrel@dev.mellanox.co.il> (dotanb@dev.mellanox.co.il's message of "Thu, 16 Nov 2006 18:00:22 +0200 (IST)") References: <44208.194.90.237.34.1163692822.squirrel@dev.mellanox.co.il> Message-ID: > What should be the expected behavior? 
> Should this description should be changed or should the low level drivers > of mthca and ipath need to be changed? The mask is used as a hint to the low-level driver about which attributes the consumer cares about. The driver may fill in more fields, but it can use the mask to optimize some calls, if filling in a particular field is expensive and that field is not requested by the consumer. I guess we should update the documentation to reflect this. - R. From rdreier at cisco.com Thu Nov 16 12:02:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 12:02:59 -0800 Subject: [openib-general] [PATCH v2 0/11] [RFC] Support for QLogic Virtual Ethernet I/O Controller (VEx) In-Reply-To: (Roland Dreier's message of "Thu, 16 Nov 2006 11:40:42 -0800") References: <455A2511.24576.60E2DB4@ramachandra.kuchimanchi.qlogic.com> Message-ID: Oh, one other things -- you probably want to add a MAINTAINERS entry for your driver so people know who to bother... - R. From rdreier at cisco.com Thu Nov 16 12:13:10 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 12:13:10 -0800 Subject: [openib-general] [PATCH v2] IB_mthca HCA profile module parameters In-Reply-To: (Moni Shoua's message of "Thu, 16 Nov 2006 20:45:26 +0200 (IST)") References: Message-ID: We seem to be making negative progress :( The patch is still corrupted, eg: > +module_param_named(num_mpt, default_profile.num_mpt, int, 0444); > +MODULE_PARM_DESC(num_mpt, + "maximum number of memory > protection pable entries per HCA"); Indentation is completely borken: > +static int __devinit mthca_check_profile_value(int* pval,int pval_default){ > + /* value must be positive and power of 2 */ > + int old_pval = *pval; No braces needed around one-statement blocks: > + if (old_pval <= 0) { > + *pval = pval_default; > + } else if (!is_power_of_2(old_pval)) { And that test is_power_of_2() is completely unnecessary -- just set *pval to roundup_pow_of_two unconditionally (and kill the is_power_of_2 
macro completely). > + if (mthca_check_profile_value(&default_profile.num_qp, > + MTHCA_DEFAULT_NUM_QP)){ > + mthca_warn(mdev,"invalid num_qp passed. changed to %d.\n", > + default_profile.num_qp); + } You should be able to create a macro that passes the name of the parameter in too, and move the if statement and the warning into mthca_check_profile_value... - R. From swise at opengridcomputing.com Thu Nov 16 12:18:58 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 16 Nov 2006 14:18:58 -0600 Subject: [openib-general] opensm problem Message-ID: <1163708338.6286.50.camel@stevo-desktop> I'm using 2.6.19-rc5 + sean's ucma patch series plus the latest userspace/management code from svn. I'm running mthca point to point between two servers. When I start opensm, it continually logs these messages: Nov 16 14:15:15 567111 [42003940] -> __osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 : 0x0000000000000016 from port 0x0002c902002147c9 I think this is some sort of mismatch between the mcast code in seans patch series and the management code maybe? Anybody seen this? Thanks, Steve. From halr at voltaire.com Thu Nov 16 13:19:00 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Nov 2006 23:19:00 +0200 Subject: [openib-general] DevCon "decision" on userspace Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AB@taurus.voltaire.com> Following Roland's lead on his userspace libraries (verbs and mthca), the DevCon decided that the userspace trunk will be moved to git with each component maintainer have a public tree with one or more branches to be pushed up to a git trunk. It is a requirement to import all the version history from svn and prune as appropriate. The timeframe for this is TBD. Any comments from maintainers and any consumers of the current userspace trunk ? 
-- Hal

From halr at voltaire.com Thu Nov 16 13:20:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 16 Nov 2006 23:20:00 +0200
Subject: [openib-general] Setting up VLArbitration tables & SL2VLMapping
References:
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AC@taurus.voltaire.com>

OpenSM supports setting up these tables. There is information on this in
the opensm man page.

-- Hal

________________________________

From: openib-general-bounces at openib.org on behalf of Adit Ranadive
Sent: Thu 11/16/2006 1:50 PM
To: openib-general at openib.org
Subject: [openib-general] Setting up VLArbitration tables & SL2VLMapping

Hi,

I have installed the OFED 1.1 distribution and have a Mellanox 25208
HCA. I want to know if there is any particular program/application
that would allow me to set the SL2VL mapping and VL arbitration tables
for the HCA?

Thanks,
Adit

-----
Adit Ranadive
MS CS Candidate
Georgia Institute of Technology,
Atlanta, GA

From halr at voltaire.com Thu Nov 16 13:28:51 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: Thu, 16 Nov 2006 23:28:51 +0200
Subject: [openib-general] opensm problem
References: <1163708338.6286.50.camel@stevo-desktop>
Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com>

Steve,

Those messages mean that you are joining a MC group which is not
already created. The MGID of 0xff12401bffff0000 : 0x0000000000000016
is for 224.0.0.22. That is for IGMP on your IPoIB subnet. The group
either needs to be preconfigured or the "first" joiner needs to create
the group (which requires more characteristics).

OpenSM already precreates some groups but not this one. This can be
added easily. Can it wait until next week ?
-- Hal ________________________________ From: openib-general-bounces at openib.org on behalf of Steve Wise Sent: Thu 11/16/2006 3:18 PM To: openib-general Subject: [openib-general] opensm problem I'm using 2.6.19-rc5 + sean's ucma patch series plus the latest userspace/management code from svn. I'm running mthca point to point between two servers. When I start opensm, it continually logs these messages: Nov 16 14:15:15 567111 [42003940] -> __osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 : 0x0000000000000016 from port 0x0002c902002147c9 I think this is some sort of mismatch between the mcast code in seans patch series and the management code maybe? Anybody seen this? Thanks, Steve. _______________________________________________ openib-general mailing list openib-general at openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Thu Nov 16 13:32:52 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 16 Nov 2006 15:32:52 -0600 Subject: [openib-general] opensm problem In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com> References: <1163708338.6286.50.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com> Message-ID: <1163712772.6286.62.camel@stevo-desktop> On Thu, 2006-11-16 at 23:28 +0200, Hal Rosenstock wrote: > Steve, > > Those messages mean that you are joining a MC group which is not > already created. The MGID iof 0xff12401bffff0000 : 0x0000000000000016 > is for 224.0.0.22. That is for IGMP on your IPoIB subnet. The group > either needs to be preconfigured or the "first" joiner needs to create > the group (which requires more characteristics). > > OpenSM already precreates some groups but not this one. This can be > added easily. 
Can it wait until next week ? > I guess, but I'm wondering what changed? This used to "just work" out of the box. Perhaps the IPoIB module in 2.6.19 isn't up to date? From halr at voltaire.com Thu Nov 16 13:37:34 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Nov 2006 23:37:34 +0200 Subject: [openib-general] opensm problem References: <1163708338.6286.50.camel@stevo-desktop><5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com> <1163712772.6286.62.camel@stevo-desktop> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944B0@taurus.voltaire.com> Steve, Did you configure the kernel differently ? Is IGMP turned on somehow ? (I haven't run with Sean's multicast code.) BTW, as I mentioned, this can be solved on the client side equally as well as the SM side. -- Hal ________________________________ From: Steve Wise [mailto:swise at opengridcomputing.com] Sent: Thu 11/16/2006 4:32 PM To: Hal Rosenstock Cc: openib-general at openib.org Subject: RE: [openib-general] opensm problem On Thu, 2006-11-16 at 23:28 +0200, Hal Rosenstock wrote: > Steve, > > Those messages mean that you are joining a MC group which is not > already created. The MGID iof 0xff12401bffff0000 : 0x0000000000000016 > is for 224.0.0.22. That is for IGMP on your IPoIB subnet. The group > either needs to be preconfigured or the "first" joiner needs to create > the group (which requires more characteristics). > > OpenSM already precreates some groups but not this one. This can be > added easily. Can it wait until next week ? > I guess, but I'm wondering what changed? This used to "just work" out of the box. Perhaps the IPoIB module in 2.6.19 isn't up to date? 
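The correspondence Hal points out, between 224.0.0.22 and the MGID 0xff12401bffff0000 : 0x0000000000000016, can be reproduced with a few lines of C. This is only a minimal sketch, assuming the IPv4-over-InfiniBand multicast layout from RFC 4391 (0xff, transient flag plus scope, the 0x401b signature, the P_Key, then the low 28 bits of the group address). The function names ipv4_mcast_to_mgid and mgid_half are hypothetical and are not taken from OpenSM or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not an OpenSM or kernel API): build the 16-byte
 * IPoIB MGID for an IPv4 multicast group, assuming the RFC 4391 layout.
 * "group" is the IPv4 address in host byte order.
 */
static void ipv4_mcast_to_mgid(uint32_t group, uint16_t pkey, uint8_t scope,
                               uint8_t mgid[16])
{
    uint32_t low28 = group & 0x0fffffffu;

    assert((group >> 28) == 0xeu);               /* 224.0.0.0/4 only */

    memset(mgid, 0, 16);
    mgid[0]  = 0xff;                             /* multicast prefix */
    mgid[1]  = (uint8_t)(0x10 | (scope & 0x0f)); /* transient flag + scope */
    mgid[2]  = 0x40;                             /* 0x401b: IPv4 signature */
    mgid[3]  = 0x1b;
    mgid[4]  = (uint8_t)(pkey >> 8);             /* partition key */
    mgid[5]  = (uint8_t)(pkey & 0xff);
    mgid[12] = (uint8_t)(low28 >> 24);           /* low 28 bits of the group */
    mgid[13] = (uint8_t)(low28 >> 16);
    mgid[14] = (uint8_t)(low28 >> 8);
    mgid[15] = (uint8_t)low28;
}

/* Return the upper (lower == 0) or lower 64 bits of the MGID, which is
 * how the OpenSM log line in this thread prints it. */
static uint64_t mgid_half(const uint8_t mgid[16], int lower)
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++)
        v = (v << 8) | mgid[(lower ? 8 : 0) + i];
    return v;
}
```

Called with group 0xE0000016 (224.0.0.22), P_Key 0xffff and scope 2 (link-local), the two halves come out as 0xff12401bffff0000 and 0x0000000000000016, the same MGID that appears in the ERR 1B11 log line.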
From halr at voltaire.com Thu Nov 16 13:49:04 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Thu, 16 Nov 2006 23:49:04 +0200 Subject: [openib-general] OpenSM log growing too big References: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> <455BD985.3010900@3leafnetworks.com> <5CE025EE7D88BA4599A2C8FEFCF226F5018944A5@taurus.voltaire.com> <455CB07C.7080503@3leafnetworks.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AF@taurus.voltaire.com> Hi Venkat, See embedded ... comments below. -- Hal ________________________________ From: Venkatesh Babu [mailto:venkatesh.babu at 3leafnetworks.com] Sent: Thu 11/16/2006 1:39 PM To: Hal Rosenstock Cc: openib-general at openib.org Subject: Re: [openib-general] OpenSM log growing too big Hal Rosenstock wrote: > Not sure what question you are asking exactly. > > Is it what do those messages mean or the file getting large or both ? > > Both. The message looks like LID 5 is generating too many events. Yes, LID 5 is a switch LID and there is a port which is flapping. Bad cable ? The log file grows few MBs a second. What ever the problem with the port it should not generate these many log messages. I guess it is a OpenSM bug. The code is reducing the messages which are similar (approx 128 traps). The SM is repressing the trap and then the switch regenerates it becuase there is a port going up and down. That issue should be resolved. There has been discussion on the list and patches on dealing with the log and limiting its size that are in more recent versions of OpenSM. I'll look at it to see if I can reduce these messages further. > What options are you using on OpenSM startup ? > > root 7703 0.0 0.0 92784 1652 ? Sl 05:00 0:01 /usr/bin/opensm -g 0x005045014ac20001 -p 11 -s 10 -u -f /var/log/opensm.log > Also, any chance you can move forward on a more recent and better > OpenSM ? > > It is difficult to use OpenSM from OFED 1.1. Because we need to do another QA verification cycle with our product. 
But I can find the specific patch to the OpenSM I can apply that patch to the existing OpenSM. I would highly recommend moving to OFED 1.1 OpenSM (from OFED 1.0). Many bugs have been fixed and it is much more robust. VBabu > > -- Hal > From swise at opengridcomputing.com Thu Nov 16 13:51:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 16 Nov 2006 15:51:51 -0600 Subject: [openib-general] opensm problem In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5018944B0@taurus.voltaire.com> References: <1163708338.6286.50.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com> <1163712772.6286.62.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F5018944B0@taurus.voltaire.com> Message-ID: <1163713911.6286.67.camel@stevo-desktop> On Thu, 2006-11-16 at 23:37 +0200, Hal Rosenstock wrote: > Steve, > > Did you configure the kernel differently ? Is IGMP turned on somehow ? > (I haven't run with Sean's multicast code.) IGMP turned on where? > > BTW, as I mentioned, this can be solved on the client side equally as > well as the SM side. How? From rdreier at cisco.com Thu Nov 16 13:57:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 13:57:59 -0800 Subject: [openib-general] [PATCH] IB/ipoib: compliance/interoperability fix In-Reply-To: <20061116085911.GA15138@mellanox.co.il> (Michael S. Tsirkin's message of "Thu, 16 Nov 2006 10:59:12 +0200") References: <20061116085911.GA15138@mellanox.co.il> Message-ID: Thanks, I queued this for 2.6.19. I assume it's been tested carefully? - R. 
From halr at voltaire.com Thu Nov 16 14:01:26 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Nov 2006 00:01:26 +0200 Subject: [openib-general] opensm problem References: <1163708338.6286.50.camel@stevo-desktop><5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com><1163712772.6286.62.camel@stevo-desktop><5CE025EE7D88BA4599A2C8FEFCF226F5018944B0@taurus.voltaire.com> <1163713911.6286.67.camel@stevo-desktop> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944B3@taurus.voltaire.com> Steve, See ... embedded comments below. -- Hal ________________________________ From: Steve Wise [mailto:swise at opengridcomputing.com] Sent: Thu 11/16/2006 4:51 PM To: Hal Rosenstock Cc: openib-general at openib.org Subject: RE: [openib-general] opensm problem On Thu, 2006-11-16 at 23:37 +0200, Hal Rosenstock wrote: > Steve, > > Did you configure the kernel differently ? Is IGMP turned on somehow ? > (I haven't run with Sean's multicast code.) IGMP turned on where? | Not sure what turns this on. I think IP multicast needs to be configured in the kernel. I don't think it is automatic although that might be the default config. Also, using IP multicast (via Sean's multicast code) likely causes IGMP to be used so the routers know the IPmc groups being created/joined/left. > BTW, as I mentioned, this can be solved on the client side equally as > well as the SM side. How? The client side can do a full join with all the SA required characteristics (called components in IB). There are downsides to both approaches: configuration (SM side) versus first joiner node characteristics are the ones enforced for the group which may not be the desired result. 
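The "SA required characteristics" Hal mentions can be read directly off the ERR 1B11 log line earlier in the thread, which reports component mask 0x10083 against an expected 0x130c7. The sketch below decodes the difference. It assumes the MCMemberRecord component bit order used by the kernel's IB_SA_MCMEMBER_REC_* constants (bit 0 = MGID through bit 17 = ProxyJoin); the helper names are hypothetical and are not OpenSM functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* MCMemberRecord components by bit position (assumed to follow the
 * IB_SA_MCMEMBER_REC_* ordering in the kernel's ib_sa.h). */
static const char *const mcmember_comp[] = {
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU", "TClass",
    "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope",
    "JoinState", "ProxyJoin",
};

/* Components the SM expected but the join request did not supply. */
static uint64_t missing_components(uint64_t given, uint64_t expected)
{
    return expected & ~given;
}

/* Print one component name per line for every bit set in mask. */
static void print_components(uint64_t mask)
{
    unsigned int i;

    for (i = 0; i < sizeof(mcmember_comp) / sizeof(mcmember_comp[0]); i++)
        if (mask & (1ULL << i))
            printf("%s\n", mcmember_comp[i]);
}
```

For the masks in the log, missing_components(0x10083, 0x130c7) yields 0x3044, which under the assumed bit order is Q_Key, TClass, SL and FlowLabel. Those would be the extra fields a first joiner has to supply for the join to also create the group.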
From swise at opengridcomputing.com Thu Nov 16 14:07:07 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 16 Nov 2006 16:07:07 -0600
Subject: [openib-general] opensm problem
In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5018944B3@taurus.voltaire.com>
References: <1163708338.6286.50.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F5018944AD@taurus.voltaire.com> <1163712772.6286.62.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F5018944B0@taurus.voltaire.com> <1163713911.6286.67.camel@stevo-desktop> <5CE025EE7D88BA4599A2C8FEFCF226F5018944B3@taurus.voltaire.com>
Message-ID: <1163714827.6286.81.camel@stevo-desktop>

> IGMP turned on where?
> |
> Not sure what turns this on. I think IP multicast needs to be
> configured in the kernel. I don't think it is automatic although that
> might be the default config. Also, using IP multicast (via Sean's
> multicast code) likely causes IGMP to be used so the routers know the
> IPmc groups being created/joined/left.
>
> > BTW, as I mentioned, this can be solved on the client side equally as
> > well as the SM side.
>

I figured out what was causing all the joins. I was running mdnsd
(Multicast DNS daemon). I turned it off and things are nice and quiet.
I guess SUSE 10.1 turns this on by default...

Thanks for the clues!!

Steve.

From sashak at voltaire.com Thu Nov 16 14:52:49 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 17 Nov 2006 00:52:49 +0200
Subject: [openib-general] [PATCH 2/2] libibcommon: enable printf() style format strict checking
In-Reply-To: <20061116150333.GE8811@mellanox.co.il>
References: <11636890971762-git-send-email-sashak@voltaire.com> <11636891023166-git-send-email-sashak@voltaire.com> <20061116150333.GE8811@mellanox.co.il>
Message-ID: <20061116225249.GE32434@sashak.voltaire.com>

On 17:03 Thu 16 Nov , Michael S. Tsirkin wrote:
> > diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h
> > index 83c0679..66afab0 100644
> > --- a/libibcommon/include/infiniband/common.h
> > +++ b/libibcommon/include/infiniband/common.h
> > @@ -114,11 +114,16 @@ #endif
> > #define ENUM_STR_DEF(enumname, last, val) (((unsigned)(val) < last) ? enumname ## _str[val] : "???")
> > #define ENUM_STR_ARRAY(name) char * name ## _str[]
> >
> > +#ifdef __GNUC__
> > +#define STRICT_FORMAT __attribute__((format(printf, 2, 3)))
> > +#else
> > +#define STRICT_FORMAT
> > +#endif
>
> You are polluting the global namespace - macros must be prefixed with
> library name.

This is not "the style" for this library, but I have nothing against
adding a prefix here. Will do.

>
> But anyway - why is this necessary?
> Does anyone actually try compiling libibcommon not in gcc? Why?

I don't know if anyone will want to build this with a non-gcc compiler,
but I know that this attribute is a gcc extension.

> And AFAIK e.g. intel compiler implements this __attribute__.

As well as format(printf(...))? It is nice. I don't have icc to check
this, but feel free to send the patch if you like.

Sasha

>
> > /* util.c: debugging and tracing */
> > -void ibwarn(const char * const fn, char *msg, ...);
> > -void ibpanic(const char * const fn, char *msg, ...);
> > -void logmsg(const char *const fn, char *msg, ...);
> > +void ibwarn(const char * const fn, char *msg, ...) STRICT_FORMAT;
> > +void ibpanic(const char * const fn, char *msg, ...) STRICT_FORMAT;
> > +void logmsg(const char *const fn, char *msg, ...) STRICT_FORMAT;
> >
> > void xdump(FILE *file, char *msg, void *p, int size);
>
> --
> MST

From sean.hefty at intel.com Thu Nov 16 15:00:57 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 16 Nov 2006 15:00:57 -0800
Subject: [openib-general] [Fwd: [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support]
In-Reply-To:
Message-ID: <000001c709d3$1427a380$9dfc070a@amr.corp.intel.com>

>I need to make time to read them over. And I would like to get some
>resolution for the IPoIB crashes that Mellanox sees before we commit
>to merging them into 2.6.20.

I agree that we need to fix the ipoib crashes before merging this upstream.
After that is resolved, I need to make a couple of small updates to the patches
before resubmitting. If this misses 2.6.20, I don't think it'll be a big deal.

- Sean

From swise at opengridcomputing.com Thu Nov 16 15:06:20 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 16 Nov 2006 17:06:20 -0600
Subject: [openib-general] [Fwd: [PATCH 0/7 v2] for 2.6.20 rdma/cma: add userspace support]
In-Reply-To: <000001c709d3$1427a380$9dfc070a@amr.corp.intel.com>
References: <000001c709d3$1427a380$9dfc070a@amr.corp.intel.com>
Message-ID: <1163718380.6286.102.camel@stevo-desktop>

On Thu, 2006-11-16 at 15:00 -0800, Sean Hefty wrote:
> >I need to make time to read them over. And I would like to get some
> >resolution for the IPoIB crashes that Mellanox sees before we commit
> >to merging them into 2.6.20.
>
> I agree that we need to fix the ipoib crashes before merging this upstream.
> After that is resolved, I need to make a couple of small updates to the patches
> before resubmitting. If this misses 2.6.20, I don't think it'll be a big deal.
>
> - Sean

It would be nice to get the user mode connection setup code in 2.6.20.
Without it, there's no user mode support for iwarp.
The instability is in the mcast stuff, right? Can we separate the two and pull in the connection setup support for user mode? Steve. From xma at us.ibm.com Thu Nov 16 15:50:14 2006 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 16 Nov 2006 15:50:14 -0800 Subject: [openib-general] [PATCH/RFC 1/2] IB: Return "maybe_missed_event" hint from ib_req_notify_cq() In-Reply-To: Message-ID: Roland Dreier wrote on 11/16/2006 11:26:31 AM: > > What I have found in ehca driver, n != t doesn't mean it's empty. If poll > > again, there are still some packets in cq. IB_CQ_REPORT_MISSED_EVENTS most > > of the time reports 1. It relies on netif_rx_reschedule() returning 0 to exit > > napi poll. That might be the reason in poll routine for a long time? I will > > rerun my test to use n != 0 to see any difference here. > > Maybe there's an ehca bug in poll CQ? If n != t then it should mean > that the CQ was indeed drained. I would expect a missed event would > be rare, because it means a completion occurs between the last poll CQ > and the request notify, and that shouldn't be that common... > > My rough estimate is that even at a higher throughput than what you're > seeing, IPoIB should only generate ~ 500K completions/sec, which means > the average delay between completions is 2 microseconds. So I > wouldn't expect completions to hit the window between poll and request > notify that often. > > - R. I have tried setting low_latency to 1 to disable TCP prequeue; the throughput increased from 1XXMb/s to 4XXMb/s. If I delayed netif_receive_skb() a little bit, I could get around 1700Mb/s. If I totally disable netif_rx_reschedule(), so that there is no repoll and it returns 0, I could get around 2900Mb/s throughput without seeing packet out-of-order issues. I have also tried adding a spin lock in ipoib_poll(), and I still see packets out of order.
disable prequeue: 2XXMb/s to 4XXMb/s (packets out of order)
slow down netif_receive_skb: 17XXMb/s (packets out of order)
don't handle missed event: 28XXMb/s (no packets out of order)
handle missed event later: 7XXMb/s to 11XXMb/s (packets out of order)
Maybe the ehca driver delivers packets much faster? That makes me think the user process's TCP backlog queue and prequeue might be out of order? Thanks Shirley Ma
From sashak at voltaire.com Thu Nov 16 16:27:40 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 17 Nov 2006 02:27:40 +0200 Subject: [openib-general] DevCon "decision" on userspace In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AB@taurus.voltaire.com> References: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AB@taurus.voltaire.com> Message-ID: <20061117002740.GI32434@sashak.voltaire.com> On 23:19 Thu 16 Nov , Hal Rosenstock wrote: > Following Roland's lead on his userspace libraries (verbs and mthca), the DevCon decided that the userspace trunk will be moved to git with each component maintainer having a public tree with one or more branches to be pushed up to a git trunk. It is a requirement to import all the version history from svn and prune as appropriate. The timeframe for this is TBD. Any comments from maintainers and any consumers of the current userspace trunk ? Hal, we are lucky - we have already converted the src/userspace/management tree to git on the new server: git://staging.openfabrics.org/~sashak/management.git As the conversion tool I used the git-svnimport script distributed with git (the recent version is better - it supports openib subproject imports - '-P' option). There is another tool, git-svn; at one time it was not able to import branches, but it was under active development last time I looked, so it may be better now.
With git-svnimport the command should be something like:
git-svnimport -v -r -m -F -S -C -A \
 -T gen2/trunk -b gen2/branches -t gen2/tags \
 -P src/userspace/management https://openib.org/svn
Some options (like -r, -v) can be omitted - see --help output. It is better (and much, much faster) to run the import against a local SVN repository. After the import you may want to review the resulting git tree, remove unrelated tags and branches, run git-repack -a -d and finally push the imported tree to a public place. Sasha From sashak at voltaire.com Thu Nov 16 16:38:32 2006 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 17 Nov 2006 02:38:32 +0200 Subject: [openib-general] [PATCH 2/2 v2] libibcommon: enable printf() style format strict checking In-Reply-To: <11636891023166-git-send-email-sashak@voltaire.com> References: <11636890971762-git-send-email-sashak@voltaire.com> <11636891023166-git-send-email-sashak@voltaire.com> Message-ID: <20061117003832.GJ32434@sashak.voltaire.com> This enables strict format/args checking for printf() style functions. Signed-off-by: Sasha Khapyorsky --- libibcommon/include/infiniband/common.h | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h index 83c0679..c41217d 100644 --- a/libibcommon/include/infiniband/common.h +++ b/libibcommon/include/infiniband/common.h @@ -114,11 +114,16 @@ static inline uint64_t htonll(uint64_t x #define ENUM_STR_DEF(enumname, last, val) (((unsigned)(val) < last) ? enumname ## _str[val] : "???") #define ENUM_STR_ARRAY(name) char * name ## _str[] +#ifdef __GNUC__ +#define IBCOMMON_STRICT_FORMAT __attribute__((format(printf, 2, 3))) +#else +#define IBCOMMON_STRICT_FORMAT +#endif /* util.c: debugging and tracing */ -void ibwarn(const char * const fn, char *msg, ...); -void ibpanic(const char * const fn, char *msg, ...); -void logmsg(const char *const fn, char *msg, ...); +void ibwarn(const char * const fn, char *msg, ...)
IBCOMMON_STRICT_FORMAT; +void ibpanic(const char * const fn, char *msg, ...) IBCOMMON_STRICT_FORMAT; +void logmsg(const char *const fn, char *msg, ...) IBCOMMON_STRICT_FORMAT; void xdump(FILE *file, char *msg, void *p, int size); -- 1.4.4.g031c-dirty From sean.hefty at intel.com Thu Nov 16 16:42:14 2006 From: sean.hefty at intel.com (Hefty, Sean) Date: Thu, 16 Nov 2006 16:42:14 -0800 Subject: [openib-general] [Fwd: [PATCH 0/7 v2] for 2.6.20 rdma/cma: adduserspace support] Message-ID: >It would be nice to get the user mode connection setup code in 2.6.20. >Without it, there's no user mode support for iwarp. The instability is >in the mcast stuff, right? Can we separate the two and pull in the >connection setup support for user mode? I'd rather not, since removing multicast support changes the ABI. I had another request to try to merge this upstream for 2.6.20 by Voltaire, and I am actively debugging this when I'm not sitting in a conference... (I.e. I will try.) - Sean From bugzilla-daemon at openib.org Thu Nov 16 16:48:40 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Thu, 16 Nov 2006 16:48:40 -0800 (PST) Subject: [openib-general] [Bug 293] New: udev rules should be KERNEL== not KERNEL= Message-ID: <20061117004840.8654D2283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=293 Summary: udev rules should be KERNEL== not KERNEL= Product: OpenFabrics Linux Version: gen2 Platform: All OS/Version: Other Status: NEW Severity: major Priority: P2 Component: IB Core AssignedTo: bugzilla at openib.org ReportedBy: friedman at ucla.edu Apparently newer udev's are complaining about this, FC5 doesn't but works with the change. FC6 will emit errors when the rules are processed ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
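A note on the libibcommon [PATCH 2/2 v2] thread above: the following is only a minimal sketch of what the format(printf, ...) attribute buys, not the libibcommon code. Everything named demo_* or DEMO_* is hypothetical. The indices here are (4, 5) rather than the patch's (2, 3) only because this helper takes two extra leading parameters, so the format string is the fourth argument and the variadic arguments start at the fifth.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical macro name; expands to nothing on non-GCC compilers. */
#ifdef __GNUC__
#define DEMO_STRICT_FORMAT __attribute__((format(printf, 4, 5)))
#else
#define DEMO_STRICT_FORMAT
#endif

/* The format string is parameter 4, the variadic arguments start at 5. */
int demo_warn(char *buf, size_t len, const char *fn, const char *msg, ...)
	DEMO_STRICT_FORMAT;

/* Write "<fn>: <formatted message>" into buf; return length or -1. */
int demo_warn(char *buf, size_t len, const char *fn, const char *msg, ...)
{
	va_list ap;
	int n, m;

	n = snprintf(buf, len, "%s: ", fn);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* prefix alone did not fit */

	va_start(ap, msg);
	m = vsnprintf(buf + n, len - (size_t)n, msg, ap);
	va_end(ap);

	return m < 0 ? -1 : n + m;
}

/* Small self-check showing a call whose arguments match the format. */
void demo_selfcheck(void)
{
	char buf[64];
	int n = demo_warn(buf, sizeof(buf), "ibwarn", "lid %d port %d", 5, 1);

	assert(n == (int)strlen(buf));
	assert(strcmp(buf, "ibwarn: lid 5 port 1") == 0);
}
```

With the attribute in place, a mismatched call such as demo_warn(buf, sizeof(buf), "ibwarn", "lid %s", 5) should draw a format warning from gcc when -Wall (or -Wformat) is enabled, which is the kind of mistake the patch is meant to catch at build time.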
From rdreier at cisco.com Thu Nov 16 16:46:17 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 16:46:17 -0800 Subject: [openib-general] [Fwd: [PATCH 0/7 v2] for 2.6.20 rdma/cma: adduserspace support] In-Reply-To: (Sean Hefty's message of "Thu, 16 Nov 2006 16:42:14 -0800") References: Message-ID: If we're confident on the multicast ABI now, we could stub it out for 2.6.20 (just return -ENOSYS or something). Then the userspace side would fail gracefully against old kernels and we could merge multicast support later. But that adds work to strip out the multicast support. And it assumes that we know the ABI now. - R. From rdreier at cisco.com Thu Nov 16 16:58:29 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 16:58:29 -0800 Subject: [openib-general] [PATCH] RFC libibverbs - Pass provider data through ibv_cmd_req_notify_cq() In-Reply-To: <1160173774.4324.7.camel@stevo-desktop> (Steve Wise's message of "Fri, 06 Oct 2006 17:29:34 -0500") References: <1160173774.4324.7.camel@stevo-desktop> Message-ID: OK, I applied these patches to libibverbs (and a corresponding patch to libmthca) and pushed the new trees out. Steve, can you pull and make sure I got everything you needed in? - R. From halr at voltaire.com Thu Nov 16 17:39:58 2006 From: halr at voltaire.com (Hal Rosenstock) Date: Fri, 17 Nov 2006 03:39:58 +0200 Subject: [openib-general] [PATCH 2/2] libibcommon: enable printf() style format strict checking References: <11636890971762-git-send-email-sashak@voltaire.com> <11636891023166-git-send-email-sashak@voltaire.com> <20061116150333.GE8811@mellanox.co.il> <20061116225249.GE32434@sashak.voltaire.com> Message-ID: <5CE025EE7D88BA4599A2C8FEFCF226F5018944B7@taurus.voltaire.com> See embedded comment below. ________________________________ From: Sasha Khapyorsky [mailto:sashak at voltaire.com] Sent: Thu 11/16/2006 5:52 PM To: Michael S. 
Tsirkin Cc: Hal Rosenstock; openib-general at openib.org Subject: Re: [PATCH 2/2] libibcommon: enable printf() style format strict checking On 17:03 Thu 16 Nov , Michael S. Tsirkin wrote: > > diff --git a/libibcommon/include/infiniband/common.h b/libibcommon/include/infiniband/common.h > > index 83c0679..66afab0 100644 > > --- a/libibcommon/include/infiniband/common.h > > +++ b/libibcommon/include/infiniband/common.h > > @@ -114,11 +114,16 @@ #endif > > #define ENUM_STR_DEF(enumname, last, val) (((unsigned)(val) < last) ? enumname ## _str[val] : "???") > > #define ENUM_STR_ARRAY(name) char * name ## _str[] > > > > +#ifdef __GNUC__ > > +#define STRICT_FORMAT __attribute__((format(printf, 2, 3))) > > +#else > > +#define STRICT_FORMAT > > +#endif > > You are polluting the global namespace - macros must be prefixed with > library name. This is not "the style" for this library, This is something I want to clean up by deprecating the non prefixed names. but I have nothing against adding prefix here. Will do. > But anyway - why is this necessary? > Does anyone actually try compiling libibcommon not in gcc? Why? I don't know if anyone will want to build this with non-gcc compiler, but I know that this attribute is gcc extension. > And AFAIK e.g. intel compiler implements this __attribute__. As well as format(printf(...))? It is nice. I don't have icc to check this, but feel free to send the patch if you like. Sasha > > > /* util.c: debugging and tracing */ > > -void ibwarn(const char * const fn, char *msg, ...); > > -void ibpanic(const char * const fn, char *msg, ...); > > -void logmsg(const char *const fn, char *msg, ...); > > +void ibwarn(const char * const fn, char *msg, ...) STRICT_FORMAT; > > +void ibpanic(const char * const fn, char *msg, ...) STRICT_FORMAT; > > +void logmsg(const char *const fn, char *msg, ...) 
STRICT_FORMAT; > > > > void xdump(FILE *file, char *msg, void *p, int size); > > -- > MST From weiny2 at llnl.gov Thu Nov 16 17:53:04 2006 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 16 Nov 2006 17:53:04 -0800 Subject: [openib-general] add SIGUSR1 to reopen osm.log Message-ID: <20061116175304.668afcab.weiny2@llnl.gov> Our sysadmins have been rotating OpenSM's osm.log file and then restarting OpenSM. As this is a less than optimal solution if you have jobs running on the system, I wrote this patch (against OFED 1.1) which adds a handler for SIGUSR1 that reopens OpenSM's log file without a restart. Ira Weiny weiny2 at llnl.gov -------------- next part -------------- A non-text attachment was scrubbed... Name: sigusr1-logreopen-opensm.patch Type: application/octet-stream Size: 5019 bytes Desc: not available URL: From krkumar2 at in.ibm.com Thu Nov 16 19:49:34 2006 From: krkumar2 at in.ibm.com (Krishna Kumar2) Date: Fri, 17 Nov 2006 09:19:34 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Get rid of extra call to list_empty() In-Reply-To: Message-ID: Hi Tom, OK, I will try it and report what I find. Thanks! - KK Tom Tucker wrote on 11/16/2006 10:08:03 PM: > The race is that you've deleted the work queue element that is enqueued on > the iwcm_wq. It's as simple as that. To prove it to yourself, apply your > patch. Turn on memory debug support in the kernel and recompile your code. > Then run rdma_krping clients in four different threads against your server > with an I/O count of 1. > > You'll hit the race and can look at it yourself. I don't know any better way > to explain it... Sorry. > > > On 11/13/06 10:44 PM, "Krishna Kumar2" wrote: > > > Hi Tom, > > > >> No, to understand why go look at the implementation of queue_work. BTW, > > this > > > > I was describing the implementation of queue_work() in my > > previous mail. So sorry to be dense, but I do not understand > > why this patch introduces a race. Can you explain the race > > that you had found ? 
What I understood of queue_work() is : > > > > If cm_work_handler() is already running and processing the > > last entry at the same time this new entry was added, it is > > guaranteed to find this new entry in it's current run iteration, > > and process it. The only issue is with the extra queue_work > > by iwcm parallely on a different cpu for the same case. > > > > So if iwcm had done a redundant "queue_work" on this queue, > > which, besides adding the new entry to the workqueue, also > > does a wakeup of "worker_thread" (which is still running the > > previous iteration of run_workqueue -> cm_work_handler). > > I am assuming that the wake up function is > > default_wake_function(), since I couldn't locate in wait* code > > where this is initialized. > > > > When cm_work_handler finishes removing this new entry, it > > returns to worker_thread, which will do a schedule() and > > sleep till it is woken up again (since default_wake_function > > found that the thread is already running and had done > > nothing). Are you referring to a race where the queue_work > > is done between the time cm_work_handler finished running > > and before it gets back to schedule ? I feel that should not > > matter as the run_workqueue() will find this entry in it's > > cwq->worklist and continue processing instead of exiting to > > worker_thread() and schedule(). 
> > > > Still confused about the race :) > > > > Thanks, > > > > - KK > > From venkatesh.babu at 3leafnetworks.com Thu Nov 16 19:55:12 2006 From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu) Date: Thu, 16 Nov 2006 19:55:12 -0800 Subject: [openib-general] OpenSM log growing too big In-Reply-To: <5CE025EE7D88BA4599A2C8FEFCF226F5018944AF@taurus.voltaire.com> References: <000101c70448$c02e88a0$8698070a@amr.corp.intel.com> <455BD985.3010900@3leafnetworks.com> <5CE025EE7D88BA4599A2C8FEFCF226F5018944A5@taurus.voltaire.com> <455CB07C.7080503@3leafnetworks.com> <5CE025EE7D88BA4599A2C8FEFCF226F5018944AF@taurus.voltaire.com> Message-ID: <455D32A0.8040803@3leafnetworks.com> Hal Rosenstock wrote: > Yes, LID 5 is a switch LID and there is a port which is flapping. Bad cable ? > > When this port is disconnected the OpenSM stops logging these messages. It could have been bad connection. > The code is reducing the messages which are similar (approx 128 traps). >The SM is repressing the trap and then the switch regenerates it becuase there is a port going up and down. >That issue should be resolved. There has been discussion on the list and patches on dealing with the log and limiting its size that are in more recent versions of OpenSM. I'll look at it to see if I can reduce these messages further. > > It would be great if you can provide this patch. > I would highly recommend moving to OFED 1.1 OpenSM (from OFED 1.0). Many bugs have been fixed and it is much more robust. > > I agree. I am trying to push this. VBabu From krkumar2 at in.ibm.com Thu Nov 16 20:31:25 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 17 Nov 2006 10:01:25 +0530 Subject: [openib-general] [RFC] [PATCH] RDMA/iwcm: Cleanup IWCM_F_CALLBACK_DESTROY usage. Message-ID: <20061117043125.8065.16133.sendpatchset@localhost.localdomain> Cleanup IWCM_F_CALLBACK_DESTROY usage. It is being set only in cm_conn_req_handler(), and that too on a child handle. 
Remove IWCM_F_CALLBACK_DESTROY as the same result can be achieved otherwise. Patch against 2.6.19-rc5. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -161,8 +161,6 @@ static int iwcm_deref_id(struct iwcm_id_ BUG_ON(!list_empty(&cm_id_priv->work_list)); if (waitqueue_active(&cm_id_priv->destroy_comp.wait)) { BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); - BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, - &cm_id_priv->flags)); ret = 1; } complete(&cm_id_priv->destroy_comp); @@ -386,7 +384,6 @@ void iw_destroy_cm_id(struct iw_cm_id *c struct iwcm_id_private *cm_id_priv; cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); - BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)); destroy_cm_id(cm_id); @@ -833,11 +830,12 @@ static void cm_work_handler(void *arg) struct iwcm_id_private *cm_id_priv = work->cm_id; unsigned long flags; int empty; - int ret = 0; spin_lock_irqsave(&cm_id_priv->lock, flags); empty = list_empty(&cm_id_priv->work_list); while (!empty) { + int ret; + work = list_entry(cm_id_priv->work_list.next, struct iwcm_work, list); list_del_init(&work->list); @@ -847,16 +845,13 @@ static void cm_work_handler(void *arg) spin_unlock_irqrestore(&cm_id_priv->lock, flags); ret = process_event(cm_id_priv, &work->event); - if (ret) { - set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); + if (ret) destroy_cm_id(&cm_id_priv->id); - } BUG_ON(atomic_read(&cm_id_priv->refcount)==0); if (iwcm_deref_id(cm_id_priv)) return; - if (atomic_read(&cm_id_priv->refcount)==0 && - test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags)) { + if (ret && atomic_read(&cm_id_priv->refcount) == 0) { dealloc_work_entries(cm_id_priv); kfree(cm_id_priv); return; diff -ruNp org/drivers/infiniband/core/iwcm.h new/drivers/infiniband/core/iwcm.h --- 
org/drivers/infiniband/core/iwcm.h 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.h 2006-10-09 16:52:03.000000000 +0530 @@ -56,7 +56,6 @@ struct iwcm_id_private { struct list_head work_free_list; }; -#define IWCM_F_CALLBACK_DESTROY 1 -#define IWCM_F_CONNECT_WAIT 2 +#define IWCM_F_CONNECT_WAIT 1 #endif /* IWCM_H */ From krkumar2 at in.ibm.com Thu Nov 16 20:31:28 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 17 Nov 2006 10:01:28 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Teach lockdep about nesting of lock-classes. Message-ID: <20061117043128.8065.92489.sendpatchset@localhost.localdomain> I sometimes get this erroneous warning message about lock recursion : : [ INFO: possible recursive locking detected ] : rdma_bw/3558 is trying to acquire lock: : (&cq->lock){....}, at: [] c2_free_qp+0x78/0x180 [iw_c2] : but task is already holding lock: : (&cq->lock){....}, at: [] c2_free_qp+0x6b/0x180 [iw_c2] The fix is to teach lockdep about this nesting of a lock-class. Patch against 2.6.19-rc5. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/hw/amso1100/c2_qp.c new/drivers/infiniband/hw/amso1100/c2_qp.c --- org/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-15 12:40:04.000000000 +0530 +++ new/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-15 13:02:03.000000000 +0530 @@ -578,7 +578,7 @@ void c2_free_qp(struct c2_dev *c2dev, st */ spin_lock_irq(&send_cq->lock); if (send_cq != recv_cq) - spin_lock(&recv_cq->lock); + spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING); c2_free_qpn(c2dev, qp->qpn); From krkumar2 at in.ibm.com Thu Nov 16 20:31:30 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 17 Nov 2006 10:01:30 +0530 Subject: [openib-general] [RFC] [PATCH] RDMA/iwcm: Prevent deadlock in locking. 
Message-ID: <20061117043130.8065.53462.sendpatchset@localhost.localdomain> Since create_qp and destroy_qp can be called from userspace and from other kernel routines, it is possible to swap send_cq and recv_cq in different calls for creating different qp's (RFC). This can result in a deadlock, if the two locks are got out of order. Patch against 2.6.19-rc5. Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/hw/amso1100/c2_qp.c new/drivers/infiniband/hw/amso1100/c2_qp.c --- org/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-15 12:40:04.000000000 +0530 +++ new/drivers/infiniband/hw/amso1100/c2_qp.c 2006-11-16 18:10:03.000000000 +0530 @@ -564,6 +564,32 @@ int c2_alloc_qp(struct c2_dev *c2dev, return err; } +static inline void c2_lock_cqs(struct c2_cq *send_cq, struct c2_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_lock_irq(&send_cq->lock); + else if (send_cq > recv_cq) { + spin_lock_irq(&send_cq->lock); + spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock_irq(&recv_cq->lock); + spin_lock_nested(&send_cq->lock, SINGLE_DEPTH_NESTING); + } +} + +static inline void c2_unlock_cqs(struct c2_cq *send_cq, struct c2_cq *recv_cq) +{ + if (send_cq == recv_cq) + spin_unlock_irq(&send_cq->lock); + else if (send_cq > recv_cq) { + spin_unlock(&recv_cq->lock); + spin_unlock_irq(&send_cq->lock); + } else { + spin_unlock(&send_cq->lock); + spin_unlock_irq(&recv_cq->lock); + } +} + void c2_free_qp(struct c2_dev *c2dev, struct c2_qp *qp) { struct c2_cq *send_cq; @@ -576,15 +602,9 @@ void c2_free_qp(struct c2_dev *c2dev, st * Lock CQs here, so that CQ polling code can do QP lookup * without taking a lock. 
*/ - spin_lock_irq(&send_cq->lock); - if (send_cq != recv_cq) - spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING); - + c2_lock_cqs(send_cq, recv_cq); c2_free_qpn(c2dev, qp->qpn); - - if (send_cq != recv_cq) - spin_unlock(&recv_cq->lock); - spin_unlock_irq(&send_cq->lock); + c2_unlock_cqs(send_cq, recv_cq); /* * Destory qp in the rnic... From krkumar2 at in.ibm.com Thu Nov 16 20:31:23 2006 From: krkumar2 at in.ibm.com (Krishna Kumar) Date: Fri, 17 Nov 2006 10:01:23 +0530 Subject: [openib-general] [PATCH] RDMA/iwcm: Bugs in cm_conn_req_handler() Message-ID: <20061117043123.8065.45420.sendpatchset@localhost.localdomain> cm_conn_req_handler() : 1. Setting IWCM_F_CALLBACK_DESTROY on cm_id (child handle), which doesn't achieve anything, since checking IWCM_F_CALLBACK_DESTROY in the parent's flag (in cm_work_handler) means that this will never be true. 2. Calling destroy_cm_id leaks 3 work 'free' list entries. 3. cm_id is freed up wrongly and not cm_id_priv (though the effect is the same since cm_id is the first element of cm_id_priv, but still a bug if the top level cm_id changes). 4. Reject message has to be sent on failure. Tested this without the fix and found the client hangs, waited for about 20 mins and then did Ctrl-C but the process is unkillable. All 4 above cases were tested by injecting error in iw_conn_req_handler() and they were confirmed. I added the BUG_ON() to confirm the earlier check for refcount == 0. Patch against 2.6.19-rc5. 
Signed-off-by: Krishna Kumar --- diff -ruNp org/drivers/infiniband/core/iwcm.c new/drivers/infiniband/core/iwcm.c --- org/drivers/infiniband/core/iwcm.c 2006-10-09 16:40:04.000000000 +0530 +++ new/drivers/infiniband/core/iwcm.c 2006-10-09 16:52:03.000000000 +0530 @@ -648,10 +648,9 @@ static void cm_conn_req_handler(struct i /* Call the client CM handler */ ret = cm_id->cm_handler(cm_id, iw_event); if (ret) { - set_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags); - destroy_cm_id(cm_id); - if (atomic_read(&cm_id_priv->refcount)==0) - kfree(cm_id); + BUG_ON(atomic_read(&cm_id_priv->refcount) != 1); + iw_cm_reject(cm_id, NULL, 0); + iw_destroy_cm_id(cm_id); } if (iw_event->private_data_len) From rdreier at cisco.com Thu Nov 16 20:45:36 2006 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Nov 2006 20:45:36 -0800 Subject: [openib-general] [PATCH 09/13] Core WQE/CQE Types In-Reply-To: <20061116035912.22635.21736.stgit@dell3.ogc.int> (Steve Wise's message of "Wed, 15 Nov 2006 21:59:12 -0600") References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035912.22635.21736.stgit@dell3.ogc.int> Message-ID: > +struct t3_send_wr { > + struct fw_riwrh wrh; /* 0 */ > + union t3_wrid wrid; /* 1 */ > + > + enum t3_rdma_opcode rdmaop:8; > + u32 reserved:24; /* 2 */ Does this do the right thing wrt endianness? 
I'd be more comfortable with something like u8 rdmaop; u8 reserved[3]; (although the __attribute__((packed)) on enum t3_rdma_opcode does make it OK to use here, I guess) > + u32 rem_stag; /* 2 */ > + u32 plen; /* 3 */ > + u32 num_sgle; > + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ > +}; From arne.redlich at xiranet.com Fri Nov 17 00:28:38 2006 From: arne.redlich at xiranet.com (Arne Redlich) Date: Fri, 17 Nov 2006 09:28:38 +0100 Subject: [openib-general] [PATCH] IB/SRP - increase supported CDB size In-Reply-To: (Roland Dreier's message of "Thu, 16 Nov 2006 09:59:32 -0800") References: <87ac2t0wcb.fsf@confield.dd.xiranet.com> Message-ID: <877ixu79zd.fsf@confield.dd.xiranet.com> Roland Dreier writes: > Definitely makes sense. I queued the following version for 2.6.20, > which gets the max CDB size directly from struct srp_cmd. Does this > look OK to you? Looks good to me. I guess there's no way to sneak it into 2.6.19 anymore, right? ;) Thanks, Arne > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c > index 4b09147..01776c9 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -1716,7 +1716,8 @@ static ssize_t srp_create_target(struct > if (!target_host) > return -ENOMEM; > > - target_host->max_lun = SRP_MAX_LUN; > + target_host->max_lun = SRP_MAX_LUN; > + target_host->max_cmd_len = sizeof ((struct srp_cmd *) (void *) 0L)->cdb; > > target = host_to_target(target_host); > From ramachandra.kuchimanchi at qlogic.com Fri Nov 17 04:41:31 2006 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi) Date: Fri, 17 Nov 2006 06:41:31 -0600 Subject: [openib-general] [PATCH v2 5/11] Implementation of Data path of the communication protocol References: <455A2677.7102.612A002@ramachandra.kuchimanchi.qlogic.com> Message-ID: > While importing these patches, I got several "Space in indent is > followed by a tab." errors. 
For example, the line >> + __constant_cpu_to_be16(ETH_P_8021Q))) { I did test my patches with git-apply --whitespace=error-all before mailing them but did not see any error. I am using GIT 1.3, maybe I should upgrade? It would be great if you could share the options you use for applying patches. That way I can test them properly before mailing them out to you. > which also leads to the comment that there's no reason for > __constant_cpu_to_be16() here -- just use cpu_to_be16 and let the > compiler do the optimization. (the __constant form is only needed in > places where the function call is a syntax error) Thanks for explaining that. I will convert these back to cpu_to_beXX and also work on the other naming related suggestions you had. Regards, Ram From diego.guella at sircomtech.com Fri Nov 17 06:36:39 2006 From: diego.guella at sircomtech.com (Diego Guella) Date: Fri, 17 Nov 2006 15:36:39 +0100 Subject: [openib-general] Infiniband on Debian etch RC1 Message-ID: <005101c70a55$d6264a40$05c8a8c0@DIEGO> Hi, I just installed Debian Etch RC1 on a Dell PowerEdge 1950. I have written an application that performs an RDMA read and write on a server, and it works on Suse 9.3 with OFED 1.0 installed. Now, when I start that application on Debian, I get the following error: ----- libibverbs: Fatal: couldn't read uverbs ABI version. ----- Googling around, I saw that in /sys/class there should be an 'infiniband_verbs' directory, but in Debian there isn't. How can I resolve this? Thanks, Diego From thomas.bub at thomson.net Fri Nov 17 06:19:32 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Fri, 17 Nov 2006 15:19:32 +0100 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation?
Message-ID: Yesterday I complained about a gen2 x86 client RDMA read not working on a gen1 PowerPC server. Today I got my hardware guy measuring the PCI-Express interface on the PowerPC gen1 server. It seems as if the RDMA read is at least started. So all ideas about endianness problems To explain the problem I have to give a little more detail on our application: We are using the InfiniBand interface between a PowerPC gen1 server and a gen1/gen2 x86 client for high-speed film image transport from a scanner (server) to a workstation (client). In order to get the fastest response and performance, the x86 client reads the images out of the scanner by RDMA on a hardware FIFO which is registered as physical memory. Unfortunately the scanner does not always have an image ready to deliver. So to avoid a time-consuming connect, RDMA and disconnect for each and every image, I'm holding the connection up and the scanner FIFO hardware delays the responses to the memory reads by up to 500 msec. And here is now the problem: It seems as if the gen2 stack does not like that long delay on the RDMA. Since a gen1 stack client can live with that long delay, I think the gen2 stack should be able to do the same on the same hardware? The question now is: How to increase the waiting period/timeout for the completion of an RDMA operation? Things are getting clearer. Thomas From thomas.bub at thomson.net Fri Nov 17 06:23:53 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Fri, 17 Nov 2006 15:23:53 +0100 Subject: [openib-general] What could prevent a gen2 x86 client qp from doing RDMA_READ on a gen1 PowerPC client? Message-ID: Dotan and all others. I could isolate the problem with hardware measurement a little more today. I just opened a new thread on the problem, named: "How to increase the waiting period/timeout for the completion of an RDMA operation?" Thanks for the support so far.
Thomas From rdreier at cisco.com Fri Nov 17 08:44:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 08:44:06 -0800 Subject: [openib-general] [RFC] [PATCH] RDMA/iwcm: Prevent deadlock in locking. References: <20061117043130.8065.53462.sendpatchset@localhost.localdomain> Message-ID: Umm.. what's the point of the previous patch that changed spin_lock to spin_lock_nested? All it did was hide this bug. Also BTW your subject lines are off -- these are not RDMA/iwcm patches. - R. From rdreier at cisco.com Fri Nov 17 08:54:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 08:54:06 -0800 Subject: [openib-general] [PATCH 11/13] Core Resource Allocation References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035923.22635.5397.stgit@dell3.ogc.int> Message-ID: > +static u32 next_random(u32 rand) > +{ > + u32 y, ylast; > + > + y = rand; > + ylast = y; > + y = (y * 69069) & 0xffffffff; > + y = (y & 0x80000000) + (ylast & 0x7fffffff); > + if ((y & 1)) > + y = ylast ^ (y > 1) ^ (2567483615UL); > + else > + y = ylast ^ (y > 1); > + y = y ^ (y >> 11); > + y = y ^ ((y >> 7) & 2636928640UL); > + y = y ^ ((y >> 15) & 4022730752UL); > + y = y ^ (y << 18); > + return y; > +} How about just using the kernel's random32()? I haven't read the code really so I don't understand what's being randomized here, but random32() should be more than good enough for a typical randomized algorithm(). - R. 
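[Editor's sketch] The suggestion above amounts to dropping the hand-rolled generator and masking the output of a library PRNG down to the ID width. The following is only a user-space illustration under stated assumptions: the kernel's random32() is not available outside the kernel, so fake_random32() is a hypothetical xorshift32 stand-in, and random_id() and its bit-width parameter are made-up names, not anything from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's random32(): one xorshift32
 * step per call. The state must be nonzero; the seed is arbitrary. */
static uint32_t prng_state = 2463534242u;

static uint32_t fake_random32(void)
{
	uint32_t x = prng_state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	prng_state = x;
	return x;
}

/* Hypothetical helper: return a pseudo-random ID in [0, 2^bits),
 * for bits in 1..32, by masking the generator output. */
static uint32_t random_id(unsigned int bits)
{
	uint32_t mask = (bits >= 32) ? UINT32_MAX
				     : ((UINT32_C(1) << bits) - 1);

	return fake_random32() & mask;
}
```

In a driver, the call to fake_random32() would simply be the kernel's own generator; the masking step is the only logic the caller needs to keep.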
From swise at opengridcomputing.com Fri Nov 17 09:02:44 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 17 Nov 2006 11:02:44 -0600 Subject: [openib-general] [PATCH 09/13] Core WQE/CQE Types In-Reply-To: References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035912.22635.21736.stgit@dell3.ogc.int> Message-ID: <1163782964.8457.35.camel@stevo-desktop> On Thu, 2006-11-16 at 20:45 -0800, Roland Dreier wrote: > > +struct t3_send_wr { > > + struct fw_riwrh wrh; /* 0 */ > > + union t3_wrid wrid; /* 1 */ > > + > > + enum t3_rdma_opcode rdmaop:8; > > + u32 reserved:24; /* 2 */ > > Does this do the right thing wrt endianness? I'd be more comfortable > with something like > > u8 rdmaop; > u8 reserved[3]; > > (although the __attribute__((packed)) on enum t3_rdma_opcode does make > it OK to use here, I guess) > > > + u32 rem_stag; /* 2 */ > > + u32 plen; /* 3 */ > > + u32 num_sgle; > > + struct t3_sge sgl[T3_MAX_SGE]; /* 4+ */ > > +}; I don't really like the bit fields either. I inherited these structs and I'm not adverse to changing them as you suggest to get rid of bit fields. But I think they are correct wrt endianness. I wrote a test program and on a LE machine it put the u8 first in memory followed by the 24 bit reserved. However, I think if you use bit fields less than 8 bits its not endian safe. BTW: I don't have a PPC system (yet) to test this code on BE... Here's a dumb program that plays around with bit fields... 
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Bit-field layout under test: 8 + 24 bits, then 16 + 8 + 8 bits. */
struct foo {
	uint32_t a:8;
	uint32_t b:24;
	uint32_t c:16;
	uint32_t d:8;
	uint32_t e:8;
};

/* Same layout expressed with plain byte-sized members. */
struct bar {
	uint8_t a;
	uint8_t b[3];
	uint16_t c;
	uint8_t d;
	uint8_t e;
};

/* Sub-byte bit fields: declaration order must be swapped per endianness. */
struct bits {
#if 0 /* BE */
	uint32_t a:4;
	uint32_t b:4;
#else /* LE */
	uint32_t b:4;
	uint32_t a:4;
#endif
	uint32_t c:8;
	uint32_t d:8;
	uint32_t e:8;
};

int main(void)
{
	struct foo foo;
	struct bar bar;
	struct bits bits;
	uint8_t *cp;
	int i;

	foo.a = 0x01;
	foo.b = 0x020304;
	foo.c = 0x0506;
	foo.d = 0x07;
	foo.e = 0x08;
	printf("foo cpu: 0x%" PRIx64 "\n", *(uint64_t *)&foo);
	printf("foo mem: ");
	cp = (uint8_t *)&foo;
	for (i = 0; i < 8; i++)
		printf("%02x", *cp++);
	printf("\n");

	bar.a = 0x01;
	bar.b[0] = 0x02;
	bar.b[1] = 0x03;
	bar.b[2] = 0x04;
	bar.c = 0x0506;
	bar.d = 0x07;
	bar.e = 0x08;
	printf("bar cpu: 0x%" PRIx64 "\n", *(uint64_t *)&bar);
	printf("bar mem: ");
	cp = (uint8_t *)&bar;
	for (i = 0; i < 8; i++)
		printf("%02x", *cp++);
	printf("\n");

	bits.a = 0x1;
	bits.b = 0x2;
	bits.c = 0x3;
	bits.d = 0x4;
	bits.e = 0x5;
	printf("bits cpu: 0x%08x\n", *(uint32_t *)&bits);
	printf("bits mem: ");
	cp = (uint8_t *)&bits;
	for (i = 0; i < 4; i++)
		printf("%02x", *cp++);
	printf("\n");
	return 0;
}

From rdreier at cisco.com Fri Nov 17 09:04:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 09:04:06 -0800 Subject: [openib-general] [PATCH] IB/SRP - increase supported CDB size References: <87ac2t0wcb.fsf@confield.dd.xiranet.com> <877ixu79zd.fsf@confield.dd.xiranet.com> Message-ID: > Looks good to me. I guess there's no way to sneak it into 2.6.19 > anymore, right? ;) I thought about it but I couldn't really justify it to myself...
From rdreier at cisco.com Fri Nov 17 09:09:06 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 09:09:06 -0800 Subject: [openib-general] Infiniband on Debian etch RC1 References: <005101c70a55$d6264a40$05c8a8c0@DIEGO> Message-ID: > libibverbs: Fatal: couldn't read uverbs ABI version. Do you have the ib_uverbs kernel module installed?
To load it automatically on boot you can add a line ib_uverbs to your /etc/modules file. BTW out of curiosity are you using the standard Debian kernel and libibverbs/libmthca packages? That's what I would do -- it all should work fine and I would like to hear about it if it doesn't. - R. From rdreier at cisco.com Fri Nov 17 09:09:07 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 09:09:07 -0800 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? References: Message-ID: > How to increase the waiting period/timeout for the completion of an > RDMA operation? This is the ACK timeout I think. What is your gen2 client setting for the local ACK timeout when connecting a QP? - R. From swise at opengridcomputing.com Fri Nov 17 09:25:11 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 17 Nov 2006 11:25:11 -0600 Subject: [openib-general] [PATCH 11/13] Core Resource Allocation In-Reply-To: References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035923.22635.5397.stgit@dell3.ogc.int> Message-ID: <1163784311.8457.44.camel@stevo-desktop> On Fri, 2006-11-17 at 08:54 -0800, Roland Dreier wrote: > > +static u32 next_random(u32 rand) > > +{ > > + u32 y, ylast; > > + > > + y = rand; > > + ylast = y; > > + y = (y * 69069) & 0xffffffff; > > + y = (y & 0x80000000) + (ylast & 0x7fffffff); > > + if ((y & 1)) > > + y = ylast ^ (y > 1) ^ (2567483615UL); > > + else > > + y = ylast ^ (y > 1); > > + y = y ^ (y >> 11); > > + y = y ^ ((y >> 7) & 2636928640UL); > > + y = y ^ ((y >> 15) & 4022730752UL); > > + y = y ^ (y << 18); > > + return y; > > +} > > How about just using the kernel's random32()? > > I haven't read the code really so I don't understand what's being > randomized here, but random32() should be more than good enough for a > typical randomized algorithm(). > > - R. I think we can use random32() or get_random_bytes(). I need to re-review how this algorithm works.
It's randomizing the stag IDs so they are not predictable. Steve. From Thomas.Bub at gmx.net Fri Nov 17 09:28:31 2006 From: Thomas.Bub at gmx.net (Thomas Bub) Date: Fri, 17 Nov 2006 18:28:31 +0100 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? In-Reply-To: Message-ID: <000c01c70a6d$da313740$0301a8c0@Ulrike> Roland, actually I'm answering from home. So I can't tell. Can you give me the exact attribute name so that I can modify it tomorrow? Is this a parameter that the CM I'm using modifies/overwrites? Thomas -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Friday, November 17, 2006 18:09 To: Bub Thomas Cc: openib-general at openib.org; Thomas.Bub at gmx.net; Erez Cohen Subject: Re: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? > How to increase the waiting period/timeout for the completion of an > RDMA operation? This is the ACK timeout I think. What is your gen2 client setting for the local ACK timeout when connecting a QP? - R. From bos at pathscale.com Fri Nov 17 09:53:16 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 17 Nov 2006 09:53:16 -0800 Subject: [openib-general] [PATCH 02/13] Device Discovery and ULLD Linkage In-Reply-To: <20061116035837.22635.13571.stgit@dell3.ogc.int> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035837.22635.13571.stgit@dell3.ogc.int> Message-ID: <455DF70C.7090400@pathscale.com> Steve Wise wrote: > +static inline void *vzmalloc(int size) > +{ > + void *p = vmalloc(size); > + memset(p, 0, size); > + return p; > +} This isn't checking the return value from vmalloc. Also, we could do with a generic vzalloc and vcalloc, just as we now have kzalloc and kcalloc. There are lots of routines like this sitting around. References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035837.22635.13571.stgit@dell3.ogc.int> <455DF70C.7090400@pathscale.com> Message-ID: <1163786358.8457.68.camel@stevo-desktop> On Fri, 2006-11-17 at 09:53 -0800, Bryan O'Sullivan wrote: > Steve Wise wrote: > > > +static inline void *vzmalloc(int size) > > +{ > > + void *p = vmalloc(size); > > + memset(p, 0, size); > > + return p; > > +} > > This isn't checking the return value from vmalloc. > Oops... > Also, we could do with a generic vzalloc and vcalloc, just as we now > have kzalloc and kcalloc. There are lots of routines like this sitting > around. > > References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035847.22635.87333.stgit@dell3.ogc.int> Message-ID: <455DFA78.9090401@pathscale.com> Steve Wise wrote: > +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) > +{ > + struct cpl_tid_release *req; > + > + skb = get_skb(skb, sizeof *req, GFP_KERNEL); > + if (!skb) { > + return; > + } Style micronit: no curlies for single-statement blocks.
> +void __free_ep(struct iwch_ep_common *epc) > +{ > + PDBG("%s ep %p, &refcnt %p state %s, refcnt %d\n", > + __FUNCTION__, epc, &epc->refcnt, > + states[state_read(epc)], atomic_read(&epc->refcnt)); > + > + if (atomic_read(&epc->refcnt) == 1) { > + goto out; > + } > + if (!atomic_dec_and_test(&epc->refcnt)) { > + return; > + } > +out: > + PDBG("free ep %p\n", epc); > + kfree(epc); > +} Whatever you're trying to do with refcounting and atomics here looks extremely dodgy and race-prone to me. Why are you using atomic ops in such a scary manner, instead of just slapping a spinlock around this? Anyway, please drop this atomic refcounting stuff and use embedded krefs instead. You're tunnelling into a bug mine. By the way, it would be more consistent with normal kernel naming conventions to name these refcount-diddling routines ep_get and ep_put, since __ep_free doesn't actually free an object unless it feels like it. > +int __init iwch_cm_init(void) > +{ > + skb_queue_head_init(&rxq); > + > + workq = create_singlethread_workqueue("iw_cxgb3"); > + if (!workq) > + return -ENOMEM; > + > + /* > + * All upcalls from the T3 Core go to sched() to > + * schedule the processing on a work queue. > + */ > + t3c_handlers[CPL_ACT_ESTABLISH] = sched; > + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; > + t3c_handlers[CPL_RX_DATA] = sched; > + t3c_handlers[CPL_TX_DMA_ACK] = sched; > + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; > + t3c_handlers[CPL_ABORT_RPL] = sched; > + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; > + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; > + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; > + t3c_handlers[CPL_PASS_ESTABLISH] = sched; > + t3c_handlers[CPL_PEER_CLOSE] = sched; > + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; > + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; > + t3c_handlers[CPL_RDMA_TERMINATE] = sched; > + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; > + > + /* > + * These are the real handlers that are called from a > + * work queue. 
> + */ > + work_handlers[CPL_ACT_ESTABLISH] = act_establish; > + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; > + work_handlers[CPL_RX_DATA] = rx_data; > + work_handlers[CPL_TX_DMA_ACK] = tx_ack; > + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; > + work_handlers[CPL_ABORT_RPL] = abort_rpl; > + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; > + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; > + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; > + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; > + work_handlers[CPL_PEER_CLOSE] = peer_close; > + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; > + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; > + work_handlers[CPL_RDMA_TERMINATE] = terminate; > + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; > + return 0; > +} This seems mighty peculiar. Why aren't you keeping this stuff in structs, instead of faking up structs via arrays? References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035912.22635.21736.stgit@dell3.ogc.int> Message-ID: <455DFD23.8050504@pathscale.com> Steve Wise wrote: > T3 WQE and CQE structures, defines, etc... I notice that none of the fields in these structs seem to be endianness-annotated, but that there's a lot of cpu_to_be64 and so on being used to frob values into them. Please make sure that the driver passes a sparse check, which it looks like it almost certainly cannot right now. > +#define RING_DOORBELL(doorbell, QPID) { \ > + (writel(((1<<31) | (QPID)), doorbell)); \ > +} Should probably be an inline function instead of a macro. 
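[Editor's sketch] One way the macro-to-inline conversion suggested above could look, written as user-space C so it can be compiled on its own. It assumes a plain volatile 32-bit store can stand in for the kernel's writel(); fake_writel() and ring_doorbell() are hypothetical names, not taken from the patch. The sketch also uses an unsigned constant for bit 31, since shifting a signed 1 left by 31 is not well defined in C.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's writel(): a single 32-bit
 * store to an ordinary memory location. The real writel() also deals
 * with byte order and I/O ordering, which this sketch ignores. */
static inline void fake_writel(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

/* Inline-function form of RING_DOORBELL: set bit 31 and OR in the
 * QP ID. Unlike the macro, the arguments are type-checked and each
 * is evaluated exactly once. */
static inline void ring_doorbell(volatile uint32_t *doorbell, uint32_t qpid)
{
	fake_writel((UINT32_C(1) << 31) | qpid, doorbell);
}
```

A call such as ring_doorbell(db, qpid) then reads like any other function call, and a mistyped argument is caught by the compiler rather than silently pasted into the expansion.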
References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035847.22635.87333.stgit@dell3.ogc.int> <455DFA78.9090401@pathscale.com> Message-ID: <1163788000.8457.88.camel@stevo-desktop> On Fri, 2006-11-17 at 10:07 -0800, Bryan O'Sullivan wrote: > Steve Wise wrote: > > > +static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb) > > +{ > > + struct cpl_tid_release *req; > > + > > + skb = get_skb(skb, sizeof *req, GFP_KERNEL); > > + if (!skb) { > > + return; > > + } > > Style micronit: no curlies for single-statement blocks. > yup. > > +void __free_ep(struct iwch_ep_common *epc) > > +{ > > + PDBG("%s ep %p, &refcnt %p state %s, refcnt %d\n", > > + __FUNCTION__, epc, &epc->refcnt, > > + states[state_read(epc)], atomic_read(&epc->refcnt)); > > + > > + if (atomic_read(&epc->refcnt) == 1) { > > + goto out; > > + } > > + if (!atomic_dec_and_test(&epc->refcnt)) { > > + return; > > + } > > +out: > > + PDBG("free ep %p\n", epc); > > + kfree(epc); > > +} > > Whatever you're trying to do with refcounting and atomics here looks > extremely dodgy and race-prone to me. Why are you using atomic ops in > such a scary manner, instead of just slapping a spinlock around this? > This logic is the same as kfree_skb() (and kref_put() :). It optimizes the case where you're the last one freeing I guess and avoids the dec-and-test in that case.. > Anyway, please drop this atomic refcounting stuff and use embedded krefs > instead. You're tunnelling into a bug mine. The kref put code looks very much like the code above. But I can use krefs to avoid replicating code. > > By the way, it would be more consistent with normal kernel naming > conventions to name these refcount-diddling routines ep_get and ep_put, > since __ep_free doesn't actually free an object unless it feels like it. Again, it was modeled after skb. 
> > > +int __init iwch_cm_init(void) > > +{ > > + skb_queue_head_init(&rxq); > > + > > + workq = create_singlethread_workqueue("iw_cxgb3"); > > + if (!workq) > > + return -ENOMEM; > > + > > + /* > > + * All upcalls from the T3 Core go to sched() to > > + * schedule the processing on a work queue. > > + */ > > + t3c_handlers[CPL_ACT_ESTABLISH] = sched; > > + t3c_handlers[CPL_ACT_OPEN_RPL] = sched; > > + t3c_handlers[CPL_RX_DATA] = sched; > > + t3c_handlers[CPL_TX_DMA_ACK] = sched; > > + t3c_handlers[CPL_ABORT_RPL_RSS] = sched; > > + t3c_handlers[CPL_ABORT_RPL] = sched; > > + t3c_handlers[CPL_PASS_OPEN_RPL] = sched; > > + t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched; > > + t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched; > > + t3c_handlers[CPL_PASS_ESTABLISH] = sched; > > + t3c_handlers[CPL_PEER_CLOSE] = sched; > > + t3c_handlers[CPL_CLOSE_CON_RPL] = sched; > > + t3c_handlers[CPL_ABORT_REQ_RSS] = sched; > > + t3c_handlers[CPL_RDMA_TERMINATE] = sched; > > + t3c_handlers[CPL_RDMA_EC_STATUS] = sched; > > + > > + /* > > + * These are the real handlers that are called from a > > + * work queue. > > + */ > > + work_handlers[CPL_ACT_ESTABLISH] = act_establish; > > + work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl; > > + work_handlers[CPL_RX_DATA] = rx_data; > > + work_handlers[CPL_TX_DMA_ACK] = tx_ack; > > + work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl; > > + work_handlers[CPL_ABORT_RPL] = abort_rpl; > > + work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl; > > + work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl; > > + work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req; > > + work_handlers[CPL_PASS_ESTABLISH] = pass_establish; > > + work_handlers[CPL_PEER_CLOSE] = peer_close; > > + work_handlers[CPL_ABORT_REQ_RSS] = peer_abort; > > + work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl; > > + work_handlers[CPL_RDMA_TERMINATE] = terminate; > > + work_handlers[CPL_RDMA_EC_STATUS] = ec_status; > > + return 0; > > +} > > This seems mighty peculiar. 
Why aren't you keeping this stuff in > structs, instead of faking up structs via arrays? > It's a function handler table. Given an incoming message with a command number, we can index into the table using the command number to get the handler function for that message. From swise at opengridcomputing.com Fri Nov 17 10:32:19 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 17 Nov 2006 12:32:19 -0600 Subject: [openib-general] [PATCH 09/13] Core WQE/CQE Types In-Reply-To: <455DFD23.8050504@pathscale.com> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035912.22635.21736.stgit@dell3.ogc.int> <455DFD23.8050504@pathscale.com> Message-ID: <1163788339.8457.95.camel@stevo-desktop> On Fri, 2006-11-17 at 10:19 -0800, Bryan O'Sullivan wrote: > Steve Wise wrote: > > T3 WQE and CQE structures, defines, etc... > > I notice that none of the fields in these structs seem to be > endianness-annotated, but that there's a lot of cpu_to_be64 and so on > being used to frob values into them. Please make sure that the driver > passes a sparse check, which it looks like it almost certainly cannot > right now. It passes sparse with only a few warnings about calling memset() with a size > 100000. I don't know how to get around this warning, however, because I indeed want to initialize large chunks of memory to zero using memset... The HW is BE. So building WRs that get DMA'd to the adapter needs values in BE. Also, pulling values out of the CQE requires mapping back to CPU byte order. > > > +#define RING_DOORBELL(doorbell, QPID) { \ > > + (writel(((1<<31) | (QPID)), doorbell)); \ > > +} > > Should probably be an inline function instead of a macro. > Ok.
BTW: here is the sparse output: vic13:/home/swise/git/linux-2.6.git # make C=1 CHK include/linux/version.h CHK include/linux/utsrelease.h CHK include/linux/compile.h CHECK drivers/infiniband/hw/cxgb3/iwch_cm.c CC [M] drivers/infiniband/hw/cxgb3/iwch_cm.o CHECK drivers/infiniband/hw/cxgb3/iwch_ev.c CC [M] drivers/infiniband/hw/cxgb3/iwch_ev.o CHECK drivers/infiniband/hw/cxgb3/iwch_cq.c CC [M] drivers/infiniband/hw/cxgb3/iwch_cq.o CHECK drivers/infiniband/hw/cxgb3/iwch_qp.c CC [M] drivers/infiniband/hw/cxgb3/iwch_qp.o CHECK drivers/infiniband/hw/cxgb3/iwch_mem.c CC [M] drivers/infiniband/hw/cxgb3/iwch_mem.o CHECK drivers/infiniband/hw/cxgb3/iwch_provider.c CC [M] drivers/infiniband/hw/cxgb3/iwch_provider.o CHECK drivers/infiniband/hw/cxgb3/iwch.c drivers/infiniband/hw/cxgb3/iwch.c:70:8: warning: memset with byte count of 262144 drivers/infiniband/hw/cxgb3/iwch.c:70:8: warning: memset with byte count of 262144 drivers/infiniband/hw/cxgb3/iwch.c:70:8: warning: memset with byte count of 262144 CC [M] drivers/infiniband/hw/cxgb3/iwch.o CHECK drivers/infiniband/hw/cxgb3/core/cxio_hal.c drivers/infiniband/hw/cxgb3/core/cxio_hal.c:550:8: warning: memset with byte count of 131072 CC [M] drivers/infiniband/hw/cxgb3/core/cxio_hal.o CHECK drivers/infiniband/hw/cxgb3/core/cxio_resource.c CC [M] drivers/infiniband/hw/cxgb3/core/cxio_resource.o LD [M] drivers/infiniband/hw/cxgb3/iw_cxgb3.o From bos at pathscale.com Fri Nov 17 10:45:07 2006 From: bos at pathscale.com (Bryan O'Sullivan) Date: Fri, 17 Nov 2006 10:45:07 -0800 Subject: [openib-general] [PATCH 09/13] Core WQE/CQE Types In-Reply-To: <1163788339.8457.95.camel@stevo-desktop> References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035912.22635.21736.stgit@dell3.ogc.int> <455DFD23.8050504@pathscale.com> <1163788339.8457.95.camel@stevo-desktop> Message-ID: <455E0333.70602@pathscale.com> Steve Wise wrote: > It passes sparse with only a few warnings about calling memset() with a > size > 100000. 
You need to pass in CF=-D__CHECK_ENDIAN__ too, on the kernel build command line. Otherwise, the endianness annotations aren't turned on in the kernel headers, and you get this nice false sense of security. (Steve Wise's message of "Fri, 17 Nov 2006 11:25:11 -0600") References: <20061116035826.22635.61230.stgit@dell3.ogc.int> <20061116035923.22635.5397.stgit@dell3.ogc.int> <1163784311.8457.44.camel@stevo-desktop> Message-ID: > I think we can use random32() or get_random_bytes(). I need to > re-review how this algorithm works. Its randomizing the stag IDs so > they are not predictable. I assume based on the algorithm you have now that they don't need to be cryptographically unpredictable. So random32() would probably be the best thing to do (get_random_bytes() should be used sparingly, since it depletes the kernel entropy pool). - R. From rdreier at cisco.com Fri Nov 17 12:43:59 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 12:43:59 -0800 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? In-Reply-To: <000c01c70a6d$da313740$0301a8c0@Ulrike> (Thomas Bub's message of "Fri, 17 Nov 2006 18:28:31 +0100") References: <000c01c70a6d$da313740$0301a8c0@Ulrike> Message-ID: I think the attribute to look at is struct ibv_qp_attr.timeout. I'm not really sure if the CM sets it or not. - R. From rdreier at cisco.com Fri Nov 17 12:46:03 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 12:46:03 -0800 Subject: [openib-general] [PATCH v2 5/11] Implementation of Data path of the communication protocol In-Reply-To: (Ramachandra Kuchimanchi's message of "Fri, 17 Nov 2006 06:41:31 -0600") References: <455A2677.7102.612A002@ramachandra.kuchimanchi.qlogic.com> Message-ID: > I did test my patches with git-apply --whitespace=error-all, > before mailing them but did not see any error. > I am using GIT 1.3, may be I should upgrade ? Yes, git 1.3 is not that new. 
I am using git 1.4.3.2, and I think the relevant code you are missing is in the changeset commit d0c25035df4897bb58422b4d64f00b54cf11f07e Author: Junio C Hamano Date: Sat Sep 23 00:37:19 2006 -0700 git-apply: second war on whitespace. This makes --whitespace={warn,error,strip} option to also notice the leading whitespace errors in addition to the trailing whitespace errors. Spaces that are followed by a tab in indent are detected as errors, and --whitespace=strip option fixes them. Signed-off-by: Junio C Hamano From swise at opengridcomputing.com Fri Nov 17 14:22:58 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Fri, 17 Nov 2006 16:22:58 -0600 Subject: [openib-general] question - mapping QPIDs back to ptrs Message-ID: <1163802178.8457.159.camel@stevo-desktop> The Chelsio driver is hogging lots of memory right now for mapping PDIDs, QPIDs, CQIDs, and STAG IDs back to their respective kernel structures. This is done via an array of pointers, indexed by the ID. The critical performance mapping is finding a QP struct from the QPID in the poll path. Arrays are fast, but it hogs LOTS of kernel memory. And I've been asked WTF am I doing with all the memory. :-\ So I need to change this. Can other RDMA providers point me at similar code in their drivers? Are you using IDRs for this? Or some other structure? Or do you not need to do this in your implementation? Thanks, Steve. From rdreier at cisco.com Fri Nov 17 15:00:09 2006 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 17 Nov 2006 15:00:09 -0800 Subject: [openib-general] question - mapping QPIDs back to ptrs In-Reply-To: <1163802178.8457.159.camel@stevo-desktop> (Steve Wise's message of "Fri, 17 Nov 2006 16:22:58 -0600") References: <1163802178.8457.159.camel@stevo-desktop> Message-ID: > The Chelsio driver is hogging lots of memory right now for mapping > PDIDs, QPIDs, CQIDs, and STAG IDs back to their respective kernel > structures. This is done via an array of pointers, indexed by the ID. 
> The critical performance mapping is finding a QP struct from the QPID in > the poll path. mthca rolls its own two-level sparse arrays (the mthca_array_xxx) stuff, but it would probably be smarter to use the kernel's radix tree stuff. I've been meaning to benchmark mthca after converting to radix trees for those tables, to see if it makes a difference. - R. From dotanb at dev.mellanox.co.il Fri Nov 17 21:55:09 2006 From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il) Date: Sat, 18 Nov 2006 07:55:09 +0200 (IST) Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? In-Reply-To: References: <000c01c70a6d$da313740$0301a8c0@Ulrike> Message-ID: <1941.85.65.224.126.1163829309.squirrel@dev.mellanox.co.il> > I think the attribute to look at is struct ibv_qp_attr.timeout. I'm > not really sure if the CM sets it or not. > > - R. The CM sets this attribute and it called "local_ack_timeout" in the CM structures. Dotan From thomas.bub at thomson.net Sat Nov 18 03:40:52 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Sat, 18 Nov 2006 12:40:52 +0100 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? Message-ID: Thanks that helped. Thomas > -----Original Message----- > From: dotanb at dev.mellanox.co.il [mailto:dotanb at dev.mellanox.co.il] > Sent: Saturday, November 18, 2006 6:55 AM > To: Roland Dreier > Cc: Thomas Bub; 'Erez Cohen'; Bub Thomas; openib-general at openib.org > Subject: Re: [openib-general] How to increase the waiting period/timeout > for the completion of an RDMA operation? > > > I think the attribute to look at is struct ibv_qp_attr.timeout. I'm > > not really sure if the CM sets it or not. > > > > - R. > > The CM sets this attribute and it called "local_ack_timeout" in the CM > structures. 
> > Dotan > From thomas.bub at thomson.net Sat Nov 18 03:58:19 2006 From: thomas.bub at thomson.net (Bub Thomas) Date: Sat, 18 Nov 2006 12:58:19 +0100 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? Message-ID: Oops, I forgot to mention that the user parameter that fills this local_ack_timeout is packet_life_time, taken from the ib_sa_path_record handed into ib_cm_send_req. Maybe someone can put this into the upcoming man pages, if not already done. Thomas > -----Original Message----- > From: Bub Thomas > Sent: Saturday, November 18, 2006 12:41 PM > To: 'dotanb at dev.mellanox.co.il'; Roland Dreier > Cc: Thomas Bub; 'Erez Cohen'; openib-general at openib.org > Subject: RE: [openib-general] How to increase the waiting period/timeout > for the completion of an RDMA operation? > > Thanks that helped. > Thomas > > > > -----Original Message----- > > From: dotanb at dev.mellanox.co.il [mailto:dotanb at dev.mellanox.co.il] > > Sent: Saturday, November 18, 2006 6:55 AM > > To: Roland Dreier > > Cc: Thomas Bub; 'Erez Cohen'; Bub Thomas; openib-general at openib.org > > Subject: Re: [openib-general] How to increase the waiting period/timeout > > for the completion of an RDMA operation? > > > > > I think the attribute to look at is struct ibv_qp_attr.timeout. I'm > > > not really sure if the CM sets it or not. > > > > > > - R. > > > > The CM sets this attribute and it called "local_ack_timeout" in the CM > > structures. > > > > Dotan > > From swise at opengridcomputing.com Sat Nov 18 07:20:53 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 18 Nov 2006 09:20:53 -0600 Subject: [openib-general] [PATCH] RFC libibverbs - Pass provider data through ibv_cmd_req_notify_cq() In-Reply-To: References: <1160173774.4324.7.camel@stevo-desktop> Message-ID: <1163863253.24099.10.camel@stevo-desktop> I cloned the git trees ok.
I'm testing now, but crashing in ibv_open_device because it branched to 0 due to the NULL amso lib alloc_context. This all worked before, so something changed maybe. I recompiled everything but that didn't help. Dunno what's up exactly. I'll probably not continue debug on this until the middle of next week due to priorities. But I'll email back once I learn more... Thanks, Steve. PS: Here's a stack trace: Program received signal SIGSEGV, Segmentation fault. 0x0000000000000000 in ?? () (gdb) bt #0 0x0000000000000000 in ?? () #1 0x00002b069d316699 in ibv_open_device (device=0x5077e0) at src/device.c:126 #2 0x00002b069d20ea10 in ucma_init () at src/cma.c:224 #3 0x00002b069d210632 in rdma_create_event_channel () at src/cma.c:285 #4 0x00000000004022ed in main (argc=9, argv=0x7fff0d9b89c8) at examples/rping.c:1069 (gdb) up #1 0x00002b069d316699 in ibv_open_device (device=0x5077e0) at src/device.c:126 126 context = device->ops.alloc_context(device, cmd_fd); (gdb) p *device $1 = {ops = {alloc_context = 0, free_context = 0x2b069d8682e0 }, node_type = IBV_NODE_RNIC, transport_type = IBV_TRANSPORT_IWARP, name = "amso0\000erbs1/device/vendor\000\000\000\000\000\000\000\uffff\uffff\001\000\000\000\000\000\uffffYd\235\006+\000\000\uffffbP", '\0' , dev_name = "uverbs1\0000\000\000\000\000\000\000\000@\000\000\000\000\000\000\000\uffffwP\000\000\000\000\000ss/infiniband_verbs/uverbs1/devi", dev_path = "/sys/class/infiniband_verbs/uverbs1", '\0' , "0\000\000\000\000\000\000\000@\000\000\000\000\000\000\000 at xP\000\000\000\000\000ss/infiniband_verbs/uverbs1/device/vendor\000\000\000\000\000\000\000\021\uffff\001\000\000\000\000\000\uffffYd\235\006+\000\000\200xP", '\0' , "0\000\000\000\000\000\000\000@\000\000\000\000\000\000\000\uffffxP\000\000\000\000\000ss/infiniband_verbs/uverbs1/de"..., ibdev_path = "/sys/class/infiniband/amso0\000\000\000\000\000@\000\000\000\000\000\000\000 
yP\000\000\000\000\000ss/infiniband_verbs/uverbs1/device/vendor\000\000\000\000\000\000\0001\uffff\001\000\000\000\000\000\uffffYd\235\006+\000\000`yP", '\0' , "0\000\000\000\000\000\000\000@\000\000\000\000\000\000\000\220yP\000\000\000\000\000ss/infiniband_verbs/uverbs1/device/device\000\000\000\000\000\000\000\uffff\uffff\001"...} On Thu, 2006-11-16 at 16:58 -0800, Roland Dreier wrote: > OK, I applied these patches to libibverbs (and a corresponding patch > to libmthca) and pushed the new trees out. Steve, can you pull and > make sure I got everything you needed in? > > - R. From swise at opengridcomputing.com Sat Nov 18 07:23:51 2006 From: swise at opengridcomputing.com (Steve Wise) Date: Sat, 18 Nov 2006 09:23:51 -0600 Subject: [openib-general] account on the new ofa server In-Reply-To: <1163705170.6286.29.camel@stevo-desktop> References: <1163705170.6286.29.camel@stevo-desktop> Message-ID: <1163863431.24099.14.camel@stevo-desktop> Who maintains the new OFA git server? I'd like to put some git trees on the server for the Ammasso lib and Chelsio lib + kernel driver I'll need a login account... Thanks, Steve. On Thu, 2006-11-16 at 13:26 -0600, Steve Wise wrote: > How do I get an account on the new ofa server? > > Thanks, > > Steve. 
> > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From eitan at mellanox.co.il Sat Nov 18 08:58:53 2006 From: eitan at mellanox.co.il (Eitan Zahavi) Date: Sat, 18 Nov 2006 11:58:53 -0500 Subject: [openib-general] [PATCH] OpenSM: When not forcing link speed, use LinkSpeedSupported as LinkSpeedEnabled In-Reply-To: <1162934443.2783.894.camel@hal.voltaire.com> References: <1162934443.2783.894.camel@hal.voltaire.com> Message-ID: <455F3BCD.9020001@mellanox.co.il> Hi Hal, What the patch does is to force the LinkSpeedEnabled to IB_PORT_LINK_SPEED_ENABLED_MASK. The text implies that this would make the LinkSpeedEnabled be equal to LinkSpeedSupported. What I do not get is how come a static mask can represent a port specific attribute like LinkSpeedSupported. Eitan Hal Rosenstock wrote: > OpenSM: When not forcing link speed, use LinkSpeedSupported as > LinkSpeedEnabled. 
> > Signed-off-by: Hal Rosenstock > Index: opensm/osm_lid_mgr.c > =================================================================== > -- opensm/osm_lid_mgr.c (revision 10056) > +++ opensm/osm_lid_mgr.c (working copy) > @@ -1155,7 +1155,7 @@ __osm_lid_mgr_set_physp_pi( > if ( p_mgr->p_subn->opt.force_link_speed ) > ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); > else > - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); > + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); > if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, > sizeof(p_pi->link_speed) )) > send_set = TRUE; > Index: opensm/osm_link_mgr.c > =================================================================== > -- opensm/osm_link_mgr.c (revision 10056) > +++ opensm/osm_link_mgr.c (working copy) > @@ -313,7 +313,7 @@ __osm_link_mgr_set_physp_pi( > if ( p_mgr->p_subn->opt.force_link_speed ) > ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); > else > - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); > + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); > if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, > sizeof(p_pi->link_speed) )) > send_set = TRUE; > > > > _______________________________________________ > openib-general mailing list > openib-general at openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From halr at voltaire.com Sat Nov 18 09:16:18 2006 From: halr at voltaire.com (Hal Rosenstock) Date: 18 Nov 2006 12:16:18 -0500 Subject: [openib-general] [PATCH] OpenSM: When not forcing link speed, use LinkSpeedSupported as LinkSpeedEnabled In-Reply-To: <455F3BCD.9020001@mellanox.co.il> References: <1162934443.2783.894.camel@hal.voltaire.com> <455F3BCD.9020001@mellanox.co.il> Message-ID: 
<1163870173.22090.7210.camel@hal.voltaire.com> Hi Eitan, On Sat, 2006-11-18 at 11:58, Eitan Zahavi wrote: > Hi Hal, > > What the patch does is to force the LinkSpeedEnabled to > IB_PORT_LINK_SPEED_ENABLED_MASK. > The text implies that this would make the LinkSpeedEnabled be equal to > LinkSpeedSupported. > What I do not get is how come a static mask can represent a port > specific attribute like LinkSpeedSupported. The mask value is the right value for the value of setting LinkSpeedEnabled set to LinkSpeedSupport value (15). There was no define in ib_types for this and other. It would be cleaner to add them there but I reused this. Do you prefer I add the define for this (and some other missing defines there) ? Does this make more sense now ? -- Hal > Eitan > > Hal Rosenstock wrote: > > OpenSM: When not forcing link speed, use LinkSpeedSupported as > > LinkSpeedEnabled. > > > > Signed-off-by: Hal Rosenstock > > Index: opensm/osm_lid_mgr.c > > =================================================================== > > -- opensm/osm_lid_mgr.c (revision 10056) > > +++ opensm/osm_lid_mgr.c (working copy) > > @@ -1155,7 +1155,7 @@ __osm_lid_mgr_set_physp_pi( > > if ( p_mgr->p_subn->opt.force_link_speed ) > > ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); > > else > > - ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); > > + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); > > if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, > > sizeof(p_pi->link_speed) )) > > send_set = TRUE; > > Index: opensm/osm_link_mgr.c > > =================================================================== > > -- opensm/osm_link_mgr.c (revision 10056) > > +++ opensm/osm_link_mgr.c (working copy) > > @@ -313,7 +313,7 @@ __osm_link_mgr_set_physp_pi( > > if ( p_mgr->p_subn->opt.force_link_speed ) > > ib_port_info_set_link_speed_enabled( p_pi, IB_LINK_SPEED_ACTIVE_2_5 ); > > else > > - 
ib_port_info_set_link_speed_enabled( p_pi, ib_port_info_get_link_speed_enabled(p_old_pi) ); > > + ib_port_info_set_link_speed_enabled( p_pi, IB_PORT_LINK_SPEED_ENABLED_MASK ); > > if (memcmp( &p_pi->link_speed, &p_old_pi->link_speed, > > sizeof(p_pi->link_speed) )) > > send_set = TRUE; > > > > > > > > _______________________________________________ > > openib-general mailing list > > openib-general at openib.org > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From dotanb at dev.mellanox.co.il Sat Nov 18 23:27:40 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 19 Nov 2006 09:27:40 +0200 Subject: [openib-general] How to increase the waiting period/timeout for the completion of an RDMA operation? In-Reply-To: References: Message-ID: <4560076C.3030300@dev.mellanox.co.il> Bub Thomas wrote: >Oops, >I forgot to mention that the user paramter that fills this >local_ack_timeout is called packet_life_time out of the >ib_sa_path_record handed into the ib_cm_send_req. > >Maybe someone can put this into the upcoming man pages, if not already >done. > >Thomas > > As much as i know, there are only man pages to the libibverbs (that i wrote), and it will be in OFED 1.2. For now, there aren't any man pages to the other libraries .... :( Dotan From bugzilla-daemon at openib.org Sat Nov 18 23:36:09 2006 From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org) Date: Sat, 18 Nov 2006 23:36:09 -0800 (PST) Subject: [openib-general] [Bug 272] IPoIB: kernel Oops as a result of interface Up/Down Message-ID: <20061119073609.C49102283D4@openib.ca.sandia.gov> http://openib.org/bugzilla/show_bug.cgi?id=272 ------- Comment #1 from vlad at mellanox.co.il 2006-11-18 23:36 ------- Created an attachment (id=68) --> (http://openib.org/bugzilla/attachment.cgi?id=68&action=view) Patch fixes this issue (by Roland) This patch fix race between cancel and receive. 
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at openib.org Sat Nov 18 23:36:56 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Sat, 18 Nov 2006 23:36:56 -0800 (PST)
Subject: [openib-general] [Bug 272] IPoIB: kernel Oops as a result of interface Up/Down
Message-ID: <20061119073656.54DC92283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=272

vlad at mellanox.co.il changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #2 from vlad at mellanox.co.il 2006-11-18 23:36 -------
Fixed. The patch attached.

From bugzilla-daemon at openib.org Sat Nov 18 23:39:39 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Sat, 18 Nov 2006 23:39:39 -0800 (PST)
Subject: [openib-general] [Bug 272] IPoIB: kernel Oops as a result of interface Up/Down
Message-ID: <20061119073939.4B0832283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=272

vlad at mellanox.co.il changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

From tziporet at dev.mellanox.co.il Sun Nov 19 06:00:41 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 19 Nov 2006 16:00:41 +0200
Subject: [openib-general] account on the new ofa server
In-Reply-To: <1163863431.24099.14.camel@stevo-desktop>
References: <1163705170.6286.29.camel@stevo-desktop> <1163863431.24099.14.camel@stevo-desktop>
Message-ID: <45606389.8070909@dev.mellanox.co.il>

Steve Wise wrote:
> Who maintains the new OFA git server?
I'd like to put some git trees on > the server for the Ammasso lib and Chelsio lib + kernel driver > > I'll need a login account... > > Thanks, > > Steve. > > > On Thu, 2006-11-16 at 13:26 -0600, Steve Wise wrote: > >> How do I get an account on the new ofa server? >> >> Thanks, >> >> Steve. >> >> >> >> > Need to approach Johann George from Qlogic (johann.george at qlogic.com) Tziporet From dotanb at dev.mellanox.co.il Sun Nov 19 07:42:54 2006 From: dotanb at dev.mellanox.co.il (Dotan Barak) Date: Sun, 19 Nov 2006 17:42:54 +0200 Subject: [openib-general] a question about the local/remote CM.REQ attruibutes Message-ID: <45607B7E.8040207@dev.mellanox.co.il> Hi Sean. I'm validating the values of the CM:REQ packet and i noticed the following behavior: When side A sends a REQ message to side B, the values of several attributes are being switched: for example: side A puts the following values: REQ:initiator_depth = 1 REQ:responder_resources = 2 Side B (in the event) get the following values in the CM event (in the ib_cm_req_event_param structure) REQ:initiator_depth = 2 REQ:responder_resources = 1 I see the same behavior for the attributes: local_cm_response_timeout <--> remote_cm_response_timeout. Is this is the behavior i should expect from the CM? thanks Dotan From moshek at voltaire.com Sun Nov 19 07:45:15 2006 From: moshek at voltaire.com (Moshe Kazir) Date: Sun, 19 Nov 2006 17:45:15 +0200 Subject: [openib-general] mstflint error on ppc64 bug fix Message-ID: Michael, Looking at the attached error file will show a big endian data displayed in wrong little endian order. The attached file mstflint2.patch fix this problem. The mstflint2.patch has to be used after mstflint.patch packed in the OFED-1.1 openib.tgz.(user_patches...) The patch change only the displayed data and not the program's used internal structures , as I found that data write is performed o.k. and I did not want to cause errors in the writing process . 
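The query output in the session log attached to this thread (a PSID written as 0123456789012345 and read back as 3210765410985432, a VSD written as 01234567 and read back as 32107654) is what one would expect if the bytes inside each 32-bit word of the ASCII field were reversed on display. A minimal sketch of that assumption follows; `swap_words` is a hypothetical helper used only for illustration and is not part of mstflint:

```python
def swap_words(text: str, word_size: int = 4) -> str:
    """Reverse the byte order inside each word_size-byte word of an ASCII string.

    Hypothetical helper: it models the display symptom, not mstflint's code.
    """
    data = text.encode("ascii")
    if len(data) % word_size:
        raise ValueError("length must be a multiple of the word size")
    # Walk the buffer one word at a time and reverse the bytes of each word.
    swapped = b"".join(
        data[i:i + word_size][::-1] for i in range(0, len(data), word_size)
    )
    return swapped.decode("ascii")


if __name__ == "__main__":
    psid = "0123456789012345"
    shown = swap_words(psid)
    print(shown)
    # Applying the same swap a second time restores the original string.
    print(swap_words(shown) == psid)
```

Because the swap is its own inverse, a fix limited to the display path can undo it without touching the data that is written to flash, provided the assumption above holds.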
Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: Moshe Kazir Sent: Monday, October 30, 2006 4:00 PM To: 'Michael S. Tsirkin' Cc: openib-general at openib.org; openfabrics-ewg at openib.org Subject: mstflint error on ppc64 Hi Michael, The output of mstflint is changed on ppc64 as result of byte ordering issues. If you take a HCA that was burned using x86_64 or Mellanox manufacturing and perform mstflint -d ... q on ppc64 you'll find that the value of PSID VSD and Board Id was changed. I tried to look at the code to find the error, but then I saw that vsd is defined twice in the code according to it's usage (char[205], or unsigned int[52] ) Can you please look and help ? Best regards, Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: mstflint_ppc64_byte_ordering_output_error.txt URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mstflint2.patch Type: application/octet-stream Size: 2544 bytes Desc: mstflint2.patch URL: From mst at mellanox.co.il Sun Nov 19 07:53:33 2006 From: mst at mellanox.co.il (Michael S. Tsirkin) Date: Sun, 19 Nov 2006 17:53:33 +0200 Subject: [openib-general] mstflint error on ppc64 bug fix In-Reply-To: References: Message-ID: <20061119155333.GA22700@mellanox.co.il> Thanks. The right thing is no to swap PSID regardless of endian-ness though. I'll give it a look. Quoting r. Moshe Kazir : Subject: RE: mstflint error on ppc64 bug fix Michael, Looking at the attached error file will show a big endian data displayed in wrong little endian order. 
The attached file mstflint2.patch fix this problem. The mstflint2.patch has to be used after mstflint.patch packed in the OFED-1.1 openib.tgz.(user_patches...) The patch change only the displayed data and not the program's used internal structures , as I found that data write is performed o.k. and I did not want to cause errors in the writing process . Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: Moshe Kazir Sent: Monday, October 30, 2006 4:00 PM To: 'Michael S. Tsirkin' Cc: openib-general at openib.org; openfabrics-ewg at openib.org Subject: mstflint error on ppc64 Hi Michael, The output of mstflint is changed on ppc64 as result of byte ordering issues. If you take a HCA that was burned using x86_64 or Mellanox manufacturing and perform mstflint -d ... q on ppc64 you'll find that the value of PSID VSD and Board Id was changed. I tried to look at the code to find the error, but then I saw that vsd is defined twice in the code according to it's usage (char[205], or unsigned int[52] ) Can you please look and help ? Best regards, Moshe ____________________________________________________________ Moshe Katzir | +972-9971-8639 (o) | +972-52-860-6042 (m) Voltaire - The Grid Backbone www.voltaire.com Content-Description: mstflint_ppc64_byte_ordering_output_error.txt Script started on Thu Nov 16 11:17:12 2006 js21-sles10:~ # mstflint -d 0c:00.0 -i /usr/voltaire/fw/BC-HSEC-128-SDR-Rev1-25208-4_7_927.img -vsd 1 01234567 -vsd2 0123456789012345 -nofs burn Current FW version on flash: N/A New FW version: N/A Burn image with the following GUIDs: Node: 0008f1040398047c Port1: 0008f1040398047d Port2: 0008f1040398047e Sys.Image: 0008f1040398047f You are about to replace current PSID in the image file - "0123456789012345" with a different PSID - "0TLV00700003". Note: It is highly recommended not to change the image PSID. 
Do you want to continue ? (y/n) [n] : y You are about to replace current PSID on flash - "3210765410985432" with a different PSID - "0123456789012345". Note: It is highly recommended not to change the PSID. Do you want to continue ? (y/n) [n] : y Burn process will not be failsafe. No checks are performed. ALL flash, including Invariant Sector will be overwritten. If this process fails computer may remain in inoperable state. Do you want to continue ? (y/n) [n] : y 000%000%001%002%003%004%005%006%007%008%009%010%011%012%013%014%015%016%017%018%019%020%021%022%023%024%025%026%027%028%029%030%031%032%033%034%035%036%037%038%039%040%041%042%043%044%045%046%047%048%049%050%051%052%053%054%055%056%057%058%059%060%061%062%063%064%065%066%067%068%069%070%071%072%073%074%075%076%077%078%079%080%081%082%083%084%085%086%087%088%089%090%091%092%093%094%095%096%097%098%099%100%100% js21-sles10:~ # mstflint -d 0c:00.0 q Image type: Failsafe I.S. Version: 1 Chip Revision: A0 GUID Des: Node Port1 Port2 Sys image GUIDs: 0008f1040398047c 0008f1040398047d 0008f1040398047e 0008f1040398047f Board ID: 32107654 (3210765410985432) VSD: 32107654 PSID: 3210765410985432 js21-sles10:~ # js21-sles10:~ # exit Script done on Thu Nov 16 11:28:46 2006 -- MST From greg.lindahl at qlogic.com Sun Nov 19 14:04:58 2006 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Sun, 19 Nov 2006 14:04:58 -0800 Subject: [openib-general] mthca question In-Reply-To: <1163626947.13803.48.camel@stevo-desktop> References: <1163626947.13803.48.camel@stevo-desktop> Message-ID: <20061119220458.GA19102@greglaptop.hsd1.ca.comcast.net> Or remove SVN code which is misleading, as it continues to mislead people repeatedly. -- greg > We should put this type of warning in all the infiniband/core modules > that have moved to the kernel... > > > On Wed, 2006-11-15 at 11:29 -0800, Roland Dreier wrote: > > > #warning The mthca driver is no longer kept up to date in svn. 
> > > #warning For the latest code, track the upstream kernel.
> > >
> > > What does this mean? What is the upstream kernel? Where do
> > > I download the latest sources from?
> >
> > this means that the definitive source for the mthca driver is the
> > standard Linux kernel. The upstream kernel just means Linus's kernel
> > tree, which you can download from kernel.org or any of the many mirrors.
> >
> > - R.

From vishal at endace.com Sun Nov 19 18:38:18 2006
From: vishal at endace.com (vishal)
Date: Mon, 20 Nov 2006 15:38:18 +1300
Subject: [openib-general] rping.c client doesn't connect to the server!
Message-ID: <1163990298.22576.16.camel@julia.et.endace.com>

Hi,

I got the following messages when I executed rping in client debug mode
(./rping -c -a 10.0.0.1 -p 9999 -C 2 -d -S 26):

created cm_id 0x508720
cma_event type 0 cma_id 0x508720 (parent)
cma_event type 3 cma_id 0x508720 (parent)
cma event 3, error 0

And this is what I found when running dmesg:

rping[31136]: segfault at 0000000000000018 rip 00002b6241b157d4 rsp 00007fff691ba340 error 4

Any ideas?

Thanks!
Vishal

From randy.dunlap at oracle.com Sun Nov 19 18:44:37 2006
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Sun, 19 Nov 2006 18:44:37 -0800
Subject: [openib-general] infiniband section mismatches
Message-ID: <20061119184437.2608912c.randy.dunlap@oracle.com>

2.6.19-rc6-git2: with CONFIG_HOTPLUG=n:

WARNING: drivers/infiniband/hw/amso1100/iw_c2.o - Section mismatch: reference to .init.text:c2_init_pd_table from .text between 'c2_rnic_init' (at offset 0x25c6) and 'c2_add_addr'
WARNING: drivers/infiniband/hw/amso1100/iw_c2.o - Section mismatch: reference to .init.text:c2_init_qp_table from .text between 'c2_rnic_init' (at offset 0x25d5) and 'c2_add_addr'
WARNING: drivers/infiniband/hw/amso1100/iw_c2.o - Section mismatch: reference to .exit.text:c2_cleanup_qp_table from .text between 'c2_rnic_term' (at offset 0x1e64) and 'c2_del_addr'
WARNING: drivers/infiniband/hw/amso1100/iw_c2.o - Section mismatch: reference to .exit.text:c2_cleanup_pd_table from .text between 'c2_rnic_term' (at offset 0x1e6c) and 'c2_del_addr'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text: from .text between '__mthca_init_one' (at offset 0x69a) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_pd_table from .text between '__mthca_init_one' (at offset 0x77e) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_mr_table from .text between '__mthca_init_one' (at offset 0x7a2) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_eq_table from .text between '__mthca_init_one' (at offset 0x7ff) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_cq_table from .text between '__mthca_init_one' (at offset 0x88f) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_srq_table from .text between '__mthca_init_one' (at offset 0x8b3) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_qp_table from .text between '__mthca_init_one' (at offset 0x8d4) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_av_table from .text between '__mthca_init_one' (at offset 0x8f5) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .init.text:mthca_init_mcg_table from .text between '__mthca_init_one' (at offset 0x916) and '__mthca_restart_one'
WARNING: drivers/infiniband/hw/mthca/ib_mthca.o - Section mismatch: reference to .exit.text:mthca_free_agents from .text between '__mthca_remove_one' (at offset 0x179) and '__mthca_init_one'

---
~Randy

From krkumar2 at in.ibm.com Sun Nov 19 22:17:36 2006
From: krkumar2 at in.ibm.com (Krishna Kumar2)
Date: Mon, 20 Nov 2006 11:47:36 +0530
Subject: [openib-general] [RFC] [PATCH] RDMA/iwcm: Prevent deadlock in locking.
In-Reply-To:
Message-ID:

The "previous" patch fixes the lockdep warning problem, while this patch
(which I am not really sure can happen as I have not tested it), is
solving the real deadlock problem.

Oops, my script has a bug in the Subject, will fix it.

thanks,

- KK

Roland Dreier wrote on 11/17/2006 10:14:06 PM:

> Umm.. what's the point of the previous patch that changed spin_lock to
> spin_lock_nested? All it did was hide this bug.
>
> Also BTW your subject lines are off -- these are not RDMA/iwcm
> patches.
>
> - R.
From eitan at mellanox.co.il Mon Nov 20 04:24:36 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 20 Nov 2006 14:24:36 +0200
Subject: [openib-general] Scalable Monitoring - RFC
Message-ID: <6C2C79E72C305246B504CBA17B5500C95B32A5@mtlexch01.mtl.com>

Hi All,

Following the path forward requirements review by Matt L. in the last OFA
Dev Summit, I have started thinking about what would make a monitoring
system scale to tens of thousands of nodes.

This RFC provides both what I propose as the requirements list and a
draft implementation proposal - just to start the discussion.

I apologize for the long mail but I think this issue deserves a careful
design (been there done that...)

Scalable fabric monitoring requirements:
* scale up to 48k nodes * 16 ports, which gets to about 1,000,000 ports
  (16 ports per device is an average between 32 ports for a switch and 1
  for an HCA)
* provide alerts for ports crossing some rate of change
* support profiling of data flow through the fabric
* be able to handle changes in topology due to MTBF

Basic design considerations:
* a distributed data collection scheme is required. No single manager can
  go through all the ports in reasonable time.
* the system should push as much work as possible to the agents. Examples
  of that are rate calculation and counter resets.
* features like data compression or aggregation are important for
  reducing the data reported up-the-tree to the console or data storage.
  To support that the agents should be able to:
  1. Report an error counter increase only if it crosses a programmable
     threshold
  2. Aggregate data and packet counters. Provide upstream data only when
     the change is larger than a given threshold
  3. Create and provide a histogram of the rate of change for each counter
     (merging data for all ports it collects counters for)
* to support data profiling we will need to store xmit and recv data for
  each port, which boils down to a huge amount of data. Instead of rolling
  that data up-the-tree, each agent should be able to write its own file,
  with the files merged offline.
* handling topology changes is a challenge for a distributed set of
  agents. Two problems arise from topology dynamics:
  1. An agent responsible for querying some device might lose its
     connection to it. So dynamic responsibility assignment is required
     to support dynamic topology.
  2. Agents are arranged in a reporting hierarchy which by itself can be
     disconnected.

Implementation proposal:

This section describes a specific agent behavior. The actual
communication implementation can be based on IB verbs or on plain
sockets.

Definitions:
Responsibility Subnet: the set of nodes the agent is monitoring
Responsibility Subnet Perimeter: the nodes connected to the
  responsibility subnet which are in the responsibility of other agents
Peer Agents: the agents that are responsible for the nodes on the
  perimeter of the current agent
Master Agent: the agent the current agent should report to. This is
  configured on the command line.

Agents data structure:
1. List of node GUIDs it is handling
2. Last read values for all counters for each port it is handling
3. Change rate histogram for each port it is handling
4. List of other agents that are responsible for nodes on the perimeter
   of its responsibility network

Agents Communication:

A broadcast group is used to broadcast messages regarding agents' node
ownership. Point to point communication is also used as much as possible,
as described below. The messages involved are (I list data of the message
in <>):
1. Query: Who monitors ?
   Response: , , (the distance the agent is from the monitored node),
2. Trap: